本篇博文主要内容为 2025-05-20 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-05-20)

今日共更新1281篇论文,其中:

  • 自然语言处理254篇(Computation and Language (cs.CL))
  • 人工智能438篇(Artificial Intelligence (cs.AI))
  • 计算机视觉266篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习441篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] CIE: Controlling Language Model Text Generations Using Continuous Signals

【速读】: 该论文试图解决如何有效控制语言模型(Language Models, LMs)生成内容的属性问题,例如响应长度、语言复杂度、情感和语气等。现有方法通常依赖于自然语言提示或离散控制信号,但这些方法存在脆弱性和可扩展性差的问题。论文提出的解决方案的关键在于使用连续控制信号,通过在“低”和“高”标记嵌入之间进行插值生成向量,从而实现对语言模型行为的精确控制。实验表明,该方法在响应长度控制方面比基于上下文学习或离散信号微调的方法更为可靠。

链接: https://arxiv.org/abs/2505.13448
作者: Vinay Samuel,Harshita Diddee,Yiming Zhang,Daphne Ippolito
机构: University of Maryland, College Park(马里兰大学学院公园分校); Carnegie Mellon University(卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Aligning language models with user intent is becoming increasingly relevant to enhance user experience. This calls for designing methods that can allow users to control the properties of the language that LMs generate. For example, controlling the length of the generation, the complexity of the language that gets chosen, the sentiment, tone, etc. Most existing work attempts to integrate users’ control by conditioning LM generations on natural language prompts or discrete control signals, which are often brittle and hard to scale. In this work, we are interested in \textitcontinuous control signals, ones that exist along a spectrum that can’t easily be captured in a natural language prompt or via existing techniques in conditional generation. Through a case study in controlling the precise response-length of generations produced by LMs, we demonstrate how after fine-tuning, behaviors of language models can be controlled via continuous signals – as vectors that are interpolated between a “low” and a “high” token embedding. Our method more reliably exerts response-length control than in-context learning methods or fine-tuning methods that represent the control signal as a discrete signal. Our full open-sourced code and datasets are available at this https URL.
zh

[NLP-1] rust But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂推理任务中存在的“表面自我反思”问题,即模型无法稳健地验证自身输出的正确性。解决方案的关键在于提出一种名为RISE(Reinforcing Reasoning with Self-Verification)的在线强化学习框架,该框架通过在一个统一的强化学习过程中同时训练模型提升其问题求解能力和自我验证能力,利用可验证奖励机制为解决方案生成和自我验证任务提供实时反馈。

链接: https://arxiv.org/abs/2505.13445
作者: Xiaoyuan Liu,Tian Liang,Zhiwei He,Jiahao Xu,Wenxuan Wang,Pinjia He,Zhaopeng Tu,Haitao Mi,Dong Yu
机构: Tencent(腾讯); The Chinese University of Hong Kong (香港中文大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: code available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is ``superficial self-reflection’', where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism involves leveraging verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves model’s problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path towards developing more robust and self-aware reasoners.
zh

[NLP-2] ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

【速读】: 该论文旨在解决大型视觉-语言模型(LVLM)在图表理解任务中存在视觉推理能力不足的问题,即当前模型在处理需要复杂视觉推理的任务时表现不佳。其解决方案的关键在于构建一个名为ChartMuseum的新基准测试集,该基准包含1,162个由专家标注的问题,涵盖多种推理类型,并来源于184个真实世界的图表,专门用于评估复杂的视觉与文本推理能力。通过这一基准,研究揭示了模型与人类在图表理解上的显著性能差距,并有效区分了不同模型的能力。

链接: https://arxiv.org/abs/2505.13444
作者: Liyan Tang,Grace Kim,Xinyu Zhao,Thom Lake,Wenxuan Ding,Fangcong Yin,Prasann Singhal,Manya Wadhwa,Zeyu Leo Liu,Zayne Sprague,Ramya Namuduri,Bodun Hu,Juan Diego Rodriguez,Puyuan Peng,Greg Durrett
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks – where frontier models perform similarly and near saturation – our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
zh

[NLP-3] Optimizing Anytime Reasoning via Budget Relative Policy Optimization

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在推理能力提升过程中,由于固定token预算导致的训练和部署效率低下问题。现有方法通过强化学习(Reinforcement Learning, RL)优化最终性能,但无法灵活适应不同的token预算约束。其解决方案的关键在于提出一种名为AnytimeReasoner的新框架,该框架通过截断完整的思考过程并适配采样的token预算,引入可验证的密集奖励,从而实现更有效的信用分配,并通过解耦优化思考和摘要策略以及引入Budget Relative Policy Optimization (BRPO) 技术提升学习过程的鲁棒性和效率。

链接: https://arxiv.org/abs/2505.13438
作者: Penghui Qi,Zichen Liu,Tianyu Pang,Chao Du,Wee Sun Lee,Min Lin
机构: Sea AI Lab; National University of Singapore
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within sampled token budgets from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
zh

[NLP-4] SMOTExT: SMOTE meets Large Language Models

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)模型训练中数据稀缺和类别不平衡的问题,特别是在专业领域或低资源环境下的挑战。其解决方案的关键在于提出一种名为SMOTExT的新技术,该技术将合成少数类过采样技术(Synthetic Minority Over-sampling Technique, SMOTE)适配到文本数据,通过在基于BERT的嵌入向量之间进行插值,并利用xRAG架构将生成的潜在向量解码为连贯文本,从而生成新的合成样本。这种方法展示了在少样本设置下知识蒸馏和数据增强的潜力,同时在隐私保护机器学习方面也表现出一定的前景。

链接: https://arxiv.org/abs/2505.13434
作者: Mateusz Bystroński,Mikołaj Hołysz,Grzegorz Piotrowski,Nitesh V. Chawla,Tomasz Kajdanowicz
机构: Wrocław University of Science and Technology (弗罗茨瓦夫科技大学); University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Data scarcity and class imbalance are persistent challenges in training robust NLP models, especially in specialized domains or low-resource settings. We propose a novel technique, SMOTExT, that adapts the idea of Synthetic Minority Over-sampling (SMOTE) to textual data. Our method generates new synthetic examples by interpolating between BERT-based embeddings of two existing examples and then decoding the resulting latent point into text with xRAG architecture. By leveraging xRAG’s cross-modal retrieval-generation framework, we can effectively turn interpolated vectors into coherent text. While this is preliminary work supported by qualitative outputs only, the method shows strong potential for knowledge distillation and data augmentation in few-shot settings. Notably, our approach also shows promise for privacy-preserving machine learning: in early experiments, training models solely on generated data achieved comparable performance to models trained on the original dataset. This suggests a viable path toward safe and effective learning under data protection constraints.
zh

[NLP-5] Fine-tuning Quantized Neural Networks with Zeroth-order Optimization

【速读】: 该论文旨在解决大规模语言模型在微调过程中因GPU内存不足而导致的瓶颈问题,通过最小化模型权重、梯度和优化器状态的内存占用来实现更高效的训练。其解决方案的关键在于提出了一种名为量化零阶优化(Quantized Zeroth-order Optimization, QZO)的新方法,该方法通过扰动连续量化尺度进行梯度估计,并采用方向导数裁剪方法以稳定训练过程,从而避免了量化权重与连续梯度之间的精度差异问题。

链接: https://arxiv.org/abs/2505.13430
作者: Sifeng Shang,Jiayi Zhou,Chenyu Lin,Minxian Li,Kaiyang Zhou
机构: Hong Kong Baptist University (香港浸会大学); Nanjing University of Science and Technology (南京理工大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize memory usage on weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible due to the precision gap between discrete weights and continuous gradients, which would otherwise require de-quantization and re-quantization. To overcome this challenge, we propose Quantized Zeroth-order Optimization (QZO), a novel approach that perturbs the continuous quantization scale for gradient estimation and uses a directional derivative clipping method to stabilize training. QZO is orthogonal to both scalar-based and codebook-based post-training quantization methods. Compared to full-parameter fine-tuning in bfloat16, QZO can reduce the total memory cost by more than 18 \times for 4-bit LLMs, and enables fine-tuning Llama-2-13B and Stable Diffusion 3.5 Large within a single 24GB GPU.
zh

[NLP-6] Dementia Through Different Eyes: Explainable Modeling of Human and LLM Perceptions for Early Awareness

【速读】: 该论文试图解决非专业人士在语言层面识别认知衰退(dementia)能力有限的问题,以及如何利用生成式 AI (Generative AI) 提升对语言特征的感知能力。其解决方案的关键在于引入一种可解释的方法,通过 LLMs 提取由专家指导的高层次语言特征,并利用逻辑回归模型模拟人类和 LLM 的感知过程,进而与临床诊断进行对比分析。该方法揭示了人类感知的不一致性及依赖有限线索的局限性,同时展示了 LLM 在捕捉更丰富、更贴近临床模式的语言特征方面的优势。

链接: https://arxiv.org/abs/2505.13418
作者: Lotem Peled-Cohen,Maya Zadok,Nitay Calderon,Hila Gonen,Roi Reichart
机构: Technion (以色列理工学院); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cognitive decline often surfaces in language years before diagnosis. It is frequently non-experts, such as those closest to the patient, who first sense a change and raise concern. As LLMs become integrated into daily communication and used over prolonged periods, it may even be an LLM that notices something is off. But what exactly do they notice–and should be noticing–when making that judgment? This paper investigates how dementia is perceived through language by non-experts. We presented transcribed picture descriptions to non-expert humans and LLMs, asking them to intuitively judge whether each text was produced by someone healthy or with dementia. We introduce an explainable method that uses LLMs to extract high-level, expert-guided features representing these picture descriptions, and use logistic regression to model human and LLM perceptions and compare with clinical diagnoses. Our analysis reveals that human perception of dementia is inconsistent and relies on a narrow, and sometimes misleading, set of cues. LLMs, by contrast, draw on a richer, more nuanced feature set that aligns more closely with clinical patterns. Still, both groups show a tendency toward false negatives, frequently overlooking dementia cases. Through our interpretable framework and the insights it provides, we hope to help non-experts better recognize the linguistic signs that matter.
zh

[NLP-7] AdaptThink: Reasoning Models Can Learn When to Think

【速读】: 该论文试图解决大型推理模型在执行任务时因冗长的思考过程导致的推理效率低下问题(inference overhead)。其关键解决方案是提出一种名为AdaptThink的新型强化学习算法,该算法通过自适应选择是否进行深度思考(Thinking)或直接生成答案(NoThinking),以优化推理质量与效率之间的平衡。AdaptThink的核心在于两个组成部分:一是约束优化目标,鼓励模型在保持整体性能的前提下优先选择NoThinking;二是重要性采样策略,用于在策略训练中平衡Thinking和NoThinking样本,从而实现冷启动并促进两种思考模式的探索与利用。

链接: https://arxiv.org/abs/2505.13417
作者: Jiajie Zhang,Nianyi Lin,Lei Hou,Ling Feng,Juanzi Li
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at this https URL.
zh

[NLP-8] CoT-Kinetics: A Theoretical Modeling Assessing LRM Reasoning Process

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRM)在生成包含推理轨迹和答案的输出时,如何准确评估整体输出质量的问题。现有方法虽然考虑了推理部分与答案的联合评估,但未能有效反映推理过程对最终答案的因果关系。论文提出的解决方案关键在于引入一种基于经典力学的新型方法——思维链动力学能量方程(CoT-Kinetics energy equation),该方程将模型内部变换过程建模为类似粒子动力学的行为,通过赋予一个标量分数来专门评估推理阶段的合理性,从而更精确地衡量最终答案的可信度。

链接: https://arxiv.org/abs/2505.13408
作者: Jinhe Bi,Danqi Yan,Yifan Wang,Wenke Huang,Haokun Chen,Guancheng Wan,Mang Ye,Xun Xiao,Hinrich Schuetze,Volker Tresp,Yunpu Ma
机构: Munich Research Center, Huawei Technologies (慕尼黑华为技术研究中心); School of Computer Science, Wuhan University (武汉大学计算机学院); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent Large Reasoning Models significantly improve the reasoning ability of Large Language Models by learning to reason, exhibiting the promising performance in solving complex tasks. LRMs solve tasks that require complex reasoning by explicitly generating reasoning trajectories together with answers. Nevertheless, judging the quality of such an output answer is not easy because only considering the correctness of the answer is not enough and the soundness of the reasoning trajectory part matters as well. Logically, if the soundness of the reasoning part is poor, even if the answer is correct, the confidence of the derived answer should be low. Existing methods did consider jointly assessing the overall output answer by taking into account the reasoning part, however, their capability is still not satisfactory as the causal relationship of the reasoning to the concluded answer cannot properly reflected. In this paper, inspired by classical mechanics, we present a novel approach towards establishing a CoT-Kinetics energy equation. Specifically, our CoT-Kinetics energy equation formulates the token state transformation process, which is regulated by LRM internal transformer layers, as like a particle kinetics dynamics governed in a mechanical field. Our CoT-Kinetics energy assigns a scalar score to evaluate specifically the soundness of the reasoning phase, telling how confident the derived answer could be given the evaluated reasoning. As such, the LRM’s overall output quality can be accurately measured, rather than a coarse judgment (e.g., correct or incorrect) anymore.
zh

[NLP-9] Granary: Speech Recognition and Translation Dataset in 25 European Languages INTERSPEECH2025

【速读】: 该论文试图解决低资源语言语音处理中数据稀缺的问题,尽管多任务和多语言方法对大型模型有益,但这一领域仍缺乏深入研究。解决方案的关键在于构建一个大规模的语音数据集(Granary),涵盖25种欧洲语言,并通过伪标签管道进行数据质量增强,包括分割、两次推理、幻觉过滤和标点恢复,随后利用EuroLLM生成翻译对并经过数据筛选流程,从而高效地处理大量数据并提升模型性能。

链接: https://arxiv.org/abs/2505.13404
作者: Nithin Rao Koluguri,Monica Sekoyan,George Zelenfroynd,Sasha Meister,Shuoyang Ding,Sofia Kostandian,He Huang,Nikolay Karpov,Jagadeesh Balam,Vitaly Lavrukhin,Yifan Peng,Sara Papi,Marco Gaido,Alessio Brutti,Boris Ginsburg
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amount of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. Dataset will be made available at this https URL
zh

[NLP-10] MR. Judge: Multimodal Reason er as a Judge

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在作为评估判官时缺乏强推理能力和有效标注数据的问题。其解决方案的关键在于提出一种名为MR. Judge的范式,通过将评估过程建模为一种受推理启发的多项选择问题,使判官模型能够全面分析响应的不同方面并选出最佳答案,从而提升评估的可解释性和性能。此外,为应对缺乏带分数响应的问题,论文引入了逆向响应候选生成和基于文本的推理提取策略,以实现自动标注并增强MLLM判官的复杂推理能力。

链接: https://arxiv.org/abs/2505.13403
作者: Renjie Pi,Felix Bai,Qibin Chen,Simon Wang,Jiulong Shan,Kieran Liu,Meng Cao
机构: HKUST(香港科技大学); Apple(苹果)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The paradigm of using Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) as evaluative judges has emerged as an effective approach in RLHF and inference-time scaling. In this work, we propose Multimodal Reasoner as a Judge (MR. Judge), a paradigm for empowering general-purpose MLLMs judges with strong reasoning capabilities. Instead of directly assigning scores for each response, we formulate the judgement process as a reasoning-inspired multiple-choice problem. Specifically, the judge model first conducts deliberate reasoning covering different aspects of the responses and eventually selects the best response from them. This reasoning process not only improves the interpretibility of the judgement, but also greatly enhances the performance of MLLM judges. To cope with the lack of questions with scored responses, we propose the following strategy to achieve automatic annotation: 1) Reverse Response Candidates Synthesis: starting from a supervised fine-tuning (SFT) dataset, we treat the original response as the best candidate and prompt the MLLM to generate plausible but flawed negative candidates. 2) Text-based reasoning extraction: we carefully design a data synthesis pipeline for distilling the reasoning capability from a text-based reasoning model, which is adopted to enable the MLLM judges to regain complex reasoning ability via warm up supervised fine-tuning. Experiments demonstrate that our MR. Judge is effective across a wide range of tasks. Specifically, our MR. Judge-7B surpasses GPT-4o by 9.9% on VL-RewardBench, and improves performance on MM-Vet during inference-time scaling by up to 7.7%.
zh

[NLP-11] A Minimum Description Length Approach to Regularization in Neural Networks

【速读】: 该论文试图解决神经网络在训练过程中难以收敛到精确符号解的问题,而非仅仅得到近似解。研究表明,传统的正则化方法(如L₁、L₂或无正则化)在处理形式语言时会导致模型远离完美的初始解,从而无法获得准确的解决方案。论文提出的解决方案之关键在于引入最小描述长度(Minimum Description Length, MDL)原则,通过平衡模型复杂度与数据拟合程度,提供一种理论上有依据的正则化方法,使精确解优于近似解,且不依赖于具体的优化算法。

链接: https://arxiv.org/abs/2505.13398
作者: Matan Abudy,Orr Well,Emmanuel Chemla,Roni Katzir,Nur Lan
机构: Tel Aviv University (特拉维夫大学); École Normale Supérieure (法国高等师范学校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:State-of-the-art neural networks can be trained to become remarkable solutions to many problems. But while these architectures can express symbolic, perfect solutions, trained models often arrive at approximations instead. We show that the choice of regularization method plays a crucial role: when trained on formal languages with standard regularization ( L_1 , L_2 , or none), expressive architectures not only fail to converge to correct solutions but are actively pushed away from perfect initializations. In contrast, applying the Minimum Description Length (MDL) principle to balance model complexity with data fit provides a theoretically grounded regularization method. Using MDL, perfect solutions are selected over approximations, independently of the optimization algorithm. We propose that unlike existing regularization techniques, MDL introduces the appropriate inductive bias to effectively counteract overfitting and promote generalization.
zh

[NLP-12] IG Parser: A Software Package for the Encoding of Institutional Statements using the Institutional Grammar

【速读】: 该论文旨在解决对制度性系统(institutions)进行定性内容分析的挑战,特别是针对正式(如法律规则)和非正式(如社会规范)规范及策略(如惯例)的复杂性。解决方案的关键在于IG Parser软件所采用的独特语法——IG Script,该语法基于制度语法(Institutional Grammar)及其最新版本Institutional Grammar 2.0,实现了自然语言的严格编码,并支持多种下游分析技术的格式转换,从而提升了制度系统分析的效率与准确性。

链接: https://arxiv.org/abs/2505.13393
作者: Christopher K. Frantz
机构: Norwegian University of Science and Technology (挪威科技大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages

点击查看摘要

Abstract:This article provides an overview of IG Parser, a software that facilitates qualitative content analysis of formal (e.g., legal) rules or informal (e.g., socio-normative) norms, and strategies (such as conventions) – referred to as \emphinstitutions – that govern social systems and operate configurally to describe \emphinstitutional systems. To this end, the IG Parser employs a distinctive syntax that ensures rigorous encoding of natural language, while automating the transformation into various formats that support the downstream analysis using diverse analytical techniques. The conceptual core of the IG Parser is an associated syntax, IG Script, that operationalizes the conceptual foundations of the Institutional Grammar, and more specifically Institutional Grammar 2.0, an analytical paradigm for institutional analysis. This article presents the IG Parser, including its conceptual foundations, syntactic specification of IG Script, alongside architectural principles. This introduction is augmented with selective illustrative examples that highlight the use and benefit associated with the tool.
zh

[NLP-13] R3: Robust Rubric-Agnostic Reward Models

【速读】: 该论文旨在解决现有奖励模型在可控性和可解释性方面的不足,以及其在更广泛下游任务中的泛化能力有限的问题。传统奖励模型通常针对特定目标进行优化,导致其在面对多样化任务时表现受限,且其标量输出缺乏上下文推理支持,难以解释。论文提出的解决方案是R3框架,其关键在于实现与评价维度无关的通用性,并提供可解释且基于推理的评分分配,从而增强语言模型评估的透明度和灵活性,支持与多种人类价值观和应用场景的对齐。

链接: https://arxiv.org/abs/2505.13388
作者: David Anugraha,Zilu Tang,Lester James V. Miranda,Hanyang Zhao,Mohammad Rifqi Farhansyah,Garry Kuwanto,Derry Wijaya,Genta Indra Winata
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at this https URL
zh

[NLP-14] CompeteSMoE – Statistically Guaranteed Mixture of Experts Training via Competition

【速读】: 该论文旨在解决稀疏专家混合模型(Sparse Mixture of Experts, SMoE)在训练过程中由于路由过程不优化而导致的效率问题,即专家执行计算却无法直接参与路由决策,从而影响整体性能。其解决方案的关键在于提出一种名为“竞争”(Competition)的新机制,通过让令牌路由到具有最高神经响应的专家,提升路由效率。理论分析表明,该机制相比传统softmax路由具有更高的样本效率,而基于此机制的CompeteSMoE算法则通过部署路由网络学习竞争策略,在保持低训练开销的同时实现了强大的性能表现。

链接: https://arxiv.org/abs/2505.13380
作者: Nam V. Nguyen,Huy Nguyen,Quang Pham,Van Nguyen,Savitha Ramasamy,Nhat Ho
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 52 pages. This work is an improved version of the previous study at arXiv:2402.02526

点击查看摘要

Abstract:Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the mean of increasing the network’s depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process where experts that perform computation do not directly contribute to the routing process. In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys a better sample efficiency than the traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy, thus enjoying strong performances at a low training overhead. Our extensive empirical evaluations on both the visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: this https URL. This work is an improved version of the previous study at arXiv:2402.02526
zh

[NLP-15] hinkless: LLM Learns When to Think

【速读】: 该论文试图解决在需要复杂逻辑推理的任务中,传统推理语言模型(Reasoning Language Models)因对所有查询都进行详尽的链式推理而导致的计算效率低下问题。解决方案的关键在于提出一种名为Thinkless的可学习框架,该框架使模型能够根据任务复杂度和自身能力自适应地选择简短回答或详细推理模式,其核心是基于强化学习的解耦组相对策略优化(DeGRPO)算法,通过分离控制标记损失和响应损失,实现对混合推理目标的细粒度控制,从而提升模型效率并稳定训练过程。

链接: https://arxiv.org/abs/2505.13379
作者: Gongfan Fang,Xinyin Ma,Xinchao Wang
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning for all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model’s ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, short for concise responses and think for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless is able to reduce the usage of long-chain thinking by 50% - 90%, significantly improving the efficiency of Reasoning Language Models. The code is available at this https URL
zh

[NLP-16] What Prompts Dont Say: Understanding and Managing Underspecification in LLM Prompts

【速读】: 该论文试图解决在构建基于大语言模型(Large Language Model, LLM)的软件时,开发者通过自然语言表达的需求往往存在不明确(underspecification)的问题,导致模型无法准确理解用户的重要需求。研究发现,尽管LLM在默认情况下可以猜测41.1%的未明确需求,但这种行为不够稳健,未明确提示在模型或提示变化时更容易出现性能退化,有时准确率下降超过20%。解决方案的关键在于引入新型的面向需求的提示优化机制,该机制能够平均提升基线方法4.8%的性能,而非简单地在提示中添加更多需求。此外,作者还提出需要更广泛的流程来管理提示的不明确性,包括主动的需求发现、评估和监控。

链接: https://arxiv.org/abs/2505.13360
作者: Chenyang Yang,Yike Shi,Qianou Ma,Michael Xieyang Liu,Christian Kästner,Tongshuang Wu
机构: Carnegie Mellon University (卡内基梅隆大学); Google Deepmind (谷歌深度思维)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Building LLM-powered software requires developers to communicate their requirements through natural language, but developer prompts are frequently underspecified, failing to fully capture many user-important requirements. In this paper, we present an in-depth analysis of prompt underspecification, showing that while LLMs can often (41.1%) guess unspecified requirements by default, such behavior is less robust: Underspecified prompts are 2x more likely to regress over model or prompt changes, sometimes with accuracy drops by more than 20%. We then demonstrate that simply adding more requirements to a prompt does not reliably improve performance, due to LLMs’ limited instruction-following capabilities and competing constraints, and standard prompt optimizers do not offer much help. To address this, we introduce novel requirements-aware prompt optimization mechanisms that can improve performance by 4.8% on average over baselines that naively specify everything in the prompt. Beyond prompt optimization, we envision that effectively managing prompt underspecification requires a broader process, including proactive requirements discovery, evaluation, and monitoring.
zh

[NLP-17] Sense and Sensitivity: Examining the Influence of Semantic Recall on Long Context Code Reasoning

【速读】: 该论文试图解决现代大型语言模型(Large Language Models, LLMs)在利用长上下文进行代码推理时的有效性问题,特别是其回忆能力与代码推理之间的关系。论文的关键解决方案是提出一种名为SemTrace的代码推理技术,用于衡量语义回忆(semantic code recall)的敏感性,并通过该技术揭示了模型在处理需要高语义回忆的任务时表现显著下降的现象,同时区分了词法回忆(lexical code recall)与语义回忆的不同机制。

链接: https://arxiv.org/abs/2505.13353
作者: Adam Štorek,Mukur Gupta,Samira Hajizadeh,Prashast Srivastava,Suman Jana
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Although modern Large Language Models (LLMs) support extremely large contexts, their effectiveness in utilizing long context for code reasoning remains unclear. This paper investigates LLM reasoning ability over code snippets within large repositories and how it relates to their recall ability. Specifically, we differentiate between lexical code recall (verbatim retrieval) and semantic code recall (remembering what the code does). To measure semantic recall, we propose SemTrace, a code reasoning technique where the impact of specific statements on output is attributable and unpredictable. We also present a method to quantify semantic recall sensitivity in existing benchmarks. Our evaluation of state-of-the-art LLMs reveals a significant drop in code reasoning accuracy as a code snippet approaches the middle of the input context, particularly with techniques requiring high semantic recall like SemTrace. Moreover, we find that lexical recall varies by granularity, with models excelling at function retrieval but struggling with line-by-line recall. Notably, a disconnect exists between lexical and semantic recall, suggesting different underlying mechanisms. Finally, our findings indicate that current code reasoning benchmarks may exhibit low semantic recall sensitivity, potentially underestimating LLM challenges in leveraging in-context information.
zh

[NLP-18] Investigating the Vulnerability of LLM -as-a-Judge Architectures to Prompt-Injection Attacks

【速读】: 该论文旨在解决生成式 AI(Generative AI)在作为语言模型评估者(LLM-as-a-Judge)时所面临的可靠性和安全性问题,特别是其对对抗性攻击的脆弱性。研究的核心在于揭示LLM-as-a-Judge架构在面对提示注入攻击(prompt-injection attacks)时的漏洞,并提出相应的攻击策略以验证系统的不稳定性。解决方案的关键在于通过Greedy Coordinate Gradient(GCG)优化方法生成对抗性后缀,进而影响模型的判断过程,具体表现为两种攻击策略: Comparative Undermining Attack(CUA)和Justification Manipulation Attack(JMA),分别针对最终决策输出和模型生成的推理过程。实验结果表明,当前LLM-as-a-Judge系统存在显著的安全隐患,亟需加强防御机制与对抗性评估的研究。

链接: https://arxiv.org/abs/2505.13348
作者: Narek Maloyan,Bislan Ashinov,Dmitry Namiot
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly employed as evaluators (LLM-as-a-Judge) for assessing the quality of machine-generated text. This paradigm offers scalability and cost-effectiveness compared to human annotation. However, the reliability and security of such systems, particularly their robustness against adversarial manipulations, remain critical concerns. This paper investigates the vulnerability of LLM-as-a-Judge architectures to prompt-injection attacks, where malicious inputs are designed to compromise the judge’s decision-making process. We formalize two primary attack strategies: Comparative Undermining Attack (CUA), which directly targets the final decision output, and Justification Manipulation Attack (JMA), which aims to alter the model’s generated reasoning. Using the Greedy Coordinate Gradient (GCG) optimization method, we craft adversarial suffixes appended to one of the responses being compared. Experiments conducted on the MT-Bench Human Judgments dataset with open-source instruction-tuned LLMs (Qwen2.5-3B-Instruct and Falcon3-3B-Instruct) demonstrate significant susceptibility. The CUA achieves an Attack Success Rate (ASR) exceeding 30%, while JMA also shows notable effectiveness. These findings highlight substantial vulnerabilities in current LLM-as-a-Judge systems, underscoring the need for robust defense mechanisms and further research into adversarial evaluation and trustworthiness in LLM-based assessment frameworks.
zh

[NLP-19] J4R: Learning to Judge with Equivalent Initial State Group Relative Preference Optimization DATE

【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 评估中,尤其是涉及复杂推理任务时,LLM-as-judge 模型表现不足的问题。现有评估方法在面对需要深度推理的模型输出时存在局限性,难以准确判断其质量与合理性。解决方案的关键在于通过强化学习(Reinforcement Learning, RL)训练评估模型,具体提出了 Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) 算法,以提升评估模型对位置偏差的鲁棒性,并构建了 ReasoningJudgeBench 基准测试平台,用于全面评估模型在多样化推理场景中的表现。通过该方法训练的 J4R 模型在多个基准测试中表现出色,优于现有主流模型。

链接: https://arxiv.org/abs/2505.13346
作者: Austin Xu,Yilun Zhou,Xuan-Phi Nguyen,Caiming Xiong,Shafiq Joty
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 4 figures, 6 tables. To be updated with links for code/benchmark

点击查看摘要

Abstract:To keep pace with the increasing pace of large language models (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.
zh

[NLP-20] Contextual Paralinguistic Data Creation for Multi-Modal Speech-LLM : Data Condensation and Spoken QA Generation INTERSPEECH2025

【速读】: 该论文试图解决当前语音大语言模型(speech-LLMs)在上下文推理与副语言理解方面能力有限的问题,主要原因是缺乏同时涵盖这两个方面的问答(QA)数据集。其解决方案的关键在于提出一种新颖的框架,通过从真实场景中的语音数据中生成数据集,该框架结合了基于伪副语言标签的数据浓缩技术和基于大语言模型(LLM)的上下文副语言问答(Contextual Paralinguistic QA, CPQA)生成方法。

链接: https://arxiv.org/abs/2505.13338
作者: Qiongqiong Wang,Hardik B. Sailor,Tianchi Liu,Ai Ti Aw
机构: Institute for Infocomm Research (I2R)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:Current speech-LLMs exhibit limited capability in contextual reasoning alongside paralinguistic understanding, primarily due to the lack of Question-Answer (QA) datasets that cover both aspects. We propose a novel framework for dataset generation from in-the-wild speech data, that integrates contextual reasoning with paralinguistic information. It consists of a pseudo paralinguistic label-based data condensation of in-the-wild speech and LLM-based Contextual Paralinguistic QA (CPQA) generation. The effectiveness is validated by a strong correlation in evaluations of the Qwen2-Audio-7B-Instruct model on a dataset created by our framework and human-generated CPQA dataset. The results also reveal the speech-LLM’s limitations in handling empathetic reasoning tasks, highlighting the need for such datasets and more robust models. The proposed framework is first of its kind and has potential in training more robust speech-LLMs with paralinguistic reasoning capabilities.
zh

[NLP-21] Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges

【速读】: 该论文试图解决现有评估基准在衡量语言模型(Language Models, LMs)作为语言代理(Language Agents, LAs)进行工具使用时,主要关注无状态、单轮交互或部分评估,而忽视了多轮应用中交互的固有状态特性问题。解决方案的关键在于提出\textttDialogTool,这是一个考虑工具使用全生命周期的多轮对话数据集,涵盖三个阶段的六个关键任务;同时构建\textttVirtualMobile——一个具身化的虚拟移动评估环境,用于模拟API调用并评估所创建API的鲁棒性。通过这些工具,论文对13个不同的开源和闭源大语言模型进行了全面评估,并揭示了当前最先进的LLMs在长期工具使用任务中表现仍不理想。

链接: https://arxiv.org/abs/2505.13328
作者: Hongru Wang,Wenyu Huang,Yufei Wang,Yuanhao Xi,Jianqiao Lu,Huan Zhang,Nan Hu,Zeming Liu,Jeff Z. Pan,Kam-Fai Wong
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fulfill this gap, we propose \textttDialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) \textittool creation; 2) \textittool utilization: tool awareness, tool selection, tool execution; and 3) \textitrole-consistent response: response generation and role play. Furthermore, we build \textttVirtualMobile – an embodied virtual mobile evaluation environment to simulate API calls and assess the robustness of the created APIs\footnoteWe will use tools and APIs alternatively, there are no significant differences between them in this paper… Taking advantage of these artifacts, we conduct comprehensive evaluation on 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that the existing state-of-the-art LLMs still cannot perform well to use tools over long horizons.
zh

[NLP-22] GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在部署后如何安全地选择性遗忘特定知识的问题,以确保模型的安全性和合规性。现有方法通过微调模型来实现遗忘,但这种方式会模糊遗忘与保留知识之间的决策边界,从而影响模型的整体性能。该论文提出的解决方案的关键在于在生成阶段(Generation-time)实现动态遗忘,即通过自适应限制和检测(GUARD)框架,在不破坏文本生成流畅性的前提下,利用提示分类器检测遗忘目标并动态惩罚和过滤候选标记,从而有效防止模型泄露被遗忘的内容。

链接: https://arxiv.org/abs/2505.13312
作者: Zhijie Deng,Chris Yuhao Liu,Zirui Pang,Xinlei He,Lei Feng,Qi Xuan,Zhaowei Zhu,Jiaheng Wei
机构: The Hong Kong University of Science and Technology (Guangzhou); UC Santa Cruz; UIUC; Southeast University; BIAI, ZJUT; D5Data.ai
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in memorizing vast amounts of knowledge across diverse domains. However, the ability to selectively forget specific knowledge is critical for ensuring the safety and compliance of deployed models. Existing unlearning efforts typically fine-tune the model with resources such as forget data, retain data, and a calibration model. These additional gradient steps blur the decision boundary between forget and retain knowledge, making unlearning often at the expense of overall performance. To avoid the negative impact of fine-tuning, it would be better to unlearn solely at inference time by safely guarding the model against generating responses related to the forget target, without destroying the fluency of text generation. In this work, we propose Generation-time Unlearning via Adaptive Restriction and Detection (GUARD), a framework that enables dynamic unlearning during LLM generation. Specifically, we first employ a prompt classifier to detect unlearning targets and extract the corresponding forbidden token. We then dynamically penalize and filter candidate tokens during generation using a combination of token matching and semantic matching, effectively preventing the model from leaking the forgotten content. Experimental results on copyright content unlearning tasks over the Harry Potter dataset and the MUSE benchmark, as well as entity unlearning tasks on the TOFU dataset, demonstrate that GUARD achieves strong forget quality across various tasks while causing almost no degradation to the LLM’s general capabilities, striking an excellent trade-off between forgetting and utility.
zh

[NLP-23] Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理能力上的不足,特别是在训练算法如灾难性遗忘以及新颖训练数据有限的情况下,如何提升模型的推理性能。其解决方案的关键在于提出一种名为LatentSeek的新框架,该框架通过在模型的潜在空间(latent space)中进行测试时实例级适应(Test-Time Instance-level Adaptation, TTIA),利用策略梯度方法迭代更新潜在表示,并由自生成的奖励信号引导,从而实现更有效的推理和对测试时扩展定律(test-time scaling law)的更好遵循。

链接: https://arxiv.org/abs/2505.13308
作者: Hengli Li,Chenxi Li,Tong Wu,Xuekai Zhu,Yuxuan Wang,Zhaoxin Yu,Eric Hanchen Jiang,Song-Chun Zhu,Zixia Jia,Ying Nian Wu,Zilong Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model’s latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
zh

[NLP-24] RBF: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning

【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)在实际应用中的两个主要挑战:一是缺乏量化指标和可操作的指导原则来评估和优化可测量的CoT能力边界,二是缺乏对不可测量的CoT能力边界(如多模态感知)的评估方法。其解决方案的关键在于提出改进的推理边界框架(Reasoning Boundary Framework++, RBF++),通过定义推理边界(Reasoning Boundary, RB)作为CoT性能的最大限制,并引入RB组合定律以实现跨任务的定量分析与优化指导;同时,针对不可测量的RB,提出常数假设和推理边界划分机制,将不可测量的RB分解为两个子边界,从而促进对领域知识和多模态感知能力的量化与优化。

链接: https://arxiv.org/abs/2505.13307
作者: Qiguang Chen,Libo Qin,Jinhao Liu,Yue Liao,Jiaqi Wang,Jingxuan Zhou,Wanxiang Che
机构: Harbin Institute of Technology(哈尔滨工业大学); Central South University(中南大学); Chinese University of Hong Kong(香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Manuscript

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at this https URL.
zh

[NLP-25] Ill believe it when I see it: Images increase misinformation sharing in Vision-Language Models

【速读】: 该论文试图解决视觉内容对多模态语言模型(Multimodal Language Models, MLMs)在新闻再分享行为中的影响问题,以及不同模型家族和用户人格特征如何调节这一行为。其解决方案的关键在于提出一种受“越狱”启发的提示策略,以模拟具有反社会特质和政治倾向的用户,从而诱发模型的再分享决策,并构建了一个包含经过事实核查的政治新闻的多模态数据集,配以对应图像和真实性的地面真实标签,以支持对模型行为的系统分析。

链接: https://arxiv.org/abs/2505.13302
作者: Alice Plebe,Timothy Douglas,Diana Riazi,R. Maria del Rio-Chanona
机构: University College London (伦敦大学学院); Centre for Artificial Intelligence (人工智能中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are increasingly integrated into news recommendation systems, raising concerns about their role in spreading misinformation. In humans, visual content is known to boost credibility and shareability of information, yet its effect on vision-language models (VLMs) remains unclear. We present the first study examining how images influence VLMs’ propensity to reshare news content, whether this effect varies across model families, and how persona conditioning and content attributes modulate this behavior. To support this analysis, we introduce two methodological contributions: a jailbreaking-inspired prompting strategy that elicits resharing decisions from VLMs while simulating users with antisocial traits and political alignments; and a multimodal dataset of fact-checked political news from PolitiFact, paired with corresponding images and ground-truth veracity labels. Experiments across model families reveal that image presence increases resharing rates by 4.8% for true news and 15.0% for false news. Persona conditioning further modulates this effect: Dark Triad traits amplify resharing of false news, whereas Republican-aligned profiles exhibit reduced veracity sensitivity. Of all the tested models, only Claude-3-Haiku demonstrates robustness to visual misinformation. These findings highlight emerging risks in multimodal model behavior and motivate the development of tailored evaluation frameworks and mitigation strategies for personalized AI systems. Code and dataset are available at: this https URL
zh

[NLP-26] textitRank Chunk and Expand: Lineage-Oriented Reasoning for Taxonomy Expansion

【速读】: 该论文旨在解决知识图谱中分类体系(Taxonomy)扩展的问题,特别是在数据规模不断增长的情况下,现有方法在表示能力、泛化性以及处理噪声和上下文限制方面存在显著挑战。其解决方案的关键在于提出一种名为LORex(Lineage-Oriented Reasoning for Taxonomy Expansion)的即插即用框架,该框架结合了判别式排序与生成式推理,通过将候选术语分批排序和分块处理,过滤噪声并迭代优化选择,从而确保上下文效率,最终在多个基准测试中实现了比现有方法更高的准确率和Wu Palmer相似度。

链接: https://arxiv.org/abs/2505.13282
作者: Sahil Mishra,Kumar Arjun,Tanmoy Chakraborty
机构: IIT Delhi (印度理工学院德里分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Taxonomies are hierarchical knowledge graphs crucial for recommendation systems, and web applications. As data grows, expanding taxonomies is essential, but existing methods face key challenges: (1) discriminative models struggle with representation limits and generalization, while (2) generative methods either process all candidates at once, introducing noise and exceeding context limits, or discard relevant entities by selecting noisy candidates. We propose LORex ( \textbfL ineage- \textbfO riented \textbfRe asoning for Taxonomy E \textbfx pansion), a plug-and-play framework that combines discriminative ranking and generative reasoning for efficient taxonomy expansion. Unlike prior methods, LORex ranks and chunks candidate terms into batches, filtering noise and iteratively refining selections by reasoning candidates’ hierarchy to ensure contextual efficiency. Extensive experiments across four benchmarks and twelve baselines show that LORex improves accuracy by 12% and Wu Palmer similarity by 5% over state-of-the-art methods.
zh

[NLP-27] CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning

【速读】: 该论文旨在解决大型语言模型在将自然语言问题转化为SQL查询时的准确性问题,尤其是现有方法如Self-Consistency和Self-Correction各自存在的局限性。Self-Consistency可能选择次优输出,而Self-Correction通常仅处理语法错误。解决方案的关键在于提出CSC-SQL,该方法结合了Self-Consistency和Self-Correction的优势,通过并行采样选择最频繁出现的两个输出,并将其输入合并修订模型进行修正,同时采用Group Relative Policy Optimization (GRPO) 算法通过强化学习对生成和修订模型进行微调,从而显著提升输出质量。

链接: https://arxiv.org/abs/2505.13271
作者: Lei Sheng,Shuai-Shuai Xu
机构: Wuhan University of Technology (武汉理工大学); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD development set, our 3B model achieves 65.28% execution accuracy, while the 7B model achieves 69.19%. The code will be open sourced at this https URL.
zh

[NLP-28] Representation of perceived prosodic similarity of conversational feedback INTERSPEECH2025

【速读】: 该论文试图解决语音反馈(vocal feedback)在相同词汇形式下,其语调相似性如何被感知以及现有语音表示是否能够反映这种相似性的问题。解决方案的关键在于通过三元比较任务测量不同数据集中的语音反馈的感知相似性,并利用对比学习进一步压缩和对齐语音表示以匹配人类感知,结果显示谱特征和自监督语音表示在捕捉语调信息方面优于提取的基频特征,尤其是在同一说话者的情况下。

链接: https://arxiv.org/abs/2505.13268
作者: Livia Qian,Carol Figueroa,Gabriel Skantze
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Interspeech 2025

点击查看摘要

Abstract:Vocal feedback (e.g., mhm', yeah’, `okay’) is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
zh

[NLP-29] From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery

【速读】: 该论文试图解决如何系统性地理解和引导大型语言模型(Large Language Models, LLMs)在科学发现中的角色演变与能力提升问题,其核心在于构建一个能够反映LLMs在科研流程中自主性与责任逐步增强的分类框架。解决方案的关键在于提出一个基于科学方法论的三层次分类体系——工具(Tool)、分析者(Analyst)和科学家(Scientist),以此界定LLMs在研究生命周期中的不同自主程度和职责范围,并探讨其在机器人自动化、自我改进和伦理治理等方面面临的挑战与未来研究方向。

链接: https://arxiv.org/abs/2505.13259
作者: Tianshi Zheng,Zheye Deng,Hong Ting Tsang,Weiqi Wang,Jiaxin Bai,Zihao Wang,Yangqiu Song
机构: HKUST(香港科技大学)
类目: Computation and Language (cs.CL)
备注: 16 pages

点击查看摘要

Abstract:Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy-Tool, Analyst, and Scientist-to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement. Github Repository: this https URL.
zh

[NLP-30] Effective and Transparent RAG : Adaptive-Reward Reinforcement Learning for Decision Traceability

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在知识密集型领域中仍存在的两个核心问题:有效性不足与透明度缺失。现有研究主要关注提升检索器的性能,但缺乏对生成器(大型语言模型)如何有效利用检索信息进行推理和生成的深入探讨;同时,大多数RAG方法未能明确哪些检索内容对推理过程产生影响,导致结果缺乏可解释性。为解决这些问题,论文提出了一种基于强化学习(Reinforcement Learning, RL)的透明RAG生成框架ARENA(Adaptive-Rewarded Evidence Navigation Agent),其关键在于通过结构化生成与自适应奖励计算实现模型对关键证据的识别、结构化推理以及生成带有可解释决策轨迹的答案。

链接: https://arxiv.org/abs/2505.13258
作者: Jingyi Ren,Yekun Xu,Xiaolong Wang,Weitao Li,Weizhi Ma,Yang Liu
机构: Tsinghua University (清华大学); Institute for AI Industry Research (AIR), Tsinghua University (清华大学人工智能产业研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has significantly improved the performance of large language models (LLMs) on knowledge-intensive domains. However, although RAG achieved successes across distinct domains, there are still some unsolved challenges: 1) Effectiveness. Existing research mainly focuses on developing more powerful RAG retrievers, but how to enhance the generator’s (LLM’s) ability to utilize the retrieved information for reasoning and generation? 2) Transparency. Most RAG methods ignore which retrieved content actually contributes to the reasoning process, resulting in a lack of interpretability and visibility. To address this, we propose ARENA (Adaptive-Rewarded Evidence Navigation Agent), a transparent RAG generator framework trained via reinforcement learning (RL) with our proposed rewards. Based on the structured generation and adaptive reward calculation, our RL-based training enables the model to identify key evidence, perform structured reasoning, and generate answers with interpretable decision traces. Applied to Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct, abundant experiments with various RAG baselines demonstrate that our model achieves 10-30% improvements on all multi-hop QA datasets, which is comparable with the SOTA Commercially-developed LLMs (e.g., OpenAI-o1, DeepSeek-R1). Further analyses show that ARENA has strong flexibility to be adopted on new datasets without extra training. Our models and codes are publicly released.
zh

[NLP-31] WikiPersonas: What Can We Learn From Personalized Alignment to Famous People?

【速读】: 该论文试图解决个性化偏好对齐(personalized preference alignment)中缺乏细粒度个体层面偏好数据集的问题。现有方法多关注平均化的人类偏好,忽略了个体间的多样性和矛盾性,而个性化对齐需要更细致的数据支持。解决方案的关键在于引入WikiPersona数据集,该数据集通过使用有详细记录的知名人物进行细粒度个性化建模,挑战模型在可解释过程中生成与特定人物背景和偏好相一致的可验证文本描述,从而实现更有效的个性化对齐。

链接: https://arxiv.org/abs/2505.13257
作者: Zilu Tang,Afra Feyza Akyürek,Ekin Akyürek,Derry Wijaya
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, preprint

点击查看摘要

Abstract:Preference alignment has become a standard pipeline in finetuning models to follow \emphgeneric human preferences. Majority of work seeks to optimize model to produce responses that would be preferable \emphon average, simplifying the diverse and often \emphcontradicting space of human preferences. While research has increasingly focused on personalized alignment: adapting models to individual user preferences, there is a lack of personalized preference dataset which focus on nuanced individual-level preferences. To address this, we introduce WikiPersona: the first fine-grained personalization using well-documented, famous individuals. Our dataset challenges models to align with these personas through an interpretable process: generating verifiable textual descriptions of a persona’s background and preferences in addition to alignment. We systematically evaluate different personalization approaches and find that as few-shot prompting with preferences and fine-tuning fail to simultaneously ensure effectiveness and efficiency, using \textitinferred personal preferences as prefixes enables effective personalization, especially in topics where preferences clash while leading to more equitable generalization across unseen personas.
zh

[NLP-32] HeteroSpec: Leverag ing Contextual Heterogeneity for Efficient Speculative Decoding

【速读】: 该论文旨在解决自回归解码(autoregressive decoding)在大型语言模型(Large Language Model, LLM)推理中的效率瓶颈问题,该问题源于其顺序性导致的计算资源利用率低下。论文提出的解决方案——HeteroSpec,关键在于通过动态优化计算资源分配来适应语言上下文的复杂性差异。HeteroSpec引入了两个核心机制:一是基于累积元路径Top-K熵的度量方法,用于高效识别可预测的上下文;二是基于数据驱动熵划分的动态资源分配策略,实现针对局部上下文难度的自适应推测扩展与剪枝。

链接: https://arxiv.org/abs/2505.13254
作者: Siran Liu,Yang Ye,Qianchao Zhu,Zheng Cao,Yongchao He
机构: Peking University (北京大学); SCITIX (SGP) TECH PTE. LTD. (SCITIX (SGP) 科技有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive decoding, the standard approach for Large Language Model (LLM) inference, remains a significant bottleneck due to its sequential nature. While speculative decoding algorithms mitigate this inefficiency through parallel verification, they fail to exploit the inherent heterogeneity in linguistic complexity, a key factor leading to suboptimal resource allocation. We address this by proposing HeteroSpec, a heterogeneity-adaptive speculative decoding framework that dynamically optimizes computational resource allocation based on linguistic context complexity. HeteroSpec introduces two key mechanisms: (1) A novel cumulative meta-path Top- K entropy metric for efficiently identifying predictable contexts. (2) A dynamic resource allocation strategy based on data-driven entropy partitioning, enabling adaptive speculative expansion and pruning tailored to local context difficulty. Evaluated on five public benchmarks and four models, HeteroSpec achieves an average speedup of 4.26 \times . It consistently outperforms state-of-the-art EAGLE-3 across speedup rates, average acceptance length, and verification cost. Notably, HeteroSpec requires no draft model retraining, incurs minimal overhead, and is orthogonal to other acceleration techniques. It demonstrates enhanced acceleration with stronger draft models, establishing a new paradigm for context-aware LLM inference acceleration.
zh

[NLP-33] Natural Language Planning via Coding and Inference Scaling

【速读】: 该论文试图解决复杂文本规划任务(如会议安排)中大型语言模型(Large Language Models, LLMs)面临的挑战,尤其是在任务复杂度较高时的表现问题。其解决方案的关键在于系统性评估闭源与开源模型在生成可执行程序(包括标准Python代码和约束满足问题求解器代码)中的表现,并通过扩展输出长度以适应任务复杂度。研究发现,尽管任务具有算法性质,但编程方法在多数情况下优于传统规划方法,但生成代码在鲁棒性和效率方面仍存在不足,限制了其泛化能力。

链接: https://arxiv.org/abs/2505.13252
作者: Rikhil Amonkar,Ronan Le Bras,Li Zhang
机构: Drexel University (德雷塞尔大学); Allen Institute for Artificial Intelligence (艾伦人工智能研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Real-life textual planning tasks such as meeting scheduling have posed much challenge to LLMs especially when the complexity is high. While previous work primarily studied auto-regressive generation of plans with closed-source models, we systematically evaluate both closed- and open-source models, including those that scales output length with complexity during inference, in generating programs, which are executed to output the plan. We consider not only standard Python code, but also the code to a constraint satisfaction problem solver. Despite the algorithmic nature of the task, we show that programming often but not always outperforms planning. Our detailed error analysis also indicates a lack of robustness and efficiency in the generated code that hinders generalization.
zh

[NLP-34] Stronger Together: Unleashing the Social Impact of Hate Speech Research

【速读】: 该论文试图解决数字空间中反社会行为带来的新兴风险和危害,特别是仇恨言论、虚假信息和错误信息的传播问题。解决方案的关键在于将语言学和自然语言处理(Natural Language Processing, NLP)研究与社会方法相结合,通过跨学科合作,联合社区、倡导者、活动家和政策制定者,推动更具包容性的数字包容性并缩小数字鸿沟。

链接: https://arxiv.org/abs/2505.13251
作者: Sidney Wong
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted Proceedings of the Linguistic Society of America 2025 Annual Meeting

点击查看摘要

Abstract:The advent of the internet has been both a blessing and a curse for once marginalised communities. When used well, the internet can be used to connect and establish communities crossing different intersections; however, it can also be used as a tool to alienate people and communities as well as perpetuate hate, misinformation, and disinformation especially on social media platforms. We propose steering hate speech research and researchers away from pre-existing computational solutions and consider social methods to inform social solutions to address this social problem. In a similar way linguistics research can inform language planning policy, linguists should apply what we know about language and society to mitigate some of the emergent risks and dangers of anti-social behaviour in digital spaces. We argue linguists and NLP researchers can play a principle role in unleashing the social impact potential of linguistics research working alongside communities, advocates, activists, and policymakers to enable equitable digital inclusion and to close the digital divide.
zh

[NLP-35] JNLP at SemEval-2025 Task 11: Cross-Lingual Multi-Label Emotion Detection Using Generative Models SEMEVAL-2025

【速读】: 该论文旨在解决跨语言文本情感检测中的多标签情感识别与情感强度问题(multilingual multi-label emotion detection and emotion intensity),这是在全球数字化快速发展的背景下,用户通过社交媒体进行信息交流时面临的重要挑战。其解决方案的关键在于利用预训练的多语言模型,采用两种架构:基于微调的BERT分类模型和指令调优的生成式大语言模型(generative LLM)。此外,论文提出了两种处理多标签分类的方法:基础方法直接将输入映射到所有对应的情感标签,而成对方法则分别建模输入文本与每个情感类别之间的关系,从而提升了模型在多语言环境下的泛化能力和性能。

链接: https://arxiv.org/abs/2505.13244
作者: Jieying Xue,Phuong Minh Nguyen,Minh Le Nguyen,Xin Liu
机构: Japan Advanced Institute of Science and Technology (日本高级科学技朮研究所); ROIS-DS Center for Juris-Informatics, NII, Tokyo, Japan (ROIS-DS法律信息中心,NII,东京,日本); National Institute of Advanced Industrial Science and Technology (日本产业技术综合研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published in The 19th International Workshop on Semantic Evaluation (SemEval-2025)

点击查看摘要

Abstract:With the rapid advancement of global digitalization, users from different countries increasingly rely on social media for information exchange. In this context, multilingual multi-label emotion detection has emerged as a critical research area. This study addresses SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. Our paper focuses on two sub-tracks of this task: (1) Track A: Multi-label emotion detection, and (2) Track B: Emotion intensity. To tackle multilingual challenges, we leverage pre-trained multilingual models and focus on two architectures: (1) a fine-tuned BERT-based classification model and (2) an instruction-tuned generative LLM. Additionally, we propose two methods for handling multi-label classification: the base method, which maps an input directly to all its corresponding emotion labels, and the pairwise method, which models the relationship between the input text and each emotion category individually. Experimental results demonstrate the strong generalization ability of our approach in multilingual emotion recognition. In Track A, our method achieved Top 4 performance across 10 languages, ranking 1st in Hindi. In Track B, our approach also secured Top 5 performance in 7 languages, highlighting its simplicity and effectiveness\footnoteOur code is available at this https URL.
zh

[NLP-36] Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

【速读】: 该论文旨在解决图形用户界面(Graphical User Interface, GUI)接地问题,即如何将自然语言指令映射到图形用户界面上的具体操作,这一问题在计算机使用代理开发中仍是一个关键瓶颈。现有基准将接地任务简化为短指代表达,未能捕捉真实交互中所需的软件常识、布局理解和细粒度操作能力。解决方案的关键在于引入OSWorld-G基准和合成的Jedi数据集,其中Jedi包含400万例通过多视角解耦任务生成的数据,用于提升模型在GUI接地任务上的表现。此外,实验表明,基于Jedi训练的多尺度模型在多个基准上优于现有方法,并显著提升了通用基础模型在复杂计算机任务中的代理能力。

链接: https://arxiv.org/abs/2505.13227
作者: Tianbao Xie,Jiaqi Deng,Xiaochuan Li,Junlin Yang,Haoyuan Wu,Jixuan Chen,Wenjing Hu,Xinyuan Wang,Yuhui Xu,Zekun Wang,Yiheng Xu,Junli Wang,Doyen Sahoo,Tao Yu,Caiming Xiong
机构: The University of Hong Kong(香港大学); Salesforce AI Research(Salesforce人工智能研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 49 pages, 13 figures

点击查看摘要

Abstract:Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at this https URL.
zh

[NLP-37] SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science ACL2025

【速读】: 该论文试图解决种子科学领域中存在的跨学科复杂性高、成本高且回报有限的问题,以及由此导致的专家短缺和技术支持不足。其解决方案的关键是引入SeedBench——首个针对种子科学设计的多任务基准测试平台,该平台由领域专家共同开发,专注于种子育种并模拟现代育种过程的关键方面,旨在为生成式 AI 在种子设计中的研究提供基础支撑。

链接: https://arxiv.org/abs/2505.13220
作者: Jie Ying,Zihong Chen,Zhefan Wang,Wanli Jiang,Chenyang Wang,Zhonghang Yuan,Haoyang Su,Huanjun Kong,Fan Yang,Nanqing Dong
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Yazhouwan National Laboratory (亚州湾国家实验室); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025

点击查看摘要

Abstract:Seed science is essential for modern agriculture, directly influencing crop yields and global food security. However, challenges such as interdisciplinary complexity and high costs with limited returns hinder progress, leading to a shortage of experts and insufficient technological support. While large language models (LLMs) have shown promise across various fields, their application in seed science remains limited due to the scarcity of digital resources, complex gene-trait relationships, and the lack of standardized benchmarks. To address this gap, we introduce SeedBench – the first multi-task benchmark specifically designed for seed science. Developed in collaboration with domain experts, SeedBench focuses on seed breeding and simulates key aspects of modern breeding processes. We conduct a comprehensive evaluation of 26 leading LLMs, encompassing proprietary, open-source, and domain-specific fine-tuned models. Our findings not only highlight the substantial gaps between the power of LLMs and the real-world seed science problems, but also make a foundational step for research on LLMs for seed design.
zh

[NLP-38] Picturized and Recited with Dialects: A Multimodal Chinese Representation Framework for Sentiment Analysis of Classical Chinese Poetry

【速读】: 该论文试图解决古典中文诗歌情感分析中忽视韵律和视觉特征的问题,现有研究主要基于文本含义进行情感分析,而忽略了诗歌在朗诵和配画背景下的独特表现形式。解决方案的关键在于提出一种方言增强的多模态框架,通过提取句级音频特征并融合多种方言音频以丰富语音表征,同时生成句级视觉特征,并利用大语言模型(LLM)翻译增强的文本特征进行多模态对比表示学习,实现多模态特征的融合。

链接: https://arxiv.org/abs/2505.13210
作者: Xiaocong Du,Haoyu Pei,Haipeng Zhang
机构: ShanghaiTech University (上海科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Classical Chinese poetry is a vital and enduring part of Chinese literature, conveying profound emotional resonance. Existing studies analyze sentiment based on textual meanings, overlooking the unique rhythmic and visual features inherent in poetry,especially since it is often recited and accompanied by Chinese paintings. In this work, we propose a dialect-enhanced multimodal framework for classical Chinese poetry sentiment analysis. We extract sentence-level audio features from the poetry and incorporate audio from multiple dialects,which may retain regional ancient Chinese phonetic features, enriching the phonetic representation. Additionally, we generate sentence-level visual features, and the multimodal features are fused with textual features enhanced by LLM translation through multimodal contrastive representation learning. Our framework outperforms state-of-the-art methods on two public datasets, achieving at least 2.51% improvement in accuracy and 1.63% in macro F1. We open-source the code to facilitate research in this area and provide insights for general multimodal Chinese representation.
zh

[NLP-39] Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在自回归生成过程中因草稿候选与目标模型采样输出之间对齐不足而导致的推理效率和生成质量受限的问题。其解决方案的关键在于提出一种无需训练的对齐增强型推测解码算法,通过引入对齐采样(alignment sampling)利用预填充阶段获得的输出分布生成更对齐的草稿候选,并结合一种灵活的验证策略,以适应高质量但非对齐的草稿候选,从而在提升生成准确性的同时进一步优化推理效率。

链接: https://arxiv.org/abs/2505.13204
作者: Jikai Wang,Zhenxu Tian,Juntao Li,Qingrong Xia,Xinyu Duan,Zhefeng Wang,Baoxing Huai,Min Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Pre-print

点击查看摘要

Abstract:Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the sampled outputs of the target model. Existing methods mainly achieve draft-target alignment with training-based methods, e.g., EAGLE, Medusa, involving considerable training costs. In this paper, we present a training-free alignment-augmented speculative decoding algorithm. We propose alignment sampling, which leverages output distribution obtained in the prefilling phase to provide more aligned draft candidates. To further benefit from high-quality but non-aligned draft candidates, we also introduce a simple yet effective flexible verification strategy. Through an adaptive probability threshold, our approach can improve generation accuracy while further improving inference efficiency. Experiments on 8 datasets (including question answering, summarization and code completion tasks) show that our approach increases the average generation score by 3.3 points for the LLaMA3 model. Our method achieves a mean acceptance length up to 2.39 and speed up generation by 2.23.
zh

[NLP-40] Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

【速读】: 该论文试图解决传统语音语言模型中依赖残差向量量化(Residual Vector Quantization)所带来的离散化误差以及复杂层级结构的问题。其解决方案的关键在于引入SLED,通过将语音波形编码为连续潜在表示序列,并使用能量距离目标进行自回归建模,从而避免了离散化过程,简化了整体建模流程,同时保持了语音信息的丰富性和推理效率。

链接: https://arxiv.org/abs/2505.13181
作者: Zhengrui Ma,Yang Feng,Chenze Shao,Fandong Meng,Jie Zhou,Min Zhang
机构: Tencent Inc (腾讯公司)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Demos and code are available at this https URL

点击查看摘要

Abstract:We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models.
zh

[NLP-41] oolSpectrum : Towards Personalized Tool Utilization for Large Language Models ACL2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在集成外部工具时,缺乏对用户个性化需求和环境因素的上下文感知能力的问题。现有方法过于关注功能性的工具选择,而忽略了基于用户画像和环境因素的个性化工具使用,导致用户体验不佳和工具利用效率低下。论文提出的解决方案关键在于引入ToolSpectrum基准,通过形式化用户画像和环境因素两个核心维度,分析其对工具使用的影响,并验证个性化工具利用能够显著提升用户体验。然而,当前最先进的LLMs在联合推理用户画像与环境因素方面仍存在局限性。

链接: https://arxiv.org/abs/2505.13176
作者: Zihao Cheng,Hongru Wang,Zeming Liu,Yuhang Guo,Yuanfang Guo,Yunhong Wang,Haifeng Wang
机构: Beihang University (北京航空航天大学); Beijing Institute of Technology (北京理工大学); University of Science and Technology Beijing (北京科技大学); The Chinese University of Hong Kong (香港中文大学); Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025 Findings

点击查看摘要

Abstract:While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions, overlooking the context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs’ capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool utilization. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool utilization significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit the limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations for current models. Our data and code are available at this https URL.
zh

[NLP-42] A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLM s ACL2025

【速读】: 该论文试图解决跨语言零样本泛化能力在古典语言(如梵语、古希腊语和拉丁语)中的影响因素问题,特别是针对自然语言理解任务的性能表现。其解决方案的关键在于通过检索增强生成方法引入上下文信息,以提升模型在事实性问答任务中的表现,同时揭示模型规模对跨语言泛化能力的重要影响。

链接: https://arxiv.org/abs/2505.13173
作者: V.S.D.S.Mahesh Akavarapu,Hrishikesh Terdalkar,Pramit Bhattacharyya,Shubhangi Agarwal,Vishakha Deulgaonkar,Pralay Manna,Chaitali Dangarikar,Arnab Bhattacharya
机构: University of Tübingen (图宾根大学); University of Lyon 1 (里昂第一大学); Indian Institute of Technology Kanpur (印度理工学院坎普尔分校)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 Findings

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages – Sanskrit, Ancient Greek and Latin – to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform equal to or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest model scale as an important factor influencing cross-lingual generalization. Assuming that models used such as GPT-4o and Llama-3.1 are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.
zh

[NLP-43] Positional Frag ility in LLM s: How Offset Effects Reshape Our Understanding of Memorization Risks

【速读】: 该论文试图解决大型语言模型在训练过程中可能记忆训练数据中受版权保护内容所带来的版权侵权风险问题。其解决方案的关键在于发现并利用“位置偏移效应”(offset effect),即模型对上下文窗口中较早出现的短前缀更易产生逐字记忆,而当前缀偏离初始位置时,逐字回忆能力会显著下降。研究通过将敏感数据向上下文窗口深处偏移,有效抑制了可提取的记忆和文本退化现象,从而为评估和降低记忆风险提供了一个新的关键维度。

链接: https://arxiv.org/abs/2505.13171
作者: Yixuan Xu,Antoine Bosselut,Imanol Schlag
机构: ETH AI Center, ETH Zürich, Switzerland; EPFL, Switzerland
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are known to memorize parts of their training data, posing risk of copyright violations. To systematically examine this risk, we pretrain language models (1B/3B/8B) from scratch on 83B tokens, mixing web-scale data with public domain books used to simulate copyrighted content at controlled frequencies at lengths at least ten times longer than prior work. We thereby identified the offset effect, a phenomenon characterized by two key findings: (1) verbatim memorization is most strongly triggered by short prefixes drawn from the beginning of the context window, with memorization decreasing counterintuitively as prefix length increases; and (2) a sharp decline in verbatim recall when prefix begins offset from the initial tokens of the context window. We attribute this to positional fragility: models rely disproportionately on the earliest tokens in their context window as retrieval anchors, making them sensitive to even slight shifts. We further observe that when the model fails to retrieve memorized content, it often produces degenerated text. Leveraging these findings, we show that shifting sensitive data deeper into the context window suppresses both extractable memorization and degeneration. Our results suggest that positional offset is a critical and previously overlooked axis for evaluating memorization risks, since prior work implicitly assumed uniformity by probing only from the beginning of training sequences.
zh

[NLP-44] Role-Playing Evaluation for Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在角色扮演能力评估中的挑战,尤其是传统的人工评估成本高且自动化评估可能存在偏差的问题。解决方案的关键是提出一种名为Role-Playing Eval (RPEval)的新基准,用于从情感理解、决策、道德一致性以及角色内一致性四个关键维度评估LLM的角色扮演能力。

链接: https://arxiv.org/abs/2505.13157
作者: Yassine El Boudouri,Walter Nuninger,Julian Alvarez,Yvan Peter
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency. This article details the construction of RPEval and presents baseline evaluations. Our code and dataset are available at this https URL
zh

[NLP-45] anyi: A Traditional Chinese Medicine all-rounder language model and its Real-World Clinical Practice

【速读】: 该论文旨在解决传统中医药(TCM)在实际应用中面临的精准辨证、治则确定与方剂制定等复杂问题,以及现有基于机器学习和深度学习的TCM决策系统在数据局限性和单一目标约束方面的不足。其解决方案的关键在于引入一个专为TCM设计的76亿参数大语言模型(LLM)——Tianyi,该模型通过在多样化的TCM语料库上进行预训练和微调,能够系统性地吸收和整合中医药知识,并通过渐进式学习方式提升其专业能力。此外,研究还构建了TCMEval评估基准,以全面评估模型在中医药考试、临床任务、领域特定问答和真实场景中的表现。

链接: https://arxiv.org/abs/2505.13156
作者: Zhi Liu,Tao Yang,Jing Wang,Yexin Chen,Zhan Gao,Jiaxi Yang,Kui Chen,Bingji Lu,Xiaochen Li,Changyong Luo,Yan Li,Xiaohong Gu,Peng Cao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 23 pages, 4 figures, and 1 tables

点击查看摘要

Abstract:Natural medicines, particularly Traditional Chinese Medicine (TCM), are gaining global recognition for their therapeutic potential in addressing human symptoms and diseases. TCM, with its systematic theories and extensive practical experience, provides abundant resources for healthcare. However, the effective application of TCM requires precise syndrome diagnosis, determination of treatment principles, and prescription formulation, which demand decades of clinical expertise. Despite advancements in TCM-based decision systems, machine learning, and deep learning research, limitations in data and single-objective constraints hinder their practical application. In recent years, large language models (LLMs) have demonstrated potential in complex tasks, but lack specialization in TCM and face significant challenges, such as too big model scale to deploy and issues with hallucination. To address these challenges, we introduce Tianyi with 7.6-billion-parameter LLM, a model scale proper and specifically designed for TCM, pre-trained and fine-tuned on diverse TCM corpora, including classical texts, expert treatises, clinical records, and knowledge graphs. Tianyi is designed to assimilate interconnected and systematic TCM knowledge through a progressive learning manner. Additionally, we establish TCMEval, a comprehensive evaluation benchmark, to assess LLMs in TCM examinations, clinical tasks, domain-specific question-answering, and real-world trials. The extensive evaluations demonstrate the significant potential of Tianyi as an AI assistant in TCM clinical practice and research, bridging the gap between TCM knowledge and practical application.
zh

[NLP-46] What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text

【速读】: 该论文试图解决的问题是:能否仅通过书面文本检测欺骗行为。研究认为,现有自动欺骗检测的成功可能主要由数据收集过程中引入的偏差驱动,并不能推广到其他数据集。其解决方案的关键在于提出一种基于信念的欺骗框架(belief-based deception framework),将欺骗定义为作者陈述与真实信念之间的不一致,而非事实准确性,从而能够在独立于事实的情况下研究欺骗线索。基于该框架,研究构建了三个语料库(统称为DeFaBel),并在不同条件下收集数据以支持跨语言分析,进而评估常见的语言欺骗线索,结果表明这些线索在DeFaBel语料库中与欺骗标签的相关性极低且统计上不显著。

链接: https://arxiv.org/abs/2505.13147
作者: Aswathy Velutharambath,Roman Klinger,Kai Sassenberg
机构: Institut für Maschinelle Sprachverarbeitung, University of Stuttgart(机器语言处理研究所,斯图加特大学); University of Bamberg(班贝格大学); Leibniz Institute for Psychology(莱布尼茨心理学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Can deception be detected solely from written text? Cues of deceptive communication are inherently subtle, even more so in text-only communication. Yet, prior studies have reported considerable success in automatic deception detection. We hypothesize that such findings are largely driven by artifacts introduced during data collection and do not generalize beyond specific datasets. We revisit this assumption by introducing a belief-based deception framework, which defines deception as a misalignment between an author’s claims and true beliefs, irrespective of factual accuracy, allowing deception cues to be studied in isolation. Based on this framework, we construct three corpora, collectively referred to as DeFaBel, including a German-language corpus of deceptive and non-deceptive arguments and a multilingual version in German and English, each collected under varying conditions to account for belief change and enable cross-linguistic analysis. Using these corpora, we evaluate commonly reported linguistic cues of deception. Across all three DeFaBel variants, these cues show negligible, statistically insignificant correlations with deception labels, contrary to prior work that treats such cues as reliable indicators. We further benchmark against other English deception datasets following similar data collection protocols. While some show statistically significant correlations, effect sizes remain low and, critically, the set of predictive cues is inconsistent across datasets. We also evaluate deception detection using feature-based models, pretrained language models, and instruction-tuned large language models. While some models perform well on established deception datasets, they consistently perform near chance on DeFaBel. Our findings challenge the assumption that deception can be reliably inferred from linguistic cues and call for rethinking how deception is studied and modeled in NLP.
zh

[NLP-47] Understanding Cross-Lingual Inconsistency in Large Language Models

【速读】: 该论文试图解决语言模型在跨语言推理任务中表现出的输出不一致问题,具体表现为当使用不同语言的相同查询时,模型预测结果可能不一致。其解决方案的关键在于通过“logit lens”分析模型在处理多语言多选推理问题时的隐含步骤,并发现模型依赖于各自语言的子空间而非共享语义空间,导致知识迁移能力受限。研究进一步表明,通过引导模型的潜在表示向共享语义空间靠拢,可以增强模型的跨语言知识共享能力,从而提升多语言推理性能和输出一致性。

链接: https://arxiv.org/abs/2505.13141
作者: Zheng Wei Lim,Alham Fikri Aji,Trevor Cohn
机构: The University of Melbourne (墨尔本大学); MBZUAI; Google(谷歌)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are demonstrably capable of cross-lingual transfer, but can produce inconsistent output when prompted with the same queries written in different languages. To understand how language models are able to generalize knowledge from one language to the others, we apply the logit lens to interpret the implicit steps taken by LLMs to solve multilingual multi-choice reasoning questions. We find LLMs predict inconsistently and are less accurate because they rely on subspaces of individual languages, rather than working in a shared semantic space. While larger models are more multilingual, we show their hidden states are more likely to dissociate from the shared representation compared to smaller models, but are nevertheless more capable of retrieving knowledge embedded across different languages. Finally, we demonstrate that knowledge sharing can be modulated by steering the models’ latent processing towards the shared semantic space. We find reinforcing utilization of the shared space improves the models’ multilingual reasoning performance, as a result of more knowledge transfer from, and better output consistency with English.
zh

[NLP-48] ModernGBERT: German-only 1B Encoder Model Trained from Scratch

【速读】: 该论文旨在解决在资源受限场景下编码器(encoder)模型的性能与效率问题,尤其是在德语自然语言处理(Natural Language Processing, NLP)领域。传统上,解码器-only语言模型(decoder-only language models)受到广泛关注,但编码器仍对于资源受限的应用至关重要。论文提出的解决方案是通过从头训练(trained from scratch)的方式构建高效的德语编码器模型,如ModernGBERT(134M, 1B),并引入来自ModernBERT的架构创新,以提升模型性能与参数效率。此外,论文还提出了LLäMmlein2Vec系列模型,通过将德语解码器模型转换为编码器形式,进一步评估从头训练编码器的优势。实验结果表明,ModernGBERT 1B在多项任务中优于现有最先进的德语编码器及通过LLM2Vec转换得到的编码器,证明了其在性能和参数效率上的优越性。

链接: https://arxiv.org/abs/2505.13136
作者: Anton Ehrmanntraut,Julia Wunderle,Jan Pfister,Fotis Jannidis,Andreas Hotho
机构: JMU – Julius-Maximilians-Universität Würzburg (JMU – 朱利叶斯-马克斯米利安-维尔茨堡大学); CAIDAS – Center for Artificial Intelligence and Data Science (CAIDAS – 人工智能与数据科学中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: under review @ARR

点击查看摘要

Abstract:Despite the prominence of decoder-only language models, encoders remain crucial for resource-constrained applications. We introduce ModernGBERT (134M, 1B), a fully transparent family of German encoder models trained from scratch, incorporating architectural innovations from ModernBERT. To evaluate the practical trade-offs of training encoders from scratch, we also present LLäMmlein2Vec (120M, 1B, 7B), a family of encoders derived from German decoder-only models via LLM2Vec. We benchmark all models on natural language understanding, text embedding, and long-context reasoning tasks, enabling a controlled comparison between dedicated encoders and converted decoders. Our results show that ModernGBERT 1B outperforms prior state-of-the-art German encoders as well as encoders adapted via LLM2Vec, with regard to performance and parameter-efficiency. All models, training data, checkpoints and code are publicly available, advancing the German NLP ecosystem with transparent, high-performance encoder models.
zh

[NLP-49] Zero-Shot Iterative Formalization and Planning in Partially Observable Environments

【速读】: 该论文试图解决在部分可观测环境中使用大型语言模型(Large Language Models, LLMs)进行规划时面临的挑战,传统方法因缺乏完整信息而难以有效应对。解决方案的关键在于提出PDDLego+框架,该框架通过零样本方式迭代地形式化、规划、扩展和优化PDDL表示,无需依赖任何现有轨迹数据,从而在文本模拟环境中实现了优越的性能和对问题复杂性的鲁棒性。

链接: https://arxiv.org/abs/2505.13126
作者: Liancheng Gong,Wang Zhu,Jesse Thomason,Li Zhang
机构: Drexel University (德雷塞尔大学); University of Southern California (南加州大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In planning, using LLMs not to predict plans but to formalize an environment into the Planning Domain Definition Language (PDDL) has been shown to greatly improve performance and control. While most work focused on fully observable environments, we tackle the more realistic and challenging partially observable environments where existing methods are incapacitated by the lack of complete information. We propose PDDLego+, a framework to iteratively formalize, plan, grow, and refine PDDL representations in a zero-shot manner, without needing access to any existing trajectories. On two textual simulated environments, we show that PDDLego+ not only achieves superior performance, but also shows robustness against problem complexity. We also show that the domain knowledge captured after a successful trial is interpretable and benefits future tasks.
zh

[NLP-50] Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning INTERSPEECH

【速读】: 该论文试图解决大型音频语言模型(Large Audio Language Models, LALMs)在时间推理任务上的评估问题,这类任务与传统的分类或生成任务不同,需要模型具备更复杂的多模态理解能力。解决方案的关键是提出一个名为时间推理音频评估(Temporal Reasoning Evaluation of Audio, TREA)的新数据集,并引入一种不确定性度量,用于评估模型对语义相同输入扰动的不变性。通过分析发现,准确性与不确定性指标之间并非必然相关,这表明在高风险应用中需要对LALMs进行更为全面的评估。

链接: https://arxiv.org/abs/2505.13115
作者: Debarpan Bhattacharya,Apoorva Kulkarni,Sriram Ganapathy
机构: Indian Institute of Science (印度科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted in INTERSPEECH, 2025, Rotterdam, The Netherlands

点击查看摘要

Abstract:The popular success of text-based large language models (LLM) has streamlined the attention of the multimodal community to combine other modalities like vision and audio along with text to achieve similar multimodal capabilities. In this quest, large audio language models (LALMs) have to be evaluated on reasoning related tasks which are different from traditional classification or generation tasks. Towards this goal, we propose a novel dataset called temporal reasoning evaluation of audio (TREA). We benchmark open-source LALMs and observe that they are consistently behind human capabilities on the tasks in the TREA dataset. While evaluating LALMs, we also propose an uncertainty metric, which computes the invariance of the model to semantically identical perturbations of the input. Our analysis shows that the accuracy and uncertainty metrics are not necessarily correlated and thus, points to a need for wholesome evaluation of LALMs for high-stakes applications. Comments: Accepted in INTERSPEECH, 2025, Rotterdam, The Netherlands Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS) Cite as: arXiv:2505.13115 [cs.CL] (or arXiv:2505.13115v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.13115 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-51] FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在处理长上下文时面临的部署挑战,尤其是键值缓存(KV cache)随上下文长度增加而带来的存储和效率问题。其解决方案的关键在于提出FreeKV,一个算法与系统协同优化的框架,通过引入推测性检索(speculative retrieval)将KV选择与召回过程移出关键路径,并结合细粒度校正以保证准确性;同时,在系统层面采用CPU和GPU内存的混合KV布局以消除碎片化数据传输,并利用双缓冲流式召回进一步提升效率。

链接: https://arxiv.org/abs/2505.13109
作者: Guangda Liu,Chengwei Li,Zhenyu Ning,Jing Lin,Yiwu Yao,Danning Ke,Minyi Guo,Jieru Zhao
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Technologies Co., Ltd (华为技术有限公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods are proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, an algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to 13 \times speedup compared to SOTA KV retrieval methods.
zh

[NLP-52] LLM -KG-Bench 3.0: A Compass for SemanticTechnology Capabilities in the Ocean of LLM s ESWC2025

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在语义网和知识图谱工程(Knowledge Graph Engineering, KGE)领域中的能力评估问题,特别是如何自动化评估LLMs在处理语义技术方面的表现。解决方案的关键是提出并改进LLM-KG-Bench框架3.0,该框架包含可扩展的任务集,用于自动化评估LLMs的回答,并支持多种语义技术任务,如RDF和SPARQL操作以及Turtle和JSON-LD序列化任务。此外,框架通过vllm库扩展了对多种开源模型的支持,并提供了涵盖超过30个现代开源和专有LLMs的综合数据集,以支持模型性能的比较与展示。

链接: https://arxiv.org/abs/2505.13098
作者: Lars-Peter Meyer,Johannes Frey,Desiree Heim,Felix Brei,Claus Stadler,Kurt Junghanns,Michael Martin
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注: Peer reviewed publication at ESWC 2025 Resources Track

点击查看摘要

Abstract:Current Large Language Models (LLMs) can assist developing program code beside many other things, but can they support working with Knowledge Graphs (KGs) as well? Which LLM is offering the best capabilities in the field of Semantic Web and Knowledge Graph Engineering (KGE)? Is this possible to determine without checking many answers manually? The LLM-KG-Bench framework in Version 3.0 is designed to answer these questions. It consists of an extensible set of tasks for automated evaluation of LLM answers and covers different aspects of working with semantic technologies. In this paper the LLM-KG-Bench framework is presented in Version 3 along with a dataset of prompts, answers and evaluations generated with it and several state-of-the-art LLMs. Significant enhancements have been made to the framework since its initial release, including an updated task API that offers greater flexibility in handling evaluation tasks, revised tasks, and extended support for various open models through the vllm library, among other improvements. A comprehensive dataset has been generated using more than 30 contemporary open and proprietary LLMs, enabling the creation of exemplary model cards that demonstrate the models’ capabilities in working with RDF and SPARQL, as well as comparing their performance on Turtle and JSON-LD RDF serialization tasks.
zh

[NLP-53] he Effect of Language Diversity When Fine-Tuning Large Language Models for Translation

【速读】: 该论文试图解决在大型语言模型(Large Language Model, LLM)微调过程中语言多样性对翻译质量影响的争议问题,即不同研究对语言多样性是否能提升模型性能存在分歧。其解决方案的关键在于通过在132个翻译方向上的受控微调实验,系统性地验证语言多样性对翻译质量的影响,发现增加微调阶段的语言多样性能够提升无监督和有监督翻译对的翻译质量,且这种提升源于语言无关表示(language-agnostic representations)的增强,而这一现象在超过一定多样性阈值后会趋于饱和或下降。

链接: https://arxiv.org/abs/2505.13090
作者: David Stap,Christof Monz
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prior research diverges on language diversity in LLM fine-tuning: Some studies report benefits while others find no advantages. Through controlled fine-tuning experiments across 132 translation directions, we systematically resolve these disparities. We find that expanding language diversity during fine-tuning improves translation quality for both unsupervised and – surprisingly – supervised pairs, despite less diverse models being fine-tuned exclusively on these supervised pairs. However, benefits plateau or decrease beyond a certain diversity threshold. We show that increased language diversity creates more language-agnostic representations. These representational adaptations help explain the improved performance in models fine-tuned with greater diversity.
zh

[NLP-54] Systematic Generalization in Language Models Scales with Information Entropy ACL2025

【速读】: 该论文试图解决当前语言模型在系统性泛化(systematic generalization)方面的挑战,即模型对输入中语义相似的排列敏感,并且在新情境下处理已知概念时表现不佳。论文的关键解决方案是通过训练数据中组成部分分布的熵(entropy)来描述系统性泛化的一个方面,并提出了一个用于序列到序列任务中测量熵的框架。研究发现,主流模型架构的性能与熵值呈正相关,这将系统性泛化与信息效率联系起来,表明在高熵情况下无需内置先验知识即可取得成功,而低熵情况下的表现可作为评估鲁棒系统性泛化的指标。

链接: https://arxiv.org/abs/2505.13089
作者: Sondre Wold,Lucas Georges Gabriel Charpentier,Étienne Simon
机构: University of Oslo (奥斯陆大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025: Findings

点击查看摘要

Abstract:Systematic generalization remains challenging for current language models, which are known to be both sensitive to semantically similar permutations of the input and to struggle with known concepts presented in novel contexts. Although benchmarks exist for assessing compositional behavior, it is unclear how to measure the difficulty of a systematic generalization problem. In this work, we show how one aspect of systematic generalization can be described by the entropy of the distribution of component parts in the training data. We formalize a framework for measuring entropy in a sequence-to-sequence task and find that the performance of popular model architectures scales with the entropy. Our work connects systematic generalization to information efficiency, and our results indicate that success at high entropy can be achieved even without built-in priors, and that success at low entropy can serve as a target for assessing progress towards robust systematic generalization.
zh

[NLP-55] Advancing Sequential Numerical Prediction in Autoregressive Models ACL2025

【速读】: 该论文试图解决传统自回归模型在序列生成任务中将数字视为独立标记并使用交叉熵损失导致数值序列的连贯结构被忽视的问题。其解决方案的关键在于提出数值标记完整性损失(Numerical Token Integrity Loss, NTIL),该方法从两个层面进行优化:一是标记层面,通过扩展地球移动距离(Earth Mover’s Distance, EMD)来保持数值之间的序数关系;二是序列层面,通过惩罚预测序列与实际序列之间的整体差异,从而提升数值预测性能并有效集成到大语言模型(LLM)或多模态大语言模型(MLLM)中。

链接: https://arxiv.org/abs/2505.13077
作者: Xiang Fei,Jinghui Lu,Qi Sun,Hao Feng,Yanjie Wang,Wei Shi,An-Lan Wang,Jingqun Tang,Can Huang
机构: ByteDance Inc. (字节跳动公司); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2025 Main Conference

点击查看摘要

Abstract:Autoregressive models have become the de facto choice for sequence generation tasks, but standard approaches treat digits as independent tokens and apply cross-entropy loss, overlooking the coherent structure of numerical sequences. This paper introduces Numerical Token Integrity Loss (NTIL) to address this gap. NTIL operates at two levels: (1) token-level, where it extends the Earth Mover’s Distance (EMD) to preserve ordinal relationships between numerical values, and (2) sequence-level, where it penalizes the overall discrepancy between the predicted and actual sequences. This dual approach improves numerical prediction and integrates effectively with LLMs/MLLMs. Extensive experiments show significant performance improvements with NTIL.
zh

[NLP-56] Suicide Risk Assessment Using Multimodal Speech Features: A Study on the SW1 Challenge Dataset INTERSPEECH2025

【速读】: 该论文试图解决青少年基于语音的自杀风险评估问题(suicide risk assessment),旨在通过多模态方法提升评估的准确性。其解决方案的关键在于融合多种特征表示,包括自动转录(使用WhisperX)、汉语RoBERTa的语义嵌入、WavLM的音频嵌入以及手工提取的声学特征(如MFCCs、频谱对比和基频统计),并通过加权注意力机制与mixup正则化进行有效融合,以提高模型的泛化能力。

链接: https://arxiv.org/abs/2505.13069
作者: Ambre Marie,Ilias Maoudj,Guillaume Dardenne,Gwenolé Quellec
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to the SpeechWellness Challenge at Interspeech 2025; 5 pages, 2 figures, 2 tables

点击查看摘要

Abstract:The 1st SpeechWellness Challenge conveys the need for speech-based suicide risk assessment in adolescents. This study investigates a multimodal approach for this challenge, integrating automatic transcription with WhisperX, linguistic embeddings from Chinese RoBERTa, and audio embeddings from WavLM. Additionally, handcrafted acoustic features – including MFCCs, spectral contrast, and pitch-related statistics – were incorporated. We explored three fusion strategies: early concatenation, modality-specific processing, and weighted attention with mixup regularization. Results show that weighted attention provided the best generalization, achieving 69% accuracy on the development set, though a performance gap between development and test sets highlights generalization challenges. Our findings, strictly tied to the MINI-KID framework, emphasize the importance of refining embedding representations and fusion mechanisms to enhance classification reliability.
zh

[NLP-57] SNAPE-PM: Building and Utilizing Dynamic Partner Models for Adaptive Explanation Generation

【速读】: 该论文试图解决对话系统在生成解释时如何根据受众特征和交互上下文进行动态适应的问题,这一问题对于实现有效的解释至关重要但具有显著挑战性。解决方案的关键在于构建一个形式化的计算伙伴模型以跟踪交互上下文和相关听众特征,并利用该模型在一个非平稳的马尔可夫决策过程(non-stationary Markov Decision Process)中进行动态调整的理性决策,从而确定当前最优的解释策略。通过基于贝叶斯推理的方法持续更新伙伴模型,并结合非平稳马尔可夫决策过程进行决策调整,该方法展示了对不同对话者及变化反馈行为的有效适应能力。

链接: https://arxiv.org/abs/2505.13053
作者: Amelie S. Robrecht,Christoph R. Kowalski,Stefan Kopp
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: currently under review at Frontiers in Communication

点击查看摘要

Abstract:Adapting to the addressee is crucial for successful explanations, yet poses significant challenges for dialogsystems. We adopt the approach of treating explanation generation as a non-stationary decision process, where the optimal strategy varies according to changing beliefs about the explainee and the interaction context. In this paper we address the questions of (1) how to track the interaction context and the relevant listener features in a formally defined computational partner model, and (2) how to utilize this model in the dynamically adjusted, rational decision process that determines the currently best explanation strategy. We propose a Bayesian inference-based approach to continuously update the partner model based on user feedback, and a non-stationary Markov Decision Process to adjust decision-making based on the partner model values. We evaluate an implementation of this framework with five simulated interlocutors, demonstrating its effectiveness in adapting to different partners with constant and even changing feedback behavior. The results show high adaptivity with distinct explanation strategies emerging for different partners, highlighting the potential of our approach to improve explainable AI systems and dialogsystems in general.
zh

[NLP-58] KITs Offline Speech Translation and Instruction Following Submission for IWSLT 2025

【速读】: 该论文旨在解决离线语音翻译(Offline ST)和指令遵循(Instruction Following, IF)任务中的性能提升问题,特别是在面对复杂语言理解和生成需求时。其解决方案的关键在于充分利用大型语言模型(Large Language Models, LLMs)的能力,通过多阶段处理流程和上下文融合机制来增强系统性能。在Offline ST任务中,采用多自动语音识别系统输出的融合策略,并结合LLM进行文档级上下文处理;而在IF任务中,则构建了一个端到端模型,将语音编码器与LLM相结合,并引入文档级精炼阶段以进一步提升输出质量。

链接: https://arxiv.org/abs/2505.13036
作者: Sai Koneru,Maike Züfle,Thai-Binh Nguyen,Seymanur Akti,Jan Niehues,Alexander Waibel
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The scope of the International Workshop on Spoken Language Translation (IWSLT) has recently broadened beyond traditional Speech Translation (ST) to encompass a wider array of tasks, including Speech Question Answering and Summarization. This shift is partly driven by the growing capabilities of modern systems, particularly with the success of Large Language Models (LLMs). In this paper, we present the Karlsruhe Institute of Technology’s submissions for the Offline ST and Instruction Following (IF) tracks, where we leverage LLMs to enhance performance across all tasks. For the Offline ST track, we propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. This is followed by a two-step translation process, incorporating additional refinement step to improve translation quality. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks. We complement it with a final document-level refinement stage to further enhance output quality by using contextual information.
zh

[NLP-59] opicwizard – a Modern Model-agnostic Framework for Topic Model Visualization and Interpretation

【速读】: 该论文试图解决传统主题模型解释方法中存在的局限性,即用户通常仅依赖每个主题中排名前10的词来理解主题内容,这种方法提供的信息有限且可能存在偏差。解决方案的关键在于引入topicwizard框架,该框架提供直观且交互式的工具,帮助用户更全面、准确地理解主题模型所学习到的文档、词语和主题之间的复杂语义关系。

链接: https://arxiv.org/abs/2505.13034
作者: Márton Kardos,Kenneth C. Enevoldsen,Kristoffer Laigaard Nielbo
机构: Aarhus University (奥胡斯大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 9 figures

点击查看摘要

Abstract:Topic models are statistical tools that allow their users to gain qualitative and quantitative insights into the contents of textual corpora without the need for close reading. They can be applied in a wide range of settings from discourse analysis, through pretraining data curation, to text filtering. Topic models are typically parameter-rich, complex models, and interpreting these parameters can be challenging for their users. It is typical practice for users to interpret topics based on the top 10 highest ranking terms on a given topic. This list-of-words approach, however, gives users a limited and biased picture of the content of topics. Thoughtful user interface design and visualizations can help users gain a more complete and accurate understanding of topic models’ output. While some visualization utilities do exist for topic models, these are typically limited to a certain type of topic model. We introduce topicwizard, a framework for model-agnostic topic model interpretation, that provides intuitive and interactive tools that help users examine the complex semantic relations between documents, words and topics learned by topic models.
zh

[NLP-60] MMAR: A Challenging Benchmark for Deep Reasoning in Speech Audio Music and Their Mix

【速读】: 该论文试图解决当前音频-语言模型(Audio-Language Models, ALMs)在深度推理能力评估方面的不足,特别是针对跨多学科任务的广泛真实音频场景缺乏系统性基准的问题。解决方案的关键在于构建MMAR,一个包含1,000个精心筛选的音频-问题-答案三元组的基准,覆盖声音、音乐和语音的混合模态场景,并通过分层的四类推理层级(信号层、感知层、语义层和文化层)及子类别来体现任务的多样性和复杂性。此外,每个问题均附有Chain-of-Thought(CoT)推理过程,以促进音频推理研究的发展。

链接: https://arxiv.org/abs/2505.13032
作者: Ziyang Ma,Yinghao Ma,Yanqiao Zhu,Chen Yang,Yi-Wen Chao,Ruiyang Xu,Wenxi Chen,Yuanzhe Chen,Zhuo Chen,Jian Cong,Kai Li,Keliang Li,Siyou Li,Xinfeng Li,Xiquan Li,Zheng Lian,Yuzhe Liang,Minghao Liu,Zhikang Niu,Tianrui Wang,Yuping Wang,Yuxuan Wang,Yihao Wu,Guanrou Yang,Jianwei Yu,Ruibin Yuan,Zhisheng Zheng,Ziya Zhou,Haina Zhu,Wei Xue,Emmanouil Benetos,Kai Yu,Eng-Siong Chng,Xie Chen
机构: Shanghai Jiao Tong University (上海交通大学); Nanyang Technological University (南洋理工大学); Queen Mary University of London (伦敦玛丽女王大学); ByteDance (字节跳动); Shanghai Innovation Institute (上海创新研究院); Tsinghua University (清华大学); University of Chinese Academy of Sciences (中国科学院大学); 2077AI; The Hong Kong University of Science and Technology (香港科技大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: ound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Open-source at this https URL

点击查看摘要

Abstract:We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark’s difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark’s challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.
zh

[NLP-61] Evaluatiing the efficacy of LLM Safety Solutions : The Palit Benchmark Dataset

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在工业应用中面临的安全威胁问题,特别是用户通过恶意查询导致系统泄露敏感数据或引发法律风险等攻击。其解决方案的关键在于对现有的LLM安全工具进行系统的比较分析,识别出有效的防护机制,并通过构建恶意提示基准数据集评估这些工具在对抗恶意输入时的性能。研究强调了在可用性与检测性能之间取得平衡的重要性,并提出了提升闭源工具透明度、改进上下文感知检测、增强开源社区参与度等改进建议。

链接: https://arxiv.org/abs/2505.13028
作者: Sayon Palit,Daniel Woods
机构: University of Edinburgh (爱丁堡大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly integrated into critical systems in industries like healthcare and finance. Users can often submit queries to LLM-enabled chatbots, some of which can enrich responses with information retrieved from internal databases storing sensitive data. This gives rise to a range of attacks in which a user submits a malicious query and the LLM-system outputs a response that creates harm to the owner, such as leaking internal data or creating legal liability by harming a third-party. While security tools are being developed to counter these threats, there is little formal evaluation of their effectiveness and usability. This study addresses this gap by conducting a thorough comparative analysis of LLM security tools. We identified 13 solutions (9 closed-source, 4 open-source), but only 7 were evaluated due to a lack of participation by proprietary model this http URL evaluate, we built a benchmark dataset of malicious prompts, and evaluate these tools performance against a baseline LLM model (ChatGPT-3.5-Turbo). Our results show that the baseline model has too many false positives to be used for this task. Lakera Guard and ProtectAI LLM Guard emerged as the best overall tools showcasing the tradeoff between usability and performance. The study concluded with recommendations for greater transparency among closed source providers, improved context-aware detections, enhanced open-source engagement, increased user awareness, and the adoption of more representative performance metrics.
zh

[NLP-62] o Bias or Not to Bias: Detecting bias in News with bias-detector

【速读】: 该论文试图解决媒体偏见检测中的挑战,特别是在缺乏高质量标注数据和偏见主观性导致的准确性不足问题。其解决方案的关键在于对基于RoBERTa的模型进行微调,利用专家标注的BABE数据集实现句子级别的偏见分类,并通过统计检验验证了模型性能的显著提升。此外,通过注意力机制分析,模型能够避免对政治敏感词的过度敏感,更关注语境相关的标记,从而提高了检测的合理性和可解释性。

链接: https://arxiv.org/abs/2505.13010
作者: Himel Ghosh,Ahmed Mosharafa,Georg Groh
机构: Technical University of Munich (TUM), Germany; Sapienza University of Rome, Italy
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 7 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Media bias detection is a critical task in ensuring fair and balanced information dissemination, yet it remains challenging due to the subjectivity of bias and the scarcity of high-quality annotated data. In this work, we perform sentence-level bias classification by fine-tuning a RoBERTa-based model on the expert-annotated BABE dataset. Using McNemar’s test and the 5x2 cross-validation paired t-test, we show statistically significant improvements in performance when comparing our model to a domain-adaptively pre-trained DA-RoBERTa baseline. Furthermore, attention-based analysis shows that our model avoids common pitfalls like oversensitivity to politically charged terms and instead attends more meaningfully to contextually relevant tokens. For a comprehensive examination of media bias, we present a pipeline that combines our model with an already-existing bias-type classifier. Our method exhibits good generalization and interpretability, despite being constrained by sentence-level analysis and dataset size because of a lack of larger and more advanced bias corpora. We talk about context-aware modeling, bias neutralization, and advanced bias type classification as potential future directions. Our findings contribute to building more robust, explainable, and socially responsible NLP systems for media bias detection.
zh

[NLP-63] Evaluating the Performance of RAG Methods for Conversational AI in the Airport Domain NAACL2025

【速读】: 该论文旨在解决机场环境中信息查询与沟通的效率与准确性问题,特别是在面对复杂、动态及专业术语密集的航班信息时。解决方案的关键在于构建三种不同的检索增强生成(Retrieval-Augmented Generation, RAG)方法,包括传统RAG、SQL RAG和基于知识图谱的RAG(Graph RAG),以提升系统在处理机场术语、缩写及需要推理的动态问题时的准确性和可靠性。其中,Graph RAG在处理涉及推理的问题上表现尤为突出,且相比传统RAG具有更少的幻觉现象,因此被推荐用于机场环境。

链接: https://arxiv.org/abs/2505.13006
作者: Yuyang Li,Philip J.M. Kerbusch,Raimon H.R. Pruim,Tobias Käfer
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Royal Schiphol Group (皇家史基浦集团)
类目: Computation and Language (cs.CL)
备注: Accepted by NAACL 2025 industry track

点击查看摘要

Abstract:Airports from the top 20 in terms of annual passengers are highly dynamic environments with thousands of flights daily, and they aim to increase the degree of automation. To contribute to this, we implemented a Conversational AI system that enables staff in an airport to communicate with flight information systems. This system not only answers standard airport queries but also resolves airport terminology, jargon, abbreviations, and dynamic questions involving reasoning. In this paper, we built three different Retrieval-Augmented Generation (RAG) methods, including traditional RAG, SQL RAG, and Knowledge Graph-based RAG (Graph RAG). Experiments showed that traditional RAG achieved 84.84% accuracy using BM25 + GPT-4 but occasionally produced hallucinations, which is risky to airport safety. In contrast, SQL RAG and Graph RAG achieved 80.85% and 91.49% accuracy respectively, with significantly fewer hallucinations. Moreover, Graph RAG was especially effective for questions that involved reasoning. Based on our observations, we thus recommend SQL RAG and Graph RAG are better for airport environments, due to fewer hallucinations and the ability to handle dynamic questions.
zh

[NLP-64] EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM -Generated Code

【速读】: 该论文试图解决现有代码生成基准主要评估功能正确性,而对代码效率关注有限,并且通常仅限于单一语言(如Python)的问题。其解决方案的关键是引入EffiBench-X,这是首个多语言基准,旨在衡量大语言模型(LLM)生成代码的效率。EffiBench-X支持多种编程语言,并包含具有人类专家解决方案的竞赛编程任务作为效率基线,从而能够全面评估LLM生成代码在不同语言中的效率表现。

链接: https://arxiv.org/abs/2505.13004
作者: Yuhao Qing,Boyu Zhu,Mingzhe Du,Zhijiang Guo,Terry Yue Zhuo,Qianru Zhang,Jie M. Zhang,Heming Cui,Siu-Ming Yiu,Dong Huang,See-Kiong Ng,Luu Anh Tuan
机构: HKU(香港大学); UCL(伦敦大学学院); NTU(南洋理工大学); NUS(新加坡国立大学); HKUST (GZ)(香港科技大学(广州)); HKUST(香港科技大学); Monash University(莫纳什大学); CSIRO’s Data61(澳大利亚联邦科学与工业研究组织的数据61实验室); KCL(伦敦国王学院)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Existing code generation benchmarks primarily evaluate functional correctness, with limited focus on code efficiency and often restricted to a single language like Python. To address this gap, we introduce EffiBench-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around \textbf62% of human efficiency on average, with significant language-specific variations. LLMs show better efficiency in Python, Ruby, and JavaScript than in Java, C++, and Golang. For instance, DeepSeek-R1’s Python code is significantly more efficient than its Java code. These results highlight the critical need for research into LLM optimization techniques to improve code efficiency across diverse languages. The dataset and evaluation infrastructure are submitted and available at this https URL and this https URL.
zh

[NLP-65] ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning

【速读】: 该论文旨在解决在机器翻译(Machine Translation, MT)中如何充分利用大型推理模型(Large Reasoning Models, LRMs)的推理能力,以提升翻译质量的问题。其关键解决方案是设计一种新的奖励建模方法,该方法通过将策略模型的翻译结果与强大的LRM(如DeepSeek-R1-671B)进行比较,并量化比较结果以生成奖励信号,从而更有效地利用强化学习(Reinforcement Learning, RL)的潜力。此方法在文学翻译任务中取得了当前最优性能,并成功扩展至多语言场景,实现了从单一方向到90个翻译方向的高效迁移。

链接: https://arxiv.org/abs/2505.12996
作者: Jiaan Wang,Fandong Meng,Jie Zhou
机构: Tencent Inc (腾讯公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:In recent years, the emergence of large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, has shown impressive capabilities in complex problems, e.g., mathematics and coding. Some pioneering studies attempt to bring the success of LRMs in neural machine translation (MT). They try to build LRMs with deep reasoning MT ability via reinforcement learning (RL). Despite some progress that has been made, these attempts generally focus on several high-resource languages, e.g., English and Chinese, leaving the performance on other languages unclear. Besides, the reward modeling methods in previous work do not fully unleash the potential of reinforcement learning in MT. In this work, we first design a new reward modeling method that compares the translation results of the policy MT model with a strong LRM (i.e., DeepSeek-R1-671B), and quantifies the comparisons to provide rewards. Experimental results demonstrate the superiority of the reward modeling method. Using Qwen2.5-7B-Instruct as the backbone, the trained model achieves the new state-of-the-art performance in literary translation, and outperforms strong LRMs including OpenAI-o1 and DeepSeeK-R1. Furthermore, we extend our method to the multilingual settings with 11 languages. With a carefully designed lightweight reward modeling in RL, we can simply transfer the strong MT ability from a single direction into multiple (i.e., 90) translation directions and achieve impressive multilingual MT performance.
zh

[NLP-66] Fractured Chain-of-Thought Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中因使用Chain-of-Thought (CoT)提示方法而导致的高token消耗问题,从而限制其在延迟敏感场景中的部署。论文提出的解决方案关键在于引入Fractured Sampling,这是一种统一的推理阶段策略,通过在三个正交维度上对完整CoT与仅输出答案的采样方式进行插值,即推理轨迹数量、每条轨迹的最终解数量以及推理过程的截断深度,从而实现更高效的计算资源分配和更好的准确率-成本权衡。

链接: https://arxiv.org/abs/2505.12992
作者: Baohao Liao,Hanze Dong,Yuhui Xu,Doyen Sahoo,Christof Monz,Junnan Li,Caiming Xiong
机构: University of Amsterdam (阿姆斯特丹大学); Salesforce AI Research (Salesforce AI Research)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning.
zh

[NLP-67] An Empirical Study of Many-to-Many Summarization with Large Language Models ACL2025

【速读】: 该论文旨在解决多对多摘要(Many-to-many summarization, M2MS)问题,即处理任意语言的文档并生成对应语言的摘要。其解决方案的关键在于系统性地评估大型语言模型(Large Language Models, LLMs)在M2MS任务中的表现,并通过零样本和指令微调两种方式提升其性能。研究构建了一个涵盖五个领域和六种语言的多样化数据集,用于训练和评估LLMs,并发现指令微调能够显著提升开源LLMs的M2MS能力,同时保持其通用任务求解能力,但同时也揭示了事实性错误问题仍需进一步解决。

链接: https://arxiv.org/abs/2505.12983
作者: Jiaan Wang,Fandong Meng,Zengkui Sun,Yunlong Liang,Yuxuan Cao,Jiarong Xu,Haoxiang Shi,Jie Zhou
机构: Tencent Inc(腾讯公司); Fudan Unversity(复旦大学); Waseda University(早稻田大学); Beijing Jiaotong University(北京交通大学); Zhejiang University(浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 main conference

点击查看摘要

Abstract:Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries also in any language. Recently, large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study on LLMs’ M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which could be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also conducted for comparisons. Our experiments reveal that, zero-shot LLMs achieve competitive results with fine-tuned traditional models. After instruct-tuning, open-source LLMs can significantly improve their M2MS ability, and outperform zero-shot LLMs (including GPT-4) in terms of automatic evaluations. In addition, we demonstrate that this task-specific improvement does not sacrifice the LLMs’ general task-solving abilities. However, as revealed by our human evaluation, LLMs still face the factuality issue, and the instruction tuning might intensify the issue. Thus, how to control factual errors becomes the key when building LLM summarizers in real applications, and is worth noting in future research.
zh

[NLP-68] Fast Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models

【速读】: 该论文试图解决图素到音素(Grapheme-to-Phoneme, G2P)转换中的同形异义词消歧问题,尤其是在低资源语言中的挑战。该问题的两个关键方面是:构建平衡且全面的同形异义词数据集成本高且耗时,以及特定的消歧策略会引入额外延迟,不适合实时应用如屏幕阅读器等。论文的关键解决方案是提出一种半自动化数据集构建流程,并生成了HomoRich数据集以提升深度学习G2P系统的性能;同时倡导采用丰富的离线数据集来优化适用于低延迟场景的规则型方法,具体表现为改进eSpeak系统为快速的同形异义词感知版本HomoFast eSpeak,从而在准确率上实现了约30%的提升。

链接: https://arxiv.org/abs/2505.12973
作者: Mahta Fetrat Qharabagh,Zahra Dehghanian,Hamid R. Rabiee
机构: Sharif University of Technology (沙里夫理工大学)
类目: Computation and Language (cs.CL)
备注: 8 main body pages, total 25 pages, 15 figures

点击查看摘要

Abstract:Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) specific disambiguation strategies introduce additional latency, making them unsuitable for real-time applications such as screen readers and other accessibility tools. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by applying it to enhance a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate for a paradigm shift - utilizing rich offline datasets to inform the development of fast, rule-based methods suitable for latency-sensitive accessibility applications like screen readers. To this end, we improve one of the most well-known rule-based G2P systems, eSpeak, into a fast homograph-aware version, HomoFast eSpeak. Our results show an approximate 30% improvement in homograph disambiguation accuracy for the deep learning-based and eSpeak systems.
zh

[NLP-69] A Structured Literature Review on Traditional Approaches in Current Natural Language Processing

【速读】: 该论文旨在评估传统方法在五个典型自然语言处理应用场景中的现状及未来潜力,尽管大型语言模型取得了显著成功,但传统方法仍被广泛使用。其关键在于通过分析分类、信息与关系抽取、文本简化以及文本摘要等任务中传统技术的使用方式,揭示它们在当前研究中的角色,如作为处理流程的一部分、核心模型的对比基准或主要模型。

链接: https://arxiv.org/abs/2505.12970
作者: Robin Jegan,Andreas Henrich
机构: Otto-Friedrich-Universität Bamberg, Lehrstuhl für Medieninformatik (奥托·弗里德里希大学班贝格分校,媒体信息学教席)
类目: Computation and Language (cs.CL)
备注: 14 pages, 1 figure

点击查看摘要

Abstract:The continued rise of neural networks and large language models in the more recent past has altered the natural language processing landscape, enabling new approaches towards typical language tasks and achieving mainstream success. Despite the huge success of large language models, many disadvantages still remain and through this work we assess the state of the art in five application scenarios with a particular focus on the future perspectives and sensible application scenarios of traditional and older approaches and techniques. In this paper we survey recent publications in the application scenarios classification, information and relation extraction, text simplification as well as text summarization. After defining our terminology, i.e., which features are characteristic for traditional techniques in our interpretation for the five scenarios, we survey if such traditional approaches are still being used, and if so, in what way they are used. It turns out that all five application scenarios still exhibit traditional models in one way or another, as part of a processing pipeline, as a comparison/baseline to the core model of the respective paper, or as the main model(s) of the paper. For the complete statistics, see this https URL Comments: 14 pages, 1 figure Subjects: Computation and Language (cs.CL) Cite as: arXiv:2505.12970 [cs.CL] (or arXiv:2505.12970v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.12970 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-70] Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down INTERSPEECH2025

【速读】: 该论文旨在解决OpenAI的Whisper模型在非语音段落中出现的幻觉(hallucination)问题,这一问题限制了其在复杂工业场景中的广泛应用。论文提出的解决方案关键在于通过逐头掩码(head-wise mask)分析Whisper-large-v3解码器中每个自注意力头对幻觉的贡献,并发现仅有3个头负责超过75%的幻觉现象,随后对这3个头进行微调,使用非语音数据以减少幻觉,最终提出的Calm-Whisper模型在非语音幻觉上实现了超过80%的降低,同时仅带来小于0.1%的词错误率(WER)下降。

链接: https://arxiv.org/abs/2505.12969
作者: Yingzhi Wang,Anas Alhmoud,Saad Alsahly,Muhammad Alqurishi,Mirco Ravanelli
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:OpenAI’s Whisper has achieved significant success in Automatic Speech Recognition. However, it has consistently been found to exhibit hallucination issues, particularly in non-speech segments, which limits its broader application in complex industrial settings. In this paper, we introduce a novel method to reduce Whisper’s hallucination on non-speech segments without using any pre- or post-possessing techniques. Specifically, we benchmark the contribution of each self-attentional head in the Whisper-large-v3 decoder to the hallucination problem by performing a head-wise mask. Our findings reveal that only 3 of the 20 heads account for over 75% of the hallucinations on the UrbanSound dataset. We then fine-tune these three crazy heads using a collection of non-speech data. The results show that our best fine-tuned model, namely Calm-Whisper, achieves over 80% reduction in non-speech hallucination with only less than 0.1% WER degradation on LibriSpeech test-clean and test-other. Comments: Accepted to Interspeech 2025 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2505.12969 [cs.CL] (or arXiv:2505.12969v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.12969 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-71] MA-COIR: Leverag ing Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition

【速读】: 该论文旨在解决生物医学文本中概念识别的问题,特别是传统方法在识别未明确提及的复杂概念时存在的局限性。其解决方案的关键在于引入MA-COIR框架,将概念识别重新定义为索引-识别任务,通过为概念分配语义搜索索引(ssIDs)来解决本体库条目中的歧义并提高识别效率。该方法基于微调的小数据集上的预训练BART模型,并结合大语言模型(LLM)生成的查询和合成数据,在低资源环境下提升了识别效果。

链接: https://arxiv.org/abs/2505.12964
作者: Shanshan Liu,Noriki Nishida,Rumana Ferdous Munne,Narumi Tokunaga,Yuki Yamagata,Kouji Kozaki,Yuji Matsumoto
机构: RIKEN AIP (理化学研究所人工智能中心); University of Tsukuba (筑波大学); RIKEN R-IH (理化学研究所生命医科学综合研究中心); RIKEN BRC (理化学研究所生物资源中心); Osaka Electro-Communication University (大阪电気通信大学)
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Recognizing biomedical concepts in the text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, relying on explicit mention identification, often fail to capture complex concepts not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Using a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements to facilitate adoption by domain experts. Furthermore, we incorporate large language models (LLMs)-generated queries and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical domain applications. Our code and constructed data are available at this https URL.
zh

[NLP-72] GuRE:Generative Query REwriter for Legal Passage Retrieval

【速读】: 该论文试图解决法律文本检索(Legal Passage Retrieval, LPR)中查询与目标段落之间存在的显著词汇不匹配问题,这一问题限制了现有系统的性能。解决方案的关键在于提出一种简单而有效的方法——生成式查询重写(Generative query REwriter, GuRE),通过利用大语言模型(Large Language Models, LLMs)的生成能力对查询进行重写,从而缓解词汇不匹配问题,提升检索效果。实验结果表明,GuRE在不依赖特定检索器的情况下显著提升了性能。

链接: https://arxiv.org/abs/2505.12950
作者: Daehee Kim,Deokhyung Kang,Jonghwi Kim,Sangwon Ryu,Gary Geunbae Lee
机构: POSTECH(韩国浦项科技大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 9 figures

点击查看摘要

Abstract:Legal Passage Retrieval (LPR) systems are crucial as they help practitioners save time when drafting legal arguments. However, it remains an underexplored avenue. One primary reason is the significant vocabulary mismatch between the query and the target passage. To address this, we propose a simple yet effective method, the Generative query REwriter (GuRE). We leverage the generative capabilities of Large Language Models (LLMs) by training the LLM for query rewriting. “Rewritten queries” help retrievers to retrieve target passages by mitigating vocabulary mismatch. Experimental results show that GuRE significantly improves performance in a retriever-agnostic manner, outperforming all baseline methods. Further analysis reveals that different training objectives lead to distinct retrieval behaviors, making GuRE more suitable than direct retriever fine-tuning for real-world applications. Codes are avaiable at this http URL.
zh

[NLP-73] Neural Morphological Tagging for Nguni Languages

【速读】: 该论文试图解决的是形态学解析(morphological parsing)问题,即如何将词语分解为最小的语义单位——词素,并标注其语法角色。对于像南非恩古尼语(Nguni languages)这样的黏着语,这一任务尤为复杂。论文的关键解决方案是采用神经网络方法构建形态学标注器,通过比较从头训练的神经序列标注器(如LSTMs和神经CRFs)与微调预训练语言模型的效果,发现神经标注器在性能上显著优于传统基于规则的解析器,且从头训练的模型通常优于预训练模型。此外,研究还探讨了不同上游分段器及语言输入特征对解析结果的影响。

链接: https://arxiv.org/abs/2505.12949
作者: Cael Marquard,Simbarashe Mawere,Francois Meyer
机构: University of Cape Town (开普敦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Morphological parsing is the task of decomposing words into morphemes, the smallest units of meaning in a language, and labelling their grammatical roles. It is a particularly challenging task for agglutinative languages, such as the Nguni languages of South Africa, which construct words by concatenating multiple morphemes. A morphological parsing system can be framed as a pipeline with two separate components, a segmenter followed by a tagger. This paper investigates the use of neural methods to build morphological taggers for the four Nguni languages. We compare two classes of approaches: training neural sequence labellers (LSTMs and neural CRFs) from scratch and finetuning pretrained language models. We compare performance across these two categories, as well as to a traditional rule-based morphological parser. Neural taggers comfortably outperform the rule-based baseline and models trained from scratch tend to outperform pretrained models. We also compare parsing results across different upstream segmenters and with varying linguistic input features. Our findings confirm the viability of employing neural taggers based on pre-existing morphological segmenters for the Nguni languages.
zh

[NLP-74] A3 : an Analytical Low-Rank Approximation Framework for Attention

【速读】: 该论文旨在解决大规模语言模型在部署时因参数量过大而导致的高成本问题,以及现有低秩近似方法在考虑Transformer架构特性与运行时开销方面的不足。其解决方案的关键在于提出一种后训练低秩近似框架 \tt A^\tt 3 ,该框架将Transformer层划分为三个功能组件(\tt QK、\tt OV和\tt MLP),并对每个组件提供解析解,以在最小化功能损失的同时减少隐藏维度大小,从而直接降低模型规模、KV缓存大小和FLOPs,且不引入任何运行时开销。

链接: https://arxiv.org/abs/2505.12942
作者: Jeffrey T. H. Wong,Cheng Zhang,Xinye Cao,Pedro Gimenes,George A. Constantinides,Wayne Luk,Yiren Zhao
机构: Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) They focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices. Consequently, these methods often fall short compared to other compression techniques like pruning and quantization, and introduce runtime overhead such as the extra GEMM kernel launches for decomposed small matrices. To address these limitations, we propose \tt A^\tt 3 , a post-training low-rank approximation framework. \tt A^\tt 3 splits a Transformer layer into three functional components, namely \tt QK , \tt OV , and \tt MLP . For each component, \tt A^\tt 3 provides an analytical solution that reduces the hidden dimension size inside each component while minimizing the component’s functional loss ( \it i.e. , error in attention scores, attention outputs, and MLP outputs). This approach directly reduces model sizes, KV cache sizes, and FLOPs without introducing any runtime overheads. In addition, it provides a new narrative in advancing the optimization problem from singular linear layer loss optimization toward improved end-to-end performance. Through extensive experiments, we show that \tt A^\tt 3 maintains superior performance compared to SoTAs. For example, under the same reduction budget in computation and memory, our low-rank approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2, outperforming the previous SoTA’s 7.87 by 3.18. We also demonstrate the versatility of \tt A^\tt 3 , including KV cache compression, quantization, and mixed-rank assignments for enhanced performance.
zh

[NLP-75] Leverag ing LLM Inconsistency to Boost Pass@k Performance

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对微小输入变化时表现不一致的问题,而非将其视为缺陷。论文提出了一种新颖的方法,利用模型的不一致性来提升Pass@k性能。解决方案的关键在于引入一个名为“Variator”的智能体,该智能体能够生成给定任务的k个变体,并为每个变体提交一个候选解。该方法具有任务无关性,适用于多种领域,并且兼容自由格式输入,从而有效提升了模型在复杂任务中的稳定性与准确性。

链接: https://arxiv.org/abs/2505.12938
作者: Uri Dalal,Meirav Segal,Zvika Ben-Haim,Dan Lahav,Omer Nevo
机构: Pattern Labs (模式实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve impressive abilities in numerous domains, but exhibit inconsistent performance in response to minor input changes. Rather than view this as a drawback, in this paper we introduce a novel method for leveraging models’ inconsistency to boost Pass@k performance. Specifically, we present a “Variator” agent that generates k variants of a given task and submits one candidate solution for each one. Our variant generation approach is applicable to a wide range of domains as it is task agnostic and compatible with free-form inputs. We demonstrate the efficacy of our agent theoretically using a probabilistic model of the inconsistency effect, and show empirically that it outperforms the baseline on the APPS dataset. Furthermore, we establish that inconsistency persists even in frontier reasoning models across coding and cybersecurity domains, suggesting our method is likely to remain relevant for future model generations.
zh

[NLP-76] Do Not Let Low-Probability Tokens Over-Dominate in RL for LLM s

【速读】: 该论文试图解决强化学习(Reinforcement Learning, RL)训练中低概率标记(low-probability tokens)因梯度幅度较大而对模型更新产生过度影响的问题,这一现象抑制了高概率标记(high-probability tokens)的梯度,从而影响大型语言模型(Large Language Models, LLMs)的性能。解决方案的关键在于提出两种新方法:优势重加权(Advantage Reweighting)和低概率标记隔离(Low-Probability Token Isolation, Lopti),二者均能有效衰减低概率标记的梯度,同时强调由高概率标记驱动的参数更新,从而实现不同概率标记间的平衡更新,提升RL训练效率。

链接: https://arxiv.org/abs/2505.12929
作者: Zhihe Yang,Xufang Luo,Zilong Wang,Dongqi Han,Zhiyuan He,Dongsheng Li,Yunjian Xu
机构: The Chinese University of Hong Kong, Hong Kong SAR, China. ; Microsoft Research Asia, Shanghai, China.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 12 figures

点击查看摘要

Abstract:Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs’ performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in KK Logic Puzzle reasoning tasks. Our implementation is available at this https URL.
zh

[NLP-77] PyFCG: Fluid Construction Grammar in Python

【速读】: 该论文试图解决将流体构式语法(Fluid Construction Grammar, FCG)集成到Python编程环境中的问题,以便用户能够更方便地在Python生态系统中使用FCG功能。解决方案的关键在于开发了一个名为PyFCG的开源软件库,该库实现了FCG的核心功能,并支持与Python其他库的无缝集成,从而提升了FCG在自然语言处理及相关领域的应用灵活性和扩展性。

链接: https://arxiv.org/abs/2505.12920
作者: Paul Van Eecke,Katrien Beuls
机构: Vrije Universiteit Brussel (布鲁塞尔自由大学); Université de Namur (纳穆尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We present PyFCG, an open source software library that ports Fluid Construction Grammar (FCG) to the Python programming language. PyFCG enables its users to seamlessly integrate FCG functionality into Python programs, and to use FCG in combination with other libraries within Python’s rich ecosystem. Apart from a general description of the library, this paper provides three walkthrough tutorials that demonstrate example usage of PyFCG in typical use cases of FCG: (i) formalising and testing construction grammar analyses, (ii) learning usage-based construction grammars from corpora, and (iii) implementing agent-based experiments on emergent communication.
zh

[NLP-78] AutoGEEval: A Multimodal and Automated Framework for Geospatial Code Generation on GEE with Large Language Models

【速读】: 该论文试图解决地理空间代码生成领域缺乏标准化自动评估工具的问题。解决方案的关键在于提出AutoGEEval,这是首个基于大型语言模型(Large Language Models, LLMs)的多模态、单元级自动化评估框架,旨在对Google Earth Engine(GEE)平台上的地理空间代码生成任务进行端到端的自动评估,通过集成问题生成与答案验证组件,实现从函数调用到执行验证的全流程自动化,并支持多维定量分析模型输出的准确性、资源消耗、执行效率及错误类型。

链接: https://arxiv.org/abs/2505.12900
作者: Shuyang Hou,Zhangxiao Shen,Huayi Wu,Jianyuan Liang,Haoyue Jiao,Yaxian Qing,Xiaopu Zhang,Xu Li,Zhipeng Gui,Xuefeng Guan,Longgang Xiang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline-from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs-including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models-revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation.
zh

[NLP-79] On the Thinking-Language Modeling Gap in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在进行System 2推理时存在的语言建模偏差问题,即模型在模拟人类思维过程时容易受到语言偏见的影响,导致推理过程偏离实际的因果逻辑。解决方案的关键在于提出一种称为Language-of-Thoughts (LoT)的提示技术,通过调整模型对所有相关信息的表达顺序和标记使用方式,减少语言建模偏差,并提升模型在多种推理任务中的表现。

链接: https://arxiv.org/abs/2505.12896
作者: Chenxi Liu,Yongqiang Chen,Tongliang Liu,James Cheng,Bo Han,Kun Zhang
机构: Hong Kong Baptist University (香港浸会大学); MBZUAI (MBZUAI); Carnegie Mellon University (卡内基梅隆大学); The University of Sydney (悉尼大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Chenxi and Yongqiang contributed equally; project page: this https URL

点击查看摘要

Abstract:System 2 reasoning is one of the defining characteristics of intelligence, which requires slow and logical thinking. Human conducts System 2 reasoning via the language of thoughts that organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from Large Language Models (LLMs) pre-trained on large-scale natural languages. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. As language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily absorb language biases into LLMs deviated from the chain of thoughts in minds. Furthermore, we show that the biases will mislead the eliciting of “thoughts” in LLMs to focus only on a biased part of the premise. To this end, we propose a new prompt technique termed Language-of-Thoughts (LoT) to demonstrate and alleviate this gap. Instead of directly eliciting the chain of thoughts from partial information, LoT instructs LLMs to adjust the order and token used for the expressions of all the relevant information. We show that the simple strategy significantly reduces the language modeling biases in LLMs and improves the performance of LLMs across a variety of reasoning tasks.
zh

[NLP-80] IME: A Multi-level Benchmark for Temporal Reasoning of LLM s in Real-World Scenarios

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在现实世界中进行时间推理时面临的三大挑战:密集的时间信息、快速变化的事件动态以及社会互动中的复杂时间依赖关系。其解决方案的关键是提出一个多层次的基准测试框架TIME,该框架包含38,522个问答对,覆盖三个层次和11个细粒度子任务,并包含三个反映不同现实挑战的子数据集:TIME-Wiki、TIME-News和TIME-Dial,以全面评估和促进时间推理能力的研究。

链接: https://arxiv.org/abs/2505.12891
作者: Shaohang Wei,Wei Li,Feifan Song,Wen Luo,Tianyi Zhuang,Haochen Tan,Zhijiang Guo,Houfeng Wang
机构: Peking University (北京大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: First version. There are still some examples to be added into the appendix

点击查看摘要

Abstract:Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. And we conducted an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarized the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at this https URL , and the dataset is available at this https URL .
zh

[NLP-81] GAP: Graph-Assisted Prompts for Dialogue-based Medication Recommendation

【速读】: 该论文旨在解决对话式药物推荐中的准确性与安全性问题,特别是在多轮对话中,大型语言模型(LLM)可能忽略细粒度的医学信息或对话轮次间的关联性,以及在缺乏领域专业知识时生成非事实性响应的问题。解决方案的关键在于提出一种图辅助提示(Graph-Assisted Prompts, GAP)框架,该框架通过从对话中提取医学概念及其状态构建显式的以患者为中心的图,从而描述被忽视但重要的信息,并结合外部医学知识图谱生成丰富的查询和提示,以从多源信息中检索内容并减少非事实性响应。

链接: https://arxiv.org/abs/2505.12888
作者: Jialun Zhong,Yanzeng Li,Sen Hu,Yang Zhang,Teng Xu,Lei Zou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Medication recommendations have become an important task in the healthcare domain, especially in measuring the accuracy and safety of medical dialogue systems (MDS). Different from the recommendation task based on electronic health records (EHRs), dialogue-based medication recommendations require research on the interaction details between patients and doctors, which is crucial but may not exist in EHRs. Recent advancements in large language models (LLM) have extended the medical dialogue domain. These LLMs can interpret patients’ intent and provide medical suggestions including medication recommendations, but some challenges are still worth attention. During a multi-turn dialogue, LLMs may ignore the fine-grained medical information or connections across the dialogue turns, which is vital for providing accurate suggestions. Besides, LLMs may generate non-factual responses when there is a lack of domain-specific knowledge, which is more risky in the medical domain. To address these challenges, we propose a \textbfGraph-\textbfAssisted \textbfPrompts (\textbfGAP) framework for dialogue-based medication recommendation. It extracts medical concepts and corresponding states from dialogue to construct an explicitly patient-centric graph, which can describe the neglected but important information. Further, combined with external medical knowledge graphs, GAP can generate abundant queries and prompts, thus retrieving information from multiple sources to reduce the non-factual responses. We evaluate GAP on a dialogue-based medication recommendation dataset and further explore its potential in a more difficult scenario, dynamically diagnostic interviewing. Extensive experiments demonstrate its competitive performance when compared with strong baselines.
zh

[NLP-82] Detection and Mitigation of Hallucination in Large Reasoning Models: A Mechanistic Perspective

【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)中出现的推理幻觉(Reasoning Hallucination)问题,即模型在逻辑上连贯但事实错误的推理过程中得出具有说服力却错误的结论。解决方案的关键在于提出一种名为推理得分(Reasoning Score)的量化指标,通过测量LRMs后期层投影到词汇空间后的logits差异,有效区分浅层模式匹配与真正的深度推理。这一指标为后续的推理幻觉检测与缓解提供了基础。

链接: https://arxiv.org/abs/2505.12886
作者: Zhongxiang Sun,Qipeng Wang,Haoyu Wang,Xiao Zhang,Jun Xu
机构: Renmin University of China (中国人民大学); Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education (教育部下一代智能搜索与推荐工程研究中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 25 pages

点击查看摘要

Abstract:Large Reasoning Models (LRMs) have shown impressive capabilities in multi-step reasoning tasks. However, alongside these successes, a more deceptive form of model error has emerged–Reasoning Hallucination–where logically coherent but factually incorrect reasoning traces lead to persuasive yet faulty conclusions. Unlike traditional hallucinations, these errors are embedded within structured reasoning, making them more difficult to detect and potentially more harmful. In this work, we investigate reasoning hallucinations from a mechanistic perspective. We propose the Reasoning Score, which quantifies the depth of reasoning by measuring the divergence between logits obtained from projecting late layers of LRMs to the vocabulary space, effectively distinguishing shallow pattern-matching from genuine deep reasoning. Using this score, we conduct an in-depth analysis on the ReTruthQA dataset and identify two key reasoning hallucination patterns: early-stage fluctuation in reasoning depth and incorrect backtracking to flawed prior steps. These insights motivate our Reasoning Hallucination Detection (RHD) framework, which achieves state-of-the-art performance across multiple domains. To mitigate reasoning hallucinations, we further introduce GRPO-R, an enhanced reinforcement learning algorithm that incorporates step-level deep reasoning rewards via potential-based shaping. Our theoretical analysis establishes stronger generalization guarantees, and experiments demonstrate improved reasoning quality and reduced hallucination rates.
zh

[NLP-83] Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks? ICML25

【速读】: 该论文试图解决低秩适配(LoRA)在微调大语言模型(LLMs)过程中面对训练时攻击(如数据中毒和后门攻击)时的安全性问题。解决方案的关键在于提出一个分析框架,该框架通过建模LoRA的训练动态、利用神经切线核(NTK)简化训练过程的分析,并结合信息论建立LoRA低秩结构与其对训练时攻击易受性之间的联系,从而揭示其在安全性方面的特性。

链接: https://arxiv.org/abs/2505.12871
作者: Zi Liang,Haibo Hu,Qingqing Ye,Yaxin Xiao,Ronghua Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: To appear at ICML 25

点击查看摘要

Abstract:Low rank adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) thanks to its superb efficiency gains over previous methods. While extensive studies have examined the performance and structural properties of LoRA, its behavior upon training-time attacks remain underexplored, posing significant security risks. In this paper, we theoretically investigate the security implications of LoRA’s low-rank structure during fine-tuning, in the context of its robustness against data poisoning and backdoor attacks. We propose an analytical framework that models LoRA’s training dynamics, employs the neural tangent kernel to simplify the analysis of the training process, and applies information theory to establish connections between LoRA’s low rank structure and its vulnerability against training-time attacks. Our analysis indicates that LoRA exhibits better robustness to backdoor attacks than full fine-tuning, while becomes more vulnerable to untargeted data poisoning due to its over-simplified information geometry. Extensive experimental evaluations have corroborated our theoretical findings.
zh

[NLP-84] LEXam: Benchmarking Legal Reasoning on 340 Law Exams

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在长篇法律推理任务中的挑战,尤其是在测试时扩展技术取得进展的背景下,这一问题依然存在。其解决方案的关键在于引入LEXam,这是一个从340份法律考试中提取的新型基准数据集,覆盖了116门法学院课程,涉及多种法律主题和学位层次。该数据集包含4,886道英文和德文法律考试题,包括2,841道需要长篇开放性回答的问题和2,045道选择题,并为开放性问题提供了明确的法律推理指导,如问题识别、规则回忆或规则应用。通过这一数据集,研究者能够有效评估LLMs在结构化、多步骤法律推理任务中的表现,并采用“大语言模型作为法官”范式结合严格的人工专家验证,实现对模型生成推理步骤的一致且准确评估。

链接: https://arxiv.org/abs/2505.12864
作者: Yu Fan,Jingwei Ni,Jakob Merane,Etienne Salimbeni,Yang Tian,Yoan Hermstrüwer,Yinya Huang,Mubashara Akhtar,Florian Geering,Oliver Dreyer,Daniel Brunner,Markus Leippold,Mrinmaya Sachan,Alexander Stremitzer,Christoph Engel,Elliott Ash,Joel Niklaus
机构: EETH Zurich(ETH Zurich); ZUniversity of Zurich(University of Zurich); LUniversity of Lausanne(University of Lausanne); MMax Planck Institute for Research on Collective Goods(Max Planck Institute for Research on Collective Goods); OOmnilex(Omnilex); SUniversity of St. Gallen(University of St. Gallen); CSwiss Federal Supreme Court(Swiss Federal Supreme Court); NNiklaus.ai(Niklaus.ai)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation on both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they notably struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: this https URL
zh

[NLP-85] Re-identification of De-identified Documents with Autoregressive Infilling ACL2025

【速读】: 该论文试图解决去标识化(de-identification)方法在保护个人敏感信息方面的脆弱性问题,即如何评估现有去标识化技术在面对逆向重新识别(re-identification)时的鲁棒性。解决方案的关键在于提出一种受检索增强生成(RAG)启发的新型方法,该方法通过结合背景知识库中的文档片段,利用填充模型(infilling model)逐步推断并恢复被遮蔽的文本片段,从而实现对去标识化文本的重新识别。

链接: https://arxiv.org/abs/2505.12859
作者: Lucas Georges Gabriel Charpentier,Pierre Lison
机构: University of Oslo (奥斯陆大学); Norwegian Computing Center (挪威计算中心)
类目: Computation and Language (cs.CL)
备注: To be presented a ACL 2025, Main, Long paper

点击查看摘要

Abstract:Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.
zh

[NLP-86] GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents

【速读】: 该论文试图解决GUI agents在面对超出分布(out-of-distribution, OOD)的指令时可能出现的任务失败或安全威胁问题,其核心挑战在于传统OOD检测方法在复杂的嵌入空间和动态变化的GUI环境中表现不佳。解决方案的关键在于观察到GUI agents的分布内输入语义空间相对于中心点的距离呈现出聚类模式,并基于此提出GEM方法,该方法通过在GUI Agent提取的输入嵌入距离上拟合高斯混合模型来反映其能力边界,从而实现有效的OOD检测。

链接: https://arxiv.org/abs/2505.12842
作者: Zheng Wu,Pengzhou Cheng,Zongru Wu,Lingzhong Dong,Zhuosheng Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Graphical user interface (GUI) agents have recently emerged as an intriguing paradigm for human-computer interaction, capable of automatically executing user instructions to operate intelligent terminal devices. However, when encountering out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of agents, GUI agents may suffer task breakdowns or even pose security threats. Therefore, effective OOD detection for GUI agents is essential. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. In this work, we observe that the in-distribution input semantic space of GUI agents exhibits a clustering pattern with respect to the distance from the centroid. Based on the finding, we propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI Agent that reflect its capability boundary. Evaluated on eight datasets spanning smartphones, computers, and web browsers, our method achieves an average accuracy improvement of 23.70% over the best-performing baseline. Analysis verifies the generalization ability of our method through experiments on nine different backbones. The codes are available at this https URL.
zh

[NLP-87] he Hidden Structure – Improving Legal Document Understanding Through Explicit Text Formatting

【速读】: 该论文试图解决法律文本中输入结构对大型语言模型(Large Language Model, LLM)在法律问答任务中性能影响的问题,特别是探讨了显式输入结构和提示工程对GPT-4o和GPT-4.1模型表现的影响。解决方案的关键在于通过优化输入格式(如原始CUAD文本、GPT-4o Vision提取的文本及Markdown格式)以及调整系统提示以包含任务细节和关于结构化输入的建议,从而提升模型的精确匹配准确率,尤其是在GPT-4.1上效果显著。

链接: https://arxiv.org/abs/2505.12837
作者: Christian Braun,Alexander Lilienbeck,Daniel Mentjukov
机构: Osborne Clarke(奥斯本·克拉克律师事务所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 3 figures

点击查看摘要

Abstract:Legal contracts possess an inherent, semantically vital structure (e.g., sections, clauses) that is crucial for human comprehension but whose impact on LLM processing remains under-explored. This paper investigates the effects of explicit input text structure and prompt engineering on the performance of GPT-4o and GPT-4.1 on a legal question-answering task using an excerpt of the CUAD. We compare model exact-match accuracy across various input formats: well-structured plain-text (human-generated from CUAD), plain-text cleaned of line breaks, extracted plain-text from Azure OCR, plain-text extracted by GPT-4o Vision, and extracted (and interpreted) Markdown (MD) from GPT-4o Vision. To give an indication of the impact of possible prompt engineering, we assess the impact of shifting task instructions to the system prompt and explicitly informing the model about the structured nature of the input. Our findings reveal that GPT-4o demonstrates considerable robustness to variations in input structure, but lacks in overall performance. Conversely, GPT-4.1’s performance is markedly sensitive; poorly structured inputs yield suboptimal results (but identical with GPT-4o), while well-structured formats (original CUAD text, GPT-4o Vision text and GPT-4o MD) improve exact-match accuracy by ~20 percentage points. Optimizing the system prompt to include task details and an advisory about structured input further elevates GPT-4.1’s accuracy by an additional ~10-13 percentage points, with Markdown ultimately achieving the highest performance under these conditions (79 percentage points overall exact-match accuracy). This research empirically demonstrates that while newer models exhibit greater resilience, careful input structuring and strategic prompt design remain critical for optimizing the performance of LLMs, and can significantly affect outcomes in high-stakes legal applications.
zh

[NLP-88] FlightGPT : Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models

【速读】: 该论文旨在解决无人机视觉与语言导航(UAV VLN)中多模态融合不足、泛化能力弱以及决策可解释性差的问题。其解决方案的关键在于提出一种基于视觉-语言模型(VLMs)的新型框架FlightGPT,通过两阶段训练流程:首先利用高质量示范进行监督微调(SFT)以提升初始化和结构化推理能力;随后采用由综合奖励驱动的群体相对策略优化(GRPO)算法,增强模型的泛化性和适应性。此外,FlightGPT引入了基于思维链(CoT)的推理机制,以提高决策的可解释性。

链接: https://arxiv.org/abs/2505.12835
作者: Hengxing Cai,Jinhan Dong,Jingjun Tan,Jingcheng Deng,Sihang Li,Zhifeng Gao,Haidong Wang,Zicheng Su,Agachai Sumalee,Renxin Zhong
机构: Sun Yat-Sen University (中山大学); DP Technology; Beijing University Of Posts and Telecommunications (北京邮电大学); Chinese Academy of Sciences (中国科学院); Tongji University (同济大学); Chulalongkorn University (朱拉隆功大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection. However, existing methods often struggle with insufficient multimodal fusion, weak generalization, and poor interpretability. To address these challenges, we propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities. We design a two-stage training pipeline: first, Supervised Fine-Tuning (SFT) using high-quality demonstrations to improve initialization and structured reasoning; then, Group Relative Policy Optimization (GRPO) algorithm, guided by a composite reward that considers goal accuracy, reasoning quality, and format compliance, to enhance generalization and adaptability. Furthermore, FlightGPT introduces a Chain-of-Thought (CoT)-based reasoning mechanism to improve decision interpretability. Extensive experiments on the city-scale dataset CityNav demonstrate that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments. Our implementation is publicly available.
zh

[NLP-89] Contrastive Prompting Enhances Sentence Embeddings in LLM s through Inference-Time Steering ACL2025

【速读】: 该论文试图解决从大型语言模型(Large Language Models, LLMs)中提取句子嵌入时,现有方法依赖最后一个词的嵌入往往包含过多非关键信息(如停用词),从而限制了其语义编码能力的问题。解决方案的关键在于提出一种对比提示(Contrastive Prompting, CP)方法,通过引入一个额外的辅助提示与原提示进行对比,引导模型将句子的核心语义信息编码到嵌入中,而非非必要信息。CP是一种即插即用的推理阶段干预方法,能够与多种基于提示的方法结合,提升不同LLMs在语义文本相似性(STS)任务和下游分类任务中的性能。

链接: https://arxiv.org/abs/2505.12831
作者: Zifeng Cheng,Zhonghui Wang,Yuchen Fu,Zhiwei Jiang,Yafeng Yin,Cong Wang,Qing Gu
机构: Nanjing University (南京大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025

点击查看摘要

Abstract:Extracting sentence embeddings from large language models (LLMs) is a practical direction, as it requires neither additional data nor fine-tuning. Previous studies usually focus on prompt engineering to guide LLMs to encode the core semantic information of the sentence into the embedding of the last token. However, the last token in these methods still encodes an excess of non-essential information, such as stop words, limiting its encoding capacity. To this end, we propose a Contrastive Prompting (CP) method that introduces an extra auxiliary prompt to elicit better sentence embedding. By contrasting with the auxiliary prompt, CP can steer existing prompts to encode the core semantics of the sentence, rather than non-essential information. CP is a plug-and-play inference-time intervention method that can be combined with various prompt-based methods. Extensive experiments on Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our method can improve the performance of existing prompt-based methods across different LLMs. Our code will be released at this https URL.
zh

[NLP-90] SynDec: A Synthesize-then-Decode Approach for Arbitrary Textual Style Transfer via Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在任意文本风格迁移任务中面临的两个关键问题:一是对人工构建提示(prompt)的高度依赖,二是LLMs固有的风格偏见。其解决方案的关键在于提出一种新颖的“合成-解码”(Synthesize-then-Decode, SynDec)方法,该方法通过自动合成高质量提示并在解码过程中增强其作用来提升风格迁移效果。具体而言,该方法通过选择代表性少样本示例、进行四维风格分析以及重新排序候选提示来合成提示,并在LLM解码阶段通过最大化有无合成提示及提示与负样本之间的输出概率对比来放大TST效应。

链接: https://arxiv.org/abs/2505.12821
作者: Han Sun,Zhen Sun,Zongmin Zhang,Linzhao Jia,Wei Shao,Min Zhang
机构: East China Normal University (华东师范大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are emerging as dominant forces for textual style transfer. However, for arbitrary style transfer, LLMs face two key challenges: (1) considerable reliance on manually-constructed prompts and (2) rigid stylistic biases inherent in LLMs. In this paper, we propose a novel Synthesize-then-Decode (SynDec) approach, which automatically synthesizes high-quality prompts and amplifies their roles during decoding process. Specifically, our approach synthesizes prompts by selecting representative few-shot samples, conducting a four-dimensional style analysis, and reranking the candidates. At LLM decoding stage, the TST effect is amplified by maximizing the contrast in output probabilities between scenarios with and without the synthesized prompt, as well as between prompts and negative samples. We conduct extensive experiments and the results show that SynDec outperforms existing state-of-the-art LLM-based methods on five out of six benchmarks (e.g., achieving up to a 9% increase in accuracy for modern-to-Elizabethan English transfer). Detailed ablation studies further validate the effectiveness of SynDec.
zh

[NLP-91] PsyMem: Fine-grained psychological alignment and Explicit Memory Control for Advanced Role-Playing LLM s

【速读】: 该论文旨在解决现有基于大语言模型(Large Language Model, LLM)的角色扮演方法在建模角色内在和外在维度时存在不足,以及角色记忆模拟缺乏显式对齐导致记忆一致性受损的问题,这些问题影响了角色扮演LLM在可信社交模拟等应用中的可靠性。其解决方案的关键在于提出PsyMem框架,该框架通过引入26个细粒度的心理属性指标来补充文本描述,并通过记忆对齐训练显式指导模型生成与角色记忆一致的响应,从而实现推理过程中的动态记忆控制。

链接: https://arxiv.org/abs/2505.12814
作者: Xilong Cheng,Yunxiao Qin,Yuting Tan,Zhengnan Li,Ye Wang,Hongjiang Xiao,Yuan Zhang
机构: Communication University of China (中国传媒大学); State Key Laboratory of Media Convergence and Communication (媒体融合与传播国家重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing LLM-based role-playing methods often rely on superficial textual descriptions or simplistic metrics, inadequately modeling both intrinsic and extrinsic character dimensions. Additionally, they typically simulate character memory with implicit model knowledge or basic retrieval augment generation without explicit memory alignment, compromising memory consistency. The two issues weaken reliability of role-playing LLMs in several applications, such as trustworthy social simulation. To address these limitations, we propose PsyMem, a novel framework integrating fine-grained psychological attributes and explicit memory control for role-playing. PsyMem supplements textual descriptions with 26 psychological indicators to detailed model character. Additionally, PsyMem implements memory alignment training, explicitly trains the model to align character’s response with memory, thereby enabling dynamic memory-controlled responding during inference. By training Qwen2.5-7B-Instruct on our specially designed dataset (including 5,414 characters and 38,962 dialogues extracted from novels), the resulting model, termed as PsyMem-Qwen, outperforms baseline models in role-playing, achieving the best performance in human-likeness and character fidelity.
zh

[NLP-92] Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)评估中存在的一系列问题,包括传统闭合式问答基准测试因模型性能饱和而失效,以及依赖人工评判的众包排行榜成本高、效率低。其解决方案的关键在于提出了一种名为Decentralized Arena(dearena)的完全自动化的框架,通过利用所有LLMs的集体智能来相互评估,从而减少单一模型评判者的偏差,并通过两种核心组件实现高效扩展:一是基于粗到精排名算法的快速增量插入新模型机制,具有次二次复杂度;二是用于构建新评估维度的自动题目选择策略。

链接: https://arxiv.org/abs/2505.12808
作者: Yanbin Yin,Kun Zhou,Zhen Wang,Xiangdong Zhang,Yifei Shao,Shibo Hao,Yi Gu,Jieyuan Liu,Somanshu Singla,Tianyang Liu,Eric P. Xing,Zhengzhong Liu,Haojian Jin,Zhiting Hu
机构: University of California, San Diego (加利福尼亚大学圣地亚哥分校); Mohamed bin Zayed University of AI (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, ongoing work

点击查看摘要

Abstract:The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices nowadays face fundamental trade-offs: closed-ended question-based benchmarks (eg MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (eg Chatbot Arena) rely on costly and slow human judges. Recently, automated methods (eg LLM-as-a-judge) shed light on the scalability, but risk bias by relying on one or a few “authority” models. To tackle these issues, we propose Decentralized Arena (dearena), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It mitigates single-model judge bias by democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for the construction of new evaluation dimensions. Across extensive experiments across 66 LLMs, dearena attains up to 97% correlation with human judgements, while significantly reducing the cost. Our code and data will be publicly released on this https URL.
zh

[NLP-93] EAVIT: Efficient and Accurate Human Value Identification from Text data via LLM s

【速读】: 该论文旨在解决在长文本中高效且准确识别人类价值观的问题,尤其是针对在线大型语言模型(LLM)在处理长上下文时性能下降及计算成本高的问题。其解决方案的关键在于提出EAVIT框架,该框架结合了本地可微调模型与在线黑盒LLM的优势,通过一个小型本地语言模型(价值检测器)生成初步价值估计,并据此构建简洁的输入提示,从而提升在线LLM的价值识别准确性,同时显著减少输入令牌数量。

链接: https://arxiv.org/abs/2505.12792
作者: Wenhao Zhu,Yuhang Xie,Guojie Song,Xin Zhang
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid evolution of large language models (LLMs) has revolutionized various fields, including the identification and discovery of human values within text data. While traditional NLP models, such as BERT, have been employed for this task, their ability to represent textual data is significantly outperformed by emerging LLMs like GPTs. However, the performance of online LLMs often degrades when handling long contexts required for value identification, which also incurs substantial computational costs. To address these challenges, we propose EAVIT, an efficient and accurate framework for human value identification that combines the strengths of both locally fine-tunable and online black-box LLMs. Our framework employs a value detector - a small, local language model - to generate initial value estimations. These estimations are then used to construct concise input prompts for online LLMs, enabling accurate final value identification. To train the value detector, we introduce explanation-based training and data generation techniques specifically tailored for value identification, alongside sampling strategies to optimize the brevity of LLM input prompts. Our approach effectively reduces the number of input tokens by up to 1/6 compared to directly querying online LLMs, while consistently outperforming traditional NLP methods and other LLM-based strategies.
zh

[NLP-94] A Token is Worth over 1000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

【速读】: 该论文旨在解决训练高性能小型语言模型(Small Language Models, SLMs)时存在的三个关键挑战:信息丢失、表征对齐效率低下以及对信息激活(尤其是前馈网络Feed-Forward Networks, FFNs)的利用不足。其解决方案的关键在于提出一种名为低秩克隆(Low-Rank Clone, LRC)的高效预训练方法,通过联合训练低秩投影矩阵实现教师模型权重的压缩(软剪枝)和学生模型激活(包括FFN信号)与教师模型的对齐,从而在不依赖显式对齐模块的情况下最大化知识迁移效果。

链接: https://arxiv.org/abs/2505.12781
作者: Jitai Hao,Qiang Huang,Hao Liu,Xinyan Xiao,Zhaochun Ren,Jun Yu
机构: Harbin Institute of Technology, Shenzhen; Baidu Inc.; Leiden University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens–while using only 20B tokens, achieving over 1,000x training efficiency. Our codes and model checkpoints are available at this https URL and this https URL.
zh

[NLP-95] ReEx-SQL: Reasoning with Execution-Aware Reinforcement Learning for Text-to-SQL

【速读】: 该论文旨在解决文本到SQL(Text-to-SQL)任务中执行反馈未能有效融入生成过程的问题,现有方法将执行反馈仅视为事后修正信号,无法在生成过程中动态调整推理,从而影响查询的准确性和鲁棒性。解决方案的关键在于提出ReEx-SQL框架,该框架通过引入执行感知的推理范式,将中间SQL执行结果嵌入推理路径,实现上下文敏感的修正,并结合结构化提示、分步滚动策略和复合奖励函数,使模型能够动态适应执行反馈。此外,采用基于树的解码策略提升了推理效率与性能。

链接: https://arxiv.org/abs/2505.12768
作者: Yaxun Dai(1),Wenxuan Xie(3),Xialie Zhuang(4),Tianyu Yang(5),Yiying Yang(2),Haiqin Yang(6),Yuhang Zhao(2),Pingfu Chao(1),Wenhao Jiang(2) ((1) Soochow University, (2) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), (3) South China University of Technology, (4) University of Chinese Academy of Sciences, (5) Alibaba DAMO Academy, (6) International Digital Economy Academy (IDEA))
机构: Soochow University (苏州大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳)); South China University of Technology (华南理工大学); University of Chinese Academy of Sciences (中国科学院大学); Alibaba DAMO Academy (阿里巴巴达摩院); International Digital Economy Academy (IDEA), China (中国数字经济研究院(IDEA))
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In Text-to-SQL, execution feedback is essential for guiding large language models (LLMs) to reason accurately and generate reliable SQL queries. However, existing methods treat execution feedback solely as a post-hoc signal for correction or selection, failing to integrate it into the generation process. This limitation hinders their ability to address reasoning errors as they occur, ultimately reducing query accuracy and robustness. To address this issue, we propose ReEx-SQL (Reasoning with Execution-Aware Reinforcement Learning), a framework for Text-to-SQL that enables models to interact with the database during decoding and dynamically adjust their reasoning based on execution feedback. ReEx-SQL introduces an execution-aware reasoning paradigm that interleaves intermediate SQL execution into reasoning paths, facilitating context-sensitive revisions. It achieves this through structured prompts with markup tags and a stepwise rollout strategy that integrates execution feedback into each stage of generation. To supervise policy learning, we develop a composite reward function that includes an exploration reward, explicitly encouraging effective database interaction. Additionally, ReEx-SQL adopts a tree-based decoding strategy to support exploratory reasoning, enabling dynamic expansion of alternative reasoning paths. Notably, ReEx-SQL achieves 88.8% on Spider and 64.9% on BIRD at the 7B scale, surpassing the standard reasoning baseline by 2.7% and 2.6%, respectively. It also shows robustness, achieving 85.2% on Spider-Realistic with leading performance. In addition, its tree-structured decoding improves efficiency and performance over linear decoding, reducing inference time by 51.9% on the BIRD development set.
zh

[NLP-96] Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization ACL2025

【速读】: 该论文试图解决当前奖励模型(Reward Models, RMs)评估基准与优化策略性能之间相关性较弱的问题,即现有基准无法准确评估RMs的真实能力。其解决方案的关键在于通过奖励过优化(reward overoptimization)现象来设计评估方法,强调在构建可靠基准时需关注选择与拒绝响应之间的差异、多角度比较以及响应来源的多样性,同时指出应将过优化程度作为工具而非最终目标。

链接: https://arxiv.org/abs/2505.12763
作者: Sunghwan Kim,Dongjin Kang,Taeyoon Kwon,Hyungjoo Chae,Dongha Lee,Jinyoung Yeo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ACL 2025

点击查看摘要

Abstract:Reward models (RMs) play a crucial role in reinforcement learning from human feedback (RLHF), aligning model behavior with human preferences. However, existing benchmarks for reward models show a weak correlation with the performance of optimized policies, suggesting that they fail to accurately assess the true capabilities of RMs. To bridge this gap, we explore several evaluation designs through the lens of reward overoptimization\textemdash a phenomenon that captures both how well the reward model aligns with human preferences and the dynamics of the learning signal it provides to the policy. The results highlight three key findings on how to construct a reliable benchmark: (i) it is important to minimize differences between chosen and rejected responses beyond correctness, (ii) evaluating reward models requires multiple comparisons across a wide range of chosen and rejected responses, and (iii) given that reward models encounter responses with diverse representations, responses should be sourced from a variety of models. However, we also observe that a extremely high correlation with degree of overoptimization leads to comparatively lower correlation with certain downstream performance. Thus, when designing a benchmark, it is desirable to use the degree of overoptimization as a useful tool, rather than the end goal.
zh

[NLP-97] What is Stigma Attributed to? A Theory-Grounded Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma ACL2025

【速读】: 该论文试图解决心理健康污名(mental-health stigma)在社会中广泛存在且阻碍治疗和康复的问题,以及现有用于训练神经模型进行精细分类的资源有限、主要依赖社交媒体或合成数据而缺乏理论基础的困境。解决方案的关键在于构建一个由专家标注、理论指导的人机对话语料库,包含4,141个来自684名具有明确社会文化背景参与者的对话片段,从而为计算检测、中和和应对心理健康污名的研究提供实证基础。

链接: https://arxiv.org/abs/2505.12727
作者: Han Meng,Yancan Chen,Yunan Li,Yitian Yang,Jungup Lee,Renwen Zhang,Yi-Chieh Lee
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted to ACL 2025 Main Conference, 35 Pages

点击查看摘要

Abstract:Mental-health stigma remains a pervasive social problem that hampers treatment-seeking and recovery. Existing resources for training neural models to finely classify such stigma are limited, relying primarily on social-media or synthetic data without theoretical underpinnings. To remedy this gap, we present an expert-annotated, theory-informed corpus of human-chatbot interviews, comprising 4,141 snippets from 684 participants with documented socio-cultural backgrounds. Our experiments benchmark state-of-the-art neural models and empirically unpack the challenges of stigma detection. This dataset can facilitate research on computationally detecting, neutralizing, and counteracting mental-health stigma.
zh

[NLP-98] On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码生成任务中,不同编程语言之间性能差异显著的问题。其关键解决方案是通过代码翻译任务训练LLMs,以促进跨多种编程语言的编码能力迁移,并引入一种新的强化学习(Reinforcement Learning, RL)框架——OORL,该框架结合了在线策略和离线策略。在OORL中,基于规则的奖励信号由单元测试引导,同时提出了一种新的偏好优化方法——Group Equivalent Preference Optimization (GEPO),通过中间表示(Intermediate Representations, IRs)组进行训练,使LLMs能够区分等价与非等价的IRs,并利用组内IRs之间的相互等价性信号,从而更精确地捕捉代码功能的细微差别。

链接: https://arxiv.org/abs/2505.12723
作者: Haoyuan Wu,Rui Ming,Jilong Gao,Hangyu Zhao,Xueyi Chen,Yikai Yang,Haisheng Zheng,Zhuolun He,Bei Yu
机构: The Chinese University of Hong Kong (香港中文大学); Shanghai Artificial Intelligent Laboratory (上海人工智能实验室); ChatEDA Tech (ChatEDA科技)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL for training, a novel reinforcement learning (RL) framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. Complementing this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM using intermediate representations (IRs) groups. LLMs can be guided to discern IRs equivalent to the source code from inequivalent ones, while also utilizing signals about the mutual equivalence between IRs within the group. This process allows LLMs to capture nuanced aspects of code functionality. By employing OORL for training with code translation tasks, LLMs improve their recognition of code functionality and their understanding of the relationships between code implemented in different languages. Extensive experiments demonstrate that our OORL for LLMs training with code translation tasks achieves significant performance improvements on code benchmarks across multiple programming languages.
zh

[NLP-99] Automated Bias Assessment in AI-Generated Educational Content Using CEAT Framework

【速读】: 该论文试图解决生成式人工智能(Generative AI)在教育内容生成中可能存在的偏见问题,尤其是性别、种族或国家刻板印象等伦理和教育方面的担忧。其解决方案的关键在于提出一种自动化偏见评估方法,该方法结合了上下文嵌入关联测试与提示工程词提取技术,并集成于检索增强生成框架中,从而实现对AI生成教育材料中偏见的可靠、一致评估。

链接: https://arxiv.org/abs/2505.12718
作者: Jingyang Peng,Wenyuan Shen,Jiarui Rao,Jionghao Lin
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted by AIED 2025: Late-Breaking Results (LBR) Track

点击查看摘要

Abstract:Recent advances in Generative Artificial Intelligence (GenAI) have transformed educational content creation, particularly in developing tutor training materials. However, biases embedded in AI-generated content–such as gender, racial, or national stereotypes–raise significant ethical and educational concerns. Despite the growing use of GenAI, systematic methods for detecting and evaluating such biases in educational materials remain limited. This study proposes an automated bias assessment approach that integrates the Contextualized Embedding Association Test with a prompt-engineered word extraction method within a Retrieval-Augmented Generation framework. We applied this method to AI-generated texts used in tutor training lessons. Results show a high alignment between the automated and manually curated word sets, with a Pearson correlation coefficient of r = 0.993, indicating reliable and consistent bias assessment. Our method reduces human subjectivity and enhances fairness, scalability, and reproducibility in auditing GenAI-produced educational content.
zh

[NLP-100] oTRL: Unlock LLM Tree-of-Thoughts Reasoning Potential through Puzzles Solving

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在进行长链式思维(Chain-of-Thought, CoT)推理时存在的输出冗长和效率低下问题,以及传统CoT方法中缺乏系统性逻辑推理的问题。其解决方案的关键在于引入树式思维(Tree-of-Thoughts, ToT),通过将推理过程建模为树状结构,实现多条推理路径的并行生成与评估,从而提升推理性能并降低token成本。论文进一步提出了一种基于规则奖励的在线策略强化学习框架(Tree-of-Thoughts Reinforcement Learning, ToTRL),并在拼图游戏环境中训练LLMs以增强其ToT推理能力。

链接: https://arxiv.org/abs/2505.12717
作者: Haoyuan Wu,Xueyi Chen,Rui Ming,Jilong Gao,Shoubo Hu,Zhuolun He,Bei Yu
机构: The Chinese University of Hong Kong (香港中文大学); Noah’s Ark Lab, Huawei (华为诺亚方舟实验室); ChatEDA Tech (ChatEDA科技)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate significant reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning presents limitations, primarily verbose outputs due to excessive introspection. The reasoning process in these LLMs often appears to follow a trial-and-error methodology rather than a systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as an exploration within a tree structure. This reasoning structure facilitates the parallel generation and evaluation of multiple reasoning branches, allowing for the active identification, assessment, and pruning of unproductive paths. This process can potentially lead to improved performance and reduced token costs. Building upon the long CoT capability of LLMs, we introduce tree-of-thoughts RL (ToTRL), a novel on-policy RL framework with a rule-based reward. ToTRL is designed to guide LLMs in developing the parallel ToT strategy based on the sequential CoT strategy. Furthermore, we employ LLMs as players in a puzzle game during the ToTRL training process. Solving puzzle games inherently necessitates exploring interdependent choices and managing multiple constraints, which requires the construction and exploration of a thought tree, providing challenging tasks for cultivating the ToT reasoning capability. Our empirical evaluations demonstrate that our ToTQwen3-8B model, trained with our ToTRL, achieves significant improvement in performance and reasoning efficiency on complex reasoning tasks.
zh

[NLP-101] Shadow-FT: Tuning Instruct via Base

【速读】: 该论文试图解决在对生成式 AI (Generative AI) 的指令调优模型(INSTRUCT models)进行微调时,通常只能获得有限的性能提升甚至出现性能退化的问题。解决方案的关键在于利用对应的基座模型(BASE models)进行微调,并将学习到的权重更新直接迁移到 INSTRUCT 模型中,从而实现无需引入额外参数的高效微调方法,即 Shadow-FT 框架。

链接: https://arxiv.org/abs/2505.12716
作者: Taiqiang Wu,Runming Yang,Jiayi Li,Pengfei Hu,Ngai Wong,Yujiu Yang
机构: The University of Hong Kong(香港大学); Tsinghua University(清华大学); Tencent(腾讯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the INSTRUCT (i.e., instruction tuned) models often leads to marginal improvements and even performance degeneration. Notably, paired BASE models, the foundation for these INSTRUCT variants, contain highly similar weight values (i.e., less than 2% on average for Llama 3.1 8B). Therefore, we propose a novel Shadow-FT framework to tune the INSTRUCT models by leveraging the corresponding BASE models. The key insight is to fine-tune the BASE model, and then directly graft the learned weight updates to the INSTRUCT model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Codes and weights are available at \hrefthis https URLGithub.
zh

[NLP-102] Bullying the Machine: How Personas Increase LLM Vulnerability

【速读】: 该论文试图解决生成式 AI (Generative AI) 在角色扮演(persona)条件下可能引入的安全风险问题,特别是在遭受网络霸凌(bullying)等心理操控时模型输出的不安全性。解决方案的关键在于构建一个模拟框架,通过让攻击者 LLM 使用基于心理学的霸凌策略与采用大五人格特质(Big Five personality traits)角色的受害者 LLM 进行交互,从而评估不同角色配置对模型安全性的潜在影响。研究发现,某些角色特征(如降低宜人性或尽责性)会显著增加模型对不当输出的敏感性,尤其在涉及情感或讽刺操控的霸凌策略下更为明显。

链接: https://arxiv.org/abs/2505.12692
作者: Ziwei Xu,Udit Sanghi,Mohan Kankanhalli
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in interactions where they are prompted to adopt personas. This paper investigates whether such persona conditioning affects model safety under bullying, an adversarial manipulation that applies psychological pressures in order to force the victim to comply to the attacker. We introduce a simulation framework in which an attacker LLM engages a victim LLM using psychologically grounded bullying tactics, while the victim adopts personas aligned with the Big Five personality traits. Experiments using multiple open-source LLMs and a wide range of adversarial goals reveal that certain persona configurations – such as weakened agreeableness or conscientiousness – significantly increase victim’s susceptibility to unsafe outputs. Bullying tactics involving emotional or sarcastic manipulation, such as gaslighting and ridicule, are particularly effective. These findings suggest that persona-driven interaction introduces a novel vector for safety risks in LLMs and highlight the need for persona-aware safety evaluation and alignment strategies.
zh

[NLP-103] Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities

【速读】: 该论文试图解决当前基于大语言模型(Large Language Model, LLM)的定理自动证明工具在处理数学不等式时,是否具备类似人类的数学结构理解能力的问题。其解决方案的关键在于构建一个名为Ineq-Comp的基准测试集,该基准通过系统化的变换方法(如变量复制、代数重写和多步骤组合)生成基础不等式,以评估证明工具在处理具有人类直觉的组合性问题时的表现。研究结果表明,尽管这些工具在语法正确性上表现良好,但在组合推理方面存在显著不足,揭示了当前AI证明工具与人类数学直觉之间的持续差距。

链接: https://arxiv.org/abs/2505.12680
作者: Haoyu Zhao,Yihan Geng,Shange Tang,Yong Lin,Bohan Lyu,Hongzhou Lin,Chi Jin,Sanjeev Arora
机构: Princeton Language and Intelligence, Princeton University (普林斯顿语言与智能中心,普林斯顿大学); Peking University (北京大学); Tsinghua University (清华大学); Amazon (亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 27 pages

点击查看摘要

Abstract:LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question through the lens of mathematical inequalities – a fundamental tool across many domains. While modern provers can solve basic inequalities, we probe their ability to handle human-intuitive compositionality. We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers – including Goedel, STP, and Kimina-7B – struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness – possibly because it is trained to decompose the problems into sub-problems – but still suffers a 20% performance drop (pass@32). Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition.
zh

[NLP-104] Know3-RAG : A Knowledge-aware RAG Framework with Adaptive Retrieval Generation and Filtering

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自然语言生成过程中产生的幻觉或缺乏依据的内容问题,以及现有检索增强生成(Retrieval-Augmented Generation, RAG)系统在适应性控制和参考文本准确性方面的局限性。其解决方案的关键在于提出Know3-RAG框架,该框架通过引入结构化知识图谱(Knowledge Graphs, KGs)来指导RAG过程的三个核心阶段:检索、生成和过滤,具体包括基于KG嵌入的知识感知自适应检索模块、基于KG实体增强的参考生成策略以及基于知识驱动的参考过滤机制,从而提升生成内容的语义一致性和事实准确性。

链接: https://arxiv.org/abs/2505.12662
作者: Xukai Liu,Ye Liu,Shiwen Wu,Yanghai Zhang,Yihao Yuan,Kai Zhang,Qi Liu
机构: School of Computer Science and Technology, University of Science and Technology of China (中国科学技术大学计算机科学与技术学院); State Key Laboratory of Cognitive Intelligence, Hefei, Anhui, China (安徽省合肥市认知智能国家重点实验室); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have led to impressive progress in natural language generation, yet their tendency to produce hallucinated or unsubstantiated content remains a critical concern. To improve factual reliability, Retrieval-Augmented Generation (RAG) integrates external knowledge during inference. However, existing RAG systems face two major limitations: (1) unreliable adaptive control due to limited external knowledge supervision, and (2) hallucinations caused by inaccurate or irrelevant references. To address these issues, we propose Know3-RAG, a knowledge-aware RAG framework that leverages structured knowledge from knowledge graphs (KGs) to guide three core stages of the RAG process, including retrieval, generation, and filtering. Specifically, we introduce a knowledge-aware adaptive retrieval module that employs KG embedding to assess the confidence of the generated answer and determine retrieval necessity, a knowledge-enhanced reference generation strategy that enriches queries with KG-derived entities to improve generated reference relevance, and a knowledge-driven reference filtering mechanism that ensures semantic alignment and factual accuracy of references. Experiments on multiple open-domain QA benchmarks demonstrate that Know3-RAG consistently outperforms strong baselines, significantly reducing hallucinations and enhancing answer reliability.
zh

[NLP-105] Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic Acoustic and Visual Signals ACL2025

【速读】: 该论文旨在解决在人机对话中利用多模态信号(语言、语音和视觉)预测轮流说话(turn-taking)和反馈行为(backchannel)的不足。其关键解决方案是提出一种自动数据收集管道,用于收集和标注超过210小时的人类对话视频,并构建了一个包含超过1.5百万词及对应轮流说话和反馈行为标注的多模态面对面(Multi-Modal Face-to-Face, MM-F2F)对话数据集。此外,还设计了一个端到端框架,能够从多模态信号中预测轮流说话和反馈行为的概率,强调模态间的相互关系并支持文本、音频和视频输入的任意组合,从而适应多种现实场景。

链接: https://arxiv.org/abs/2505.12654
作者: Yuxin Lin,Yinglin Zheng,Ming Zeng,Wangzheng Shi
机构: Xiamen University (厦门大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepected by ACL 2025

点击查看摘要

Abstract:This paper addresses the gap in predicting turn-taking and backchannel actions in human-machine conversations using multi-modal signals (linguistic, acoustic, and visual). To overcome the limitation of existing datasets, we propose an automatic data collection pipeline that allows us to collect and annotate over 210 hours of human conversation videos. From this, we construct a Multi-Modal Face-to-Face (MM-F2F) human conversation dataset, including over 1.5M words and corresponding turn-taking and backchannel annotations from approximately 20M frames. Additionally, we present an end-to-end framework that predicts the probability of turn-taking and backchannel actions from multi-modal signals. The proposed model emphasizes the interrelation between modalities and supports any combination of text, audio, and video inputs, making it adaptable to a variety of realistic scenarios. Our experiments show that our approach achieves state-of-the-art performance on turn-taking and backchannel prediction tasks, achieving a 10% increase in F1-score on turn-taking and a 33% increase on backchannel prediction. Our dataset and code are publicly available online to ease of subsequent research.
zh

[NLP-106] Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing ACL2025

【速读】: 该论文试图解决语言模型在知识编辑过程中出现的“表面编辑”(superficial editing)问题,即尽管编辑算法在传统指标上表现优异,但模型仍可能生成原始知识。解决方案的关键在于识别并验证导致该问题的两个核心因素:早期层中最后一个主体位置的残差流以及后期层中的特定注意力模块。研究进一步发现,后期层中的某些注意力头及其输出矩阵中的特定左奇异向量与表面编辑存在因果关系,这一发现为理解并改进知识编辑方法提供了重要依据。

链接: https://arxiv.org/abs/2505.12636
作者: Jiakuan Xie,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (人工智能学院,中国科学院大学); The Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences (认知与复杂系统决策智能实验室,中国科学院自动化研究所)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025 main

点击查看摘要

Abstract:Knowledge editing, which aims to update the knowledge encoded in language models, can be deceptive. Despite the fact that many existing knowledge editing algorithms achieve near-perfect performance on conventional metrics, the models edited by them are still prone to generating original knowledge. This paper introduces the concept of “superficial editing” to describe this phenomenon. Our comprehensive evaluation reveals that this issue presents a significant challenge to existing algorithms. Through systematic investigation, we identify and validate two key factors contributing to this issue: (1) the residual stream at the last subject position in earlier layers and (2) specific attention modules in later layers. Notably, certain attention heads in later layers, along with specific left singular vectors in their output matrices, encapsulate the original knowledge and exhibit a causal relationship with superficial editing. Furthermore, we extend our analysis to the task of superficial unlearning, where we observe consistent patterns in the behavior of specific attention heads and their corresponding left singular vectors, thereby demonstrating the robustness and broader applicability of our methodology and conclusions. Our code is available here.
zh

[NLP-107] Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents CVPR2025

【速读】: 该论文旨在解决移动操作系统(Mobile OS)导航任务中模型跨平台泛化能力不足的问题。其关键解决方案是引入MONDAY数据集,该数据集包含来自20K条教学视频的313K标注帧,覆盖多种真实场景下的移动操作系统导航操作。此外,论文提出了一种自动化框架,通过OCR-based场景检测、近完美的UI元素检测以及多步骤动作识别,实现无需人工标注的大规模任务数据集构建,从而支持移动平台演进中的持续数据扩展。

链接: https://arxiv.org/abs/2505.12632
作者: Yunseok Jang,Yeda Song,Sungryull Sohn,Lajanugen Logeswaran,Tiange Luo,Dong-Ki Kim,Kyunghoon Bae,Honglak Lee
机构: University of Michigan (密歇根大学); LG AI Research (LG人工智能研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: CVPR 2025

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset of 313K annotated frames from 20K instructional videos capturing diverse real-world mobile OS navigation across multiple platforms. Models that include MONDAY in their pre-training phases demonstrate robust cross-platform generalization capabilities, consistently outperforming models trained on existing single OS datasets while achieving an average performance gain of 18.11%p on an unseen mobile OS platform. To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. Our framework comprises robust OCR-based scene detection (95.04% F1score), near-perfect UI element detection (99.87% hit ratio), and novel multi-step action identification to extract reliable action sequences across diverse interface configurations. We contribute both the MONDAY dataset and our automated collection framework to facilitate future research in mobile OS navigation.
zh

[NLP-108] Enhancing Latent Computation in Transformers with Latent Tokens

【速读】: 该论文试图解决如何通过引入辅助机制来提升大规模语言模型(Large Language Models, LLMs)的性能问题,特别是在分布外泛化场景下的适应能力。其解决方案的关键在于提出了一种轻量级的方法——潜在令牌(latent tokens),这些虚拟令牌在自然语言中可能不具备可解释性,但能够通过注意力机制引导基于Transformer的LLMs的自回归解码过程。该方法能够在参数高效训练的前提下,无缝集成到预训练的Transformer模型中,并在推理阶段灵活应用,同时对现有Transformer架构的复杂度影响极小。

链接: https://arxiv.org/abs/2505.12629
作者: Yuchang Sun,Yanxi Chen,Yaliang Li,Bolin Ding
机构: Alibaba Group (阿里巴巴集团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Augmenting large language models (LLMs) with auxiliary tokens has emerged as a promising strategy for enhancing model performance. In this work, we introduce a lightweight method termed latent tokens; these are dummy tokens that may be non-interpretable in natural language but steer the autoregressive decoding process of a Transformer-based LLM via the attention mechanism. The proposed latent tokens can be seamlessly integrated with a pre-trained Transformer, trained in a parameter-efficient manner, and applied flexibly at inference time, while adding minimal complexity overhead to the existing infrastructure of standard Transformers. We propose several hypotheses about the underlying mechanisms of latent tokens and design synthetic tasks accordingly to verify them. Numerical results confirm that the proposed method noticeably outperforms the baselines, particularly in the out-of-distribution generalization scenarios, highlighting its potential in improving the adaptability of LLMs.
zh

[NLP-109] R1dacted: Investigating Local Censorship in DeepSeek s R1 Language Model

【速读】: 该论文试图解决生成式 AI (Generative AI) 在特定政治敏感话题上表现出的审查行为问题,这种行为不同于传统安全机制,更接近于内容审查。解决方案的关键在于构建一个大规模、高度筛选的被 R1 模型屏蔽的提示语集,并对其审查模式进行系统分析,包括其一致性、触发条件及跨主题、提示语结构和上下文的变化。此外,研究还探讨了审查行为在其他语言和基于 R1 的蒸馏模型中的可迁移性,并提出规避或移除此类审查的技术手段。

链接: https://arxiv.org/abs/2505.12625
作者: Ali Naseh,Harsh Chaudhari,Jaechul Roh,Mingshi Wu,Alina Oprea,Amir Houmansadr
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Northeastern University (东北大学); GFW Report (防火墙报告)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:DeepSeek recently released R1, a high-performing large language model (LLM) optimized for reasoning tasks. Despite its efficient training pipeline, R1 achieves competitive performance, even surpassing leading reasoning models like OpenAI’s o1 on several benchmarks. However, emerging reports suggest that R1 refuses to answer certain prompts related to politically sensitive topics in China. While existing LLMs often implement safeguards to avoid generating harmful or offensive outputs, R1 represents a notable shift - exhibiting censorship-like behavior on politically charged queries. In this paper, we investigate this phenomenon by first introducing a large-scale set of heavily curated prompts that get censored by R1, covering a range of politically sensitive topics, but are not censored by other models. We then conduct a comprehensive analysis of R1’s censorship patterns, examining their consistency, triggers, and variations across topics, prompt phrasing, and context. Beyond English-language queries, we explore censorship behavior in other languages. We also investigate the transferability of censorship to models distilled from the R1 language model. Finally, we propose techniques for bypassing or removing this censorship. Our findings reveal possible additional censorship integration likely shaped by design choices during training or alignment, raising concerns about transparency, bias, and governance in language model deployment.
zh

[NLP-110] hink Before You Attribute: Improving the Performance of LLM s Attribution Systems

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在科学领域应用中因缺乏可信赖、可验证输出而面临的广泛采纳障碍,特别是由于缺乏可靠的来源归属或错误的归因问题。解决方案的关键在于提出一种基于句子级别的预归因步骤,用于增强检索生成(Retrieve-Augmented Generation, RAG)系统,该步骤将句子分为三类:不可归因、可归因于单一引用和可归因于多个引用。通过在归因前对句子进行分类,可以为不同类型的句子选择适当的归因方法或直接跳过归因,从而提高归因系统的准确性并降低计算复杂度。

链接: https://arxiv.org/abs/2505.12621
作者: João Eduardo Batista,Emil Vatai,Mohamed Wahib
机构: RIKEN-CCS(理化学研究所计算科学中心)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 22 pages (9 pages of content, 4 pages of references, 9 pages of supplementary material), 7 figures, 10 tables

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly applied in various science domains, yet their broader adoption remains constrained by a critical challenge: the lack of trustworthy, verifiable outputs. Current LLMs often generate answers without reliable source attribution, or worse, with incorrect attributions, posing a barrier to their use in scientific and high-stakes settings, where traceability and accountability are non-negotiable. To be reliable, attribution systems need high accuracy and retrieve data with short lengths, i.e., attribute to a sentence within a document rather than a whole document. We propose a sentence-level pre-attribution step for Retrieve-Augmented Generation (RAG) systems that classify sentences into three categories: not attributable, attributable to a single quote, and attributable to multiple quotes. By separating sentences before attribution, a proper attribution method can be selected for the type of sentence, or the attribution can be skipped altogether. Our results indicate that classifiers are well-suited for this task. In this work, we propose a pre-attribution step to reduce the computational complexity of attribution, provide a clean version of the HAGRID dataset, and provide an end-to-end attribution system that works out of the box.
zh

[NLP-111] Duluth at SemEval-2025 Task 7: TF-IDF with Optimized Vector Dimensions for Multilingual Fact-Checked Claim Retrieval SEMEVAL-2025

【速读】: 该论文旨在解决多语言和跨语言事实核查声明检索(Multilingual and Crosslingual Fact-Checked Claim Retrieval)问题,其核心挑战在于从多种语言的文本中高效准确地检索出与给定声明相关的已验证事实。解决方案的关键在于采用基于TF-IDF(Term Frequency-Inverse Document Frequency)的检索系统,并通过实验优化向量维度和分词策略。研究发现,使用词级分词并设置15,000个特征的词汇表,在开发集和测试集上分别取得了0.78和0.69的平均success@10得分,表明传统方法在资源受限环境下仍具有竞争力。

链接: https://arxiv.org/abs/2505.12616
作者: Shujauddin Syed,Ted Pedersen
机构: University of Minnesota (明尼苏达大学)
类目: Computation and Language (cs.CL)
备注: SemEval-2025

点击查看摘要

Abstract:This paper presents the Duluth approach to the SemEval-2025 Task 7 on Multilingual and Crosslingual Fact-Checked Claim Retrieval. We implemented a TF-IDF-based retrieval system with experimentation on vector dimensions and tokenization strategies. Our best-performing configuration used word-level tokenization with a vocabulary size of 15,000 features, achieving an average success@10 score of 0.78 on the development set and 0.69 on the test set across ten languages. Our system showed stronger performance on higher-resource languages but still lagged significantly behind the top-ranked system, which achieved 0.96 average success@10. Our findings suggest that though advanced neural architectures are increasingly dominant in multilingual retrieval tasks, properly optimized traditional methods like TF-IDF remain competitive baselines, especially in limited compute resource scenarios.
zh

[NLP-112] AD-AGENT : A Multi-agent Framework for End-to-end Anomaly Detection

【速读】: 该论文试图解决非专家用户在使用多样化的数据模态和日益增多的专业异常检测(Anomaly Detection, AD)库时所面临的挑战,这些问题包括缺乏特定库的知识和高级编程技能。解决方案的关键在于提出AD-AGENT,这是一个基于大语言模型(LLM)的多智能体框架,能够将自然语言指令转化为可执行的AD流水线,通过协调专门的智能体实现意图解析、数据准备、库与模型选择、文档挖掘以及迭代代码生成与调试,从而整合如PyOD、PyGOD和TSLib等流行的AD库到统一的工作流程中。

链接: https://arxiv.org/abs/2505.12594
作者: Tiankai Yang,Junjun Liu,Wingchun Siu,Jiahang Wang,Zhuangzhuang Qian,Chanjuan Song,Cheng Cheng,Xiyang Hu,Yue Zhao
机构: University of Southern California (南加州大学); Carnegie Mellon University (卡内基梅隆大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Anomaly detection (AD) is essential in areas such as fraud detection, network monitoring, and scientific research. However, the diversity of data modalities and the increasing number of specialized AD libraries pose challenges for non-expert users who lack in-depth library-specific knowledge and advanced programming skills. To tackle this, we present AD-AGENT, an LLM-driven multi-agent framework that turns natural-language instructions into fully executable AD pipelines. AD-AGENT coordinates specialized agents for intent parsing, data preparation, library and model selection, documentation mining, and iterative code generation and debugging. Using a shared short-term workspace and a long-term cache, the agents integrate popular AD libraries like PyOD, PyGOD, and TSLib into a unified workflow. Experiments demonstrate that AD-AGENT produces reliable scripts and recommends competitive models across libraries. The system is open-sourced to support further research and practical applications in AD.
zh

[NLP-113] PromptPrism: A Linguistically-Inspired Taxonomy for Prompts

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中提示(prompt)分析缺乏系统性框架的问题。现有研究在理解提示结构与组件方面存在不足,限制了对LLM行为的分析和性能优化。解决方案的关键在于提出PromptPrism,这是一个受语言学启发的分类体系,能够从功能结构、语义成分和句法模式三个层次对提示进行系统分析,从而为提示的优化、表征和分析提供基础。

链接: https://arxiv.org/abs/2505.12592
作者: Sullam Jeoung,Yueyan Chen,Yi Zhang,Shuai Wang,Haibo Ding,Lin Lee Cheong
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompts are the interface for eliciting the capabilities of large language models (LLMs). Understanding their structure and components is critical for analyzing LLM behavior and optimizing performance. However, the field lacks a comprehensive framework for systematic prompt analysis and understanding. We introduce PromptPrism, a linguistically-inspired taxonomy that enables prompt analysis across three hierarchical levels: functional structure, semantic component, and syntactic pattern. We show the practical utility of PromptPrism by applying it to three applications: (1) a taxonomy-guided prompt refinement approach that automatically improves prompt quality and enhances model performance across a range of tasks; (2) a multi-dimensional dataset profiling method that extracts and aggregates structural, semantic, and syntactic characteristics from prompt datasets, enabling comprehensive analysis of prompt distributions and patterns; (3) a controlled experimental framework for prompt sensitivity analysis by quantifying the impact of semantic reordering and delimiter modifications on LLM performance. Our experimental results validate the effectiveness of our taxonomy across these applications, demonstrating that PromptPrism provides a foundation for refining, profiling, and analyzing prompts.
zh

[NLP-114] CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling

【速读】: 该论文旨在解决代码混杂语言(code-mixed language)在句内频繁切换语言所带来的结构挑战,这些问题超出了标准语言模型的处理能力。其解决方案的关键在于提出CMLFormer,这是一种增强的多层双解码器Transformer架构,具有共享编码器和同步解码器交叉注意力机制,能够建模代码混杂文本的语言和语义动态。CMLFormer通过在扩展的Hinglish语料库上进行预训练,并引入多种新目标以捕捉切换行为、跨语言结构和代码混杂复杂性,从而提升了对代码混杂语言的建模效果。

链接: https://arxiv.org/abs/2505.12587
作者: Aditeya Baral,Allen George Ajith,Roshan Nayak,Mrityunjay Abhijeet Bhanja
机构: New York University (纽约大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus with switching point and translation annotations with multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark under select pre-training setups. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer’s architecture and multi-task pre-training strategy for modeling code-mixed languages.
zh

[NLP-115] Improving Multilingual Language Models by Aligning Representations through Steering

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理非英语标记时的层表示问题,这是一个在该领域取得显著进展后仍存在的开放性问题。解决方案的关键在于通过表示引导(representation steering),具体而言是向单个模型层的激活值中添加一个学习到的向量,从而显著提升模型性能。这种方法在实验中表现出与翻译基线相当的结果,并优于现有的提示优化方法。

链接: https://arxiv.org/abs/2505.12584
作者: Omar Mahmoud,Buddhika Laknath Semage,Thommen George Karimpanal,Santu Rana
机构: Deakin University (迪肯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we investigate how large language models (LLMS) process non-English tokens within their layer representations, an open question despite significant advancements in the field. Using representation steering, specifically by adding a learned vector to a single model layer’s activations, we demonstrate that steering a single model layer can notably enhance performance. Our analysis shows that this approach achieves results comparable to translation baselines and surpasses state of the art prompt optimization methods. Additionally, we highlight how advanced techniques like supervised fine tuning (\textscsft) and reinforcement learning from human feedback (\textscrlhf) improve multilingual capabilities by altering representation spaces. We further illustrate how these methods align with our approach to reshaping LLMS layer representations.
zh

[NLP-116] Measuring Information Distortion in Hierarchical Ultra long Novel Generation:The Optimal Expansion Ratio

【速读】: 该论文试图解决如何在使用大型语言模型(Large Language Models, LLMs)撰写百万字小说时,确定必要的人工大纲长度以生成高质量内容的问题。其解决方案的关键在于引入一种分层的两阶段生成流程(大纲 - 详细大纲 - 手稿),并通过信息论分析量化不同压缩-扩展比率下LLMs在超长小说压缩与重建过程中的失真情况,从而找到在信息保留与人工投入之间取得平衡的最佳大纲长度。

链接: https://arxiv.org/abs/2505.12572
作者: Hanwen Shen,Ting Ying
机构: Stevens Institute of Technology (斯蒂文斯理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Writing novels with Large Language Models (LLMs) raises a critical question: how much human-authored outline is necessary to generate high-quality million-word novels? While frameworks such as DOME, PlanWrite, and Long Writer have improved stylistic coherence and logical consistency, they primarily target shorter novels (10k–100k words), leaving ultra-long generation largely unexplored. Drawing on insights from recent text compression methods like LLMZip and LLM2Vec, we conduct an information-theoretic analysis that quantifies distortion occurring when LLMs compress and reconstruct ultra-long novels under varying compression-expansion ratios. We introduce a hierarchical two-stage generation pipeline (outline - detailed outline - manuscript) and find an optimal outline length that balances information preservation with human effort. Through extensive experimentation with Chinese novels, we establish that a two-stage hierarchical outline approach significantly reduces semantic distortion compared to single-stage methods. Our findings provide empirically-grounded guidance for authors and researchers collaborating with LLMs to create million-word novels.
zh

[NLP-117] Enriching Patent Claim Generation with European Patent Dataset

【速读】: 该论文试图解决专利权利要求书(patent claims)撰写过程中存在的耗时、成本高及专业性强的问题,其解决方案的关键在于引入EPD数据集,即一个欧洲专利数据集。该数据集通过提供具有司法多样性、高质量已授权专利文本以及真实世界挑战模拟的结构化文本数据,支持多种与专利相关的任务,尤其是权利要求生成。EPD不仅填补了现有研究在欧洲专利基准评估方面的空白,还通过提升数据质量与模拟实际应用场景,显著提升了大语言模型(LLMs)在权利要求质量和跨领域泛化能力上的表现。

链接: https://arxiv.org/abs/2505.12568
作者: Lekang Jiang,Chengzu Li,Stephan Goetz
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注: 18 pages, 13 tables, 4 figures

点击查看摘要

Abstract:Drafting patent claims is time-intensive, costly, and requires professional skill. Therefore, researchers have investigated large language models (LLMs) to assist inventors in writing claims. However, existing work has largely relied on datasets from the United States Patent and Trademark Office (USPTO). To enlarge research scope regarding various jurisdictions, drafting conventions, and legal standards, we introduce EPD, a European patent dataset. EPD presents rich textual data and structured metadata to support multiple patent-related tasks, including claim generation. This dataset enriches the field in three critical aspects: (1) Jurisdictional diversity: Patents from different offices vary in legal and drafting conventions. EPD fills a critical gap by providing a benchmark for European patents to enable more comprehensive evaluation. (2) Quality improvement: EPD offers high-quality granted patents with finalized and legally approved texts, whereas others consist of patent applications that are unexamined or provisional. Experiments show that LLMs fine-tuned on EPD significantly outperform those trained on previous datasets and even GPT-4o in claim quality and cross-domain generalization. (3) Real-world simulation: We propose a difficult subset of EPD to better reflect real-world challenges of claim generation. Results reveal that all tested LLMs perform substantially worse on these challenging samples, which highlights the need for future research.
zh

[NLP-118] mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model

【速读】: 该论文试图解决大型语言模型(LLMs)在生成具有药物样性质的新分子方面能力有限的问题,以及生成的分子在实验室中难以合成的问题。其解决方案的关键在于引入mCLM,一种模块化的化学语言模型,该模型通过将分子分解为功能构建块(functional building blocks)进行分词,并学习自然语言描述与分子构建块之间的双语语言模型,从而实现更高效可合成分子的生成和分子功能的系统性提升。

链接: https://arxiv.org/abs/2505.12565
作者: Carl Edwards,Chi Han,Gawon Lee,Thao Nguyen,Bowen Jin,Chetan Kumar Prasad,Sara Szymkuć,Bartosz A. Grzybowski,Ying Diao,Jiawei Han,Ge Liu,Hao Peng,Martin D. Burke,Heng Ji
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Allchemy Inc. (Allchemy公司); Ulsan National Institute of Science and Technology (蔚山科学技术院); Institute of Organic Chemistry, Polish Academy of Sciences (波兰科学院有机化学研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Despite their ability to understand chemical knowledge and accurately generate sequential representations, large language models (LLMs) remain limited in their capacity to propose novel molecules with drug-like properties. In addition, the molecules that LLMs propose can often be challenging to make in the lab. To more effectively enable the discovery of functional small molecules, LLMs need to learn a molecular language. However, LLMs are currently limited by encoding molecules from atoms. In this paper, we argue that just like tokenizing texts into (sub-)word tokens instead of characters, molecules should be decomposed and reassembled at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model tokenizing molecules into building blocks and learning a bilingual language model of both natural language descriptions of functions and molecule building blocks. By reasoning on such functional building blocks, mCLM guarantees to generate efficiently synthesizable molecules thanks to recent progress in block-based chemistry, while also improving the functions of molecules in a principled manner. In experiments on 430 FDA-approved drugs, we find mCLM capable of significantly improving 5 out of 6 chemical functions critical to determining drug potentials. More importantly, mCLM can reason on multiple functions and improve the FDA-rejected drugs (``fallen angels’') over multiple iterations to greatly improve their shortcomings.
zh

[NLP-119] he taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations

【速读】: 该论文试图解决跨语言研究中数据分布不均的问题,即现有数据集通常要么针对少数语言提供大量数据,要么针对大量语言仅提供少量数据,从而限制了对人类语言能力普遍属性的揭示。解决方案的关键是构建一个大规模自动标注的平行语料库(taggedPBC),该语料库包含来自1500多种语言的超过1800句词性标注的平行文本,覆盖133个语系和111种孤立语言,其标注准确性与当前最先进的标注工具(如SpaCy、Trankit)以及人工标注语料库(Universal Dependencies Treebanks)具有良好的相关性。此外,基于该数据集提出的N1比例指标能够与三种类型学数据库(WALS、Grambank、Autotyp)中的专家词序判断相关联,为未在这些数据库中的语言的基本词序识别提供了有效方法。

链接: https://arxiv.org/abs/2505.12560
作者: Hiram Ring
机构: NTU Singapore(南洋理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing datasets available for crosslinguistic investigations have tended to focus on large amounts of data for a small group of languages or a small amount of data for a large number of languages. This means that claims based on these datasets are limited in what they reveal about universal properties of the human language faculty. While this has begun to change through the efforts of projects seeking to develop tagged corpora for a large number of languages, such efforts are still constrained by limits on resources. The current paper reports on a large automatically tagged parallel dataset which has been developed to partially address this issue. The taggedPBC contains more than 1,800 sentences of pos-tagged parallel text data from over 1,500 languages, representing 133 language families and 111 isolates, dwarfing previously available resources. The accuracy of tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages (SpaCy, Trankit) as well as hand-tagged corpora (Universal Dependencies Treebanks). Additionally, a novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of word order in three typological databases (WALS, Grambank, Autotyp) such that a Gaussian Naive Bayes classifier trained on this feature can accurately identify basic word order for languages not in those databases. While much work is still needed to expand and develop this dataset, the taggedPBC is an important step to enable corpus-based crosslinguistic investigations, and is made available for research and collaboration via GitHub.
zh

[NLP-120] Extracting memorized pieces of (copyrighted) books from open-weight language models

【速读】: 该论文试图解决版权诉讼中关于生成式 AI (Generative AI) 是否大规模记忆了受版权保护的文本内容的问题,这一问题在涉及大型语言模型(LLMs)的案件中常引发激烈争议。论文的关键解决方案是利用一种近期的概率提取技术,从13个开源权重的LLMs中提取Books3数据集中的文本片段,并通过实验验证这些模型是否在参数中存储了这些内容,从而揭示记忆程度与版权之间的复杂关系。

链接: https://arxiv.org/abs/2505.12546
作者: A. Feder Cooper,Aaron Gokaslan,Amy B. Cyphert,Christopher De Sa,Mark A. Lemley,Daniel E. Ho,Percy Liang
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs’ protected expression. Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we leverage a recent probabilistic extraction technique to extract pieces of the Books3 dataset from 13 open-weight LLMs. Through numerous experiments, we show that it’s possible to extract substantial parts of at least some books from different LLMs. This is evidence that the LLMs have memorized the extracted text; this memorized content is copied inside the model parameters. But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don’t memorize most books – either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
zh

[NLP-121] owards Reliable and Interpretable Traffic Crash Pattern Prediction and Safety Interventions Using Customized Large Language Models

【速读】: 该论文旨在解决现有方法在处理多源交通碰撞数据(包括数值特征、文本报告、碰撞图像、环境条件和驾驶员行为记录)时难以解释其复杂相互作用的问题,从而无法充分捕捉数据中的语义信息和内在关联,限制了对关键碰撞风险因素的识别能力。解决方案的关键在于提出TrafficSafe框架,该框架通过将大型语言模型(LLM)适配为基于文本的推理任务,重新定义碰撞预测与特征归因问题,并构建了一个包含58,903条真实世界报告的多模态碰撞数据集,通过定制化和微调LLM实现了更高的F1分数表现。此外,引入的TrafficSafe Attribution框架能够进行句级特征归因,支持条件风险分析,揭示了酒精影响驾驶等关键风险因素的作用。

链接: https://arxiv.org/abs/2505.12545
作者: Yang Zhao,Pu Wang,Yibo Zhao,Hongru Du, Hao (Frank)Yang
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: Last revised 13 Feb 2025. Under review in Nature portfolio

点击查看摘要

Abstract:Predicting crash events is crucial for understanding crash distributions and their contributing factors, thereby enabling the design of proactive traffic safety policy interventions. However, existing methods struggle to interpret the complex interplay among various sources of traffic crash data, including numeric characteristics, textual reports, crash imagery, environmental conditions, and driver behavior records. As a result, they often fail to capture the rich semantic information and intricate interrelationships embedded in these diverse data sources, limiting their ability to identify critical crash risk factors. In this research, we propose TrafficSafe, a framework that adapts LLMs to reframe crash prediction and feature attribution as text-based reasoning. A multi-modal crash dataset including 58,903 real-world reports together with belonged infrastructure, environmental, driver, and vehicle information is collected and textualized into TrafficSafe Event Dataset. By customizing and fine-tuning LLMs on this dataset, the TrafficSafe LLM achieves a 42% average improvement in F1-score over baselines. To interpret these predictions and uncover contributing factors, we introduce TrafficSafe Attribution, a sentence-level feature attribution framework enabling conditional risk analysis. Findings show that alcohol-impaired driving is the leading factor in severe crashes, with aggressive and impairment-related behaviors having nearly twice the contribution for severe crashes compared to other driver behaviors. Furthermore, TrafficSafe Attribution highlights pivotal features during model training, guiding strategic crash data collection for iterative performance improvements. The proposed TrafficSafe offers a transformative leap in traffic safety research, providing a blueprint for translating advanced AI technologies into responsible, actionable, and life-saving outcomes.
zh

[NLP-122] Disambiguation in Conversational Question Answering in the Era of LLM : A Survey

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)中由于人类语言固有的复杂性和灵活性所带来的歧义问题。随着大型语言模型(Large Language Models, LLMs)的出现,这一问题变得更加关键,因为它们具备更广泛的能力和应用场景。论文的核心在于探讨歧义的定义、形式及其对语言驱动系统的影响,并提出由LLMs支持的多种消歧方法,分析其优缺点,同时梳理可用于评估歧义检测与解决技术的公开数据集。解决方案的关键在于通过系统性分类和比较不同消歧方法,为未来研究提供方向并推动更稳健、可靠的语言系统的发展。

链接: https://arxiv.org/abs/2505.12543
作者: Md Mehrab Tanjim,Yeonjun In,Xiang Chen,Victor S. Bursztyn,Ryan A. Rossi,Sungchul Kim,Guang-Jie Ren,Vaishnavi Muppala,Shun Jiang,Yongsung Kim,Chanyoung Park
机构: Adobe Research(Adobe 研究院); KAIST(韩国科学技术院); Adobe Inc.(Adobe 公司)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Ambiguity remains a fundamental challenge in Natural Language Processing (NLP) due to the inherent complexity and flexibility of human language. With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. In the context of Conversational Question Answering (CQA), this paper explores the definition, forms, and implications of ambiguity for language driven systems, particularly in the context of LLMs. We define key terms and concepts, categorize various disambiguation approaches enabled by LLMs, and provide a comparative analysis of their advantages and disadvantages. We also explore publicly available datasets for benchmarking ambiguity detection and resolution techniques and highlight their relevance for ongoing research. Finally, we identify open problems and future research directions, proposing areas for further investigation. By offering a comprehensive review of current research on ambiguities and disambiguation with LLMs, we aim to contribute to the development of more robust and reliable language systems.
zh

[NLP-123] Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE

【速读】: 该论文试图解决关系抽取(Relation Extraction, RE)模型在泛化能力上的问题,即评估这些模型是否学习了稳健的关联模式,还是依赖于虚假相关性。其关键解决方案在于揭示数据质量对模型迁移能力的重要性,而非词汇相似性,并指出适应策略的选择应基于数据质量:高质量数据下微调(fine-tuning)表现最佳,而噪声数据下少样本上下文学习(few-shot in-context learning, ICL)更为有效。此外,研究还强调了RE基准测试中的结构性问题对模型迁移能力的阻碍作用。

链接: https://arxiv.org/abs/2505.12533
作者: Varvara Arzt,Allan Hanbury,Michael Wiegand,Gábor Recski,Terra Blevins
机构: Faculty of Informatics, TU Wien (信息学学院,维也纳技术大学); D!ARC, University of Klagenfurt (数字艺术与研究中心,克拉根福大学); Digital Philology, University of Vienna (数字语文学,维也纳大学); Faculty of Computer Science, University of Vienna (计算机科学学院,维也纳大学); Khoury College of Computer Sciences, Northeastern University (Khoury计算机科学学院,东北大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Analysing the generalisation capabilities of relation extraction (RE) models is crucial for assessing whether they learn robust relational patterns or rely on spurious correlations. Our cross-dataset experiments find that RE models struggle with unseen data, even within similar domains. Notably, higher intra-dataset performance does not indicate better transferability, instead often signaling overfitting to dataset-specific artefacts. Our results also show that data quality, rather than lexical similarity, is key to robust transfer, and the choice of optimal adaptation strategy depends on the quality of data available: while fine-tuning yields the best cross-dataset performance with high-quality data, few-shot in-context learning (ICL) is more effective with noisier data. However, even in these cases, zero-shot baselines occasionally outperform all cross-dataset results. Structural issues in RE benchmarks, such as single-relation per sample constraints and non-standardised negative class definitions, further hinder model transferability.
zh

[NLP-124] ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents

【速读】: 该论文试图解决当前大规模语言模型(Large Language Models, LLMs)在心理健康聊天机器人中的应用缺乏可扩展且理论基础明确的评估方法的问题。其解决方案的关键在于提出ESC-Judge,这是一个端到端的评估框架,它基于Clara Hill的探索-洞察-行动(Exploration-Insight-Action)咨询模型,提供结构化且可解释的性能视图,并实现了评估流程的全面自动化。该框架通过合成真实的情感支持角色、让候选模型独立进行对话,并由专门的判断模型根据评估标准进行偏好排序,从而实现了高效且可靠的人类水平评估。

链接: https://arxiv.org/abs/2505.12531
作者: Navid Madani,Rohini Srihari
机构: University at Buffalo (纽约州立大学布法罗分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly power mental-health chatbots, yet the field still lacks a scalable, theory-grounded way to decide which model is most effective to deploy. We present ESC-Judge, the first end-to-end evaluation framework that (i) grounds head-to-head comparisons of emotional-support LLMs in Clara Hill’s established Exploration-Insight-Action counseling model, providing a structured and interpretable view of performance, and (ii) fully automates the evaluation pipeline at scale. ESC-Judge operates in three stages: first, it synthesizes realistic help-seeker roles by sampling empirically salient attributes such as stressors, personality, and life history; second, it has two candidate support agents conduct separate sessions with the same role, isolating model-specific strategies; and third, it asks a specialized judge LLM to express pairwise preferences across rubric-anchored skills that span the Exploration, Insight, and Action spectrum. In our study, ESC-Judge matched PhD-level annotators on 85 percent of Exploration, 83 percent of Insight, and 86 percent of Action decisions, demonstrating human-level reliability at a fraction of the cost. All code, prompts, synthetic roles, transcripts, and judgment scripts are released to promote transparent progress in emotionally supportive AI.
zh

[NLP-125] DS-ProGen: A Dual-Structure Deep Language Model for Functional Protein Design

【速读】: 该论文旨在解决蛋白质设计中的逆向折叠问题(Inverse Protein Folding, IPF),即根据给定的三维(3D)结构生成能够正确折叠的氨基酸序列。现有方法通常仅依赖主链坐标或分子表面特征,难以全面捕捉精确序列预测所需的复杂化学与几何约束。其解决方案的关键在于提出DS-ProGen,这是一种结合主链几何结构和表面层次表示的双结构深度语言模型,通过将主链坐标以及表面化学和几何描述符整合到下一个氨基酸预测范式中,从而生成功能相关且结构稳定的序列,同时满足全局和局部构象约束。

链接: https://arxiv.org/abs/2505.12511
作者: Yanting Li,Jiyue Jiang,Zikang Wang,Ziqian Lin,Dongchen He,Yuheng Shan,Yanruisheng Shao,Jiayi Li,Xiangyu Shi,Jiuming Wang,Yanyu Chen,Yimin Fan,Han Li,Yu Li
机构: The Chinese University of Hong Kong (香港中文大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong Polytechnic University (香港理工大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Inverse Protein Folding (IPF) is a critical subtask in the field of protein design, aiming to engineer amino acid sequences capable of folding correctly into a specified three-dimensional (3D) conformation. Although substantial progress has been achieved in recent years, existing methods generally rely on either backbone coordinates or molecular surface features alone, which restricts their ability to fully capture the complex chemical and geometric constraints necessary for precise sequence prediction. To address this limitation, we present DS-ProGen, a dual-structure deep language model for functional protein design, which integrates both backbone geometry and surface-level representations. By incorporating backbone coordinates as well as surface chemical and geometric descriptors into a next-amino-acid prediction paradigm, DS-ProGen is able to generate functionally relevant and structurally stable sequences while satisfying both global and local conformational constraints. On the PRIDE dataset, DS-ProGen attains the current state-of-the-art recovery rate of 61.47%, demonstrating the synergistic advantage of multi-modal structural encoding in protein design. Furthermore, DS-ProGen excels in predicting interactions with a variety of biological partners, including ligands, ions, and RNA, confirming its robust functional retention capabilities.
zh

[NLP-126] LM2otifs : An Explainable Framework for Machine-Generated Texts Detection

【速读】: 该论文试图解决大规模语言模型生成文本(Machine-Generated Text, MGT)与人类生成文本(Human-Generated Text, HGT)在作者身份认证中的区分问题,特别是现有检测方法缺乏可解释性的挑战。解决方案的关键在于提出一种名为LM^2 otifs的可解释框架,该框架基于概率图模型理论,利用可解释图神经网络(eXplainable Graph Neural Networks)实现准确的检测与可解释性。其核心流程包括文本到图的转换、图神经网络预测以及后处理可解释性方法提取可解释的模式,从而提供从词语到句法结构的多层级解释。

链接: https://arxiv.org/abs/2505.12507
作者: Xu Zheng,Zhuomin Chen,Esteban Schafir,Sipeng Chen,Hojat Allah Salehi,Haifeng Chen,Farhad Shirani,Wei Cheng,Dongsheng Luo
机构: Florida International University (佛罗里达国际大学); NEC Laboratories America (美国NEC实验室)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The impressive ability of large language models to generate natural text across various tasks has led to critical challenges in authorship authentication. Although numerous detection methods have been developed to differentiate between machine-generated texts (MGT) and human-generated texts (HGT), the explainability of these methods remains a significant gap. Traditional explainability techniques often fall short in capturing the complex word relationships that distinguish HGT from MGT. To address this limitation, we present LM ^2 otifs, a novel explainable framework for MGT detection. Inspired by probabilistic graphical models, we provide a theoretical rationale for the effectiveness. LM ^2 otifs utilizes eXplainable Graph Neural Networks to achieve both accurate detection and interpretability. The LM ^2 otifs pipeline operates in three key stages: first, it transforms text into graphs based on word co-occurrence to represent lexical dependencies; second, graph neural networks are used for prediction; and third, a post-hoc explainability method extracts interpretable motifs, offering multi-level explanations from individual words to sentence structures. Extensive experiments on multiple benchmark datasets demonstrate the comparable performance of LM ^2 otifs. The empirical evaluation of the extracted explainable motifs confirms their effectiveness in differentiating HGT and MGT. Furthermore, qualitative analysis reveals distinct and visible linguistic fingerprints characteristic of MGT.
zh

[NLP-127] KG-QAGen: A Knowledge-Graph-Based Framework for Systematic Question Generation and Long-Context LLM Evaluation

【速读】: 该论文试图解决现代语言模型在处理长文档时,缺乏系统性评估其信息检索与处理能力的基准问题。解决方案的关键在于提出KG-QAGen(基于知识图谱的问答生成)框架,该框架通过金融协议的结构化表示,在多跳检索、集合运算和答案多样性三个关键维度上生成多层次复杂度的问答对,从而实现对模型性能的细粒度评估。

链接: https://arxiv.org/abs/2505.12495
作者: Nikita Tatarinov,Vidhyakshaya Kannan,Haricharana Srinivasa,Arnav Raj,Harpreet Singh Anand,Varun Singh,Aditya Luthra,Ravij Lade,Agam Shah,Sudheer Chava
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing context length of modern language models has created a need for evaluating their ability to retrieve and process information across extensive documents. While existing benchmarks test long-context capabilities, they often lack a structured way to systematically vary question complexity. We introduce KG-QAGen (Knowledge-Graph-based Question-Answer Generation), a framework that (1) extracts QA pairs at multiple complexity levels (2) by leveraging structured representations of financial agreements (3) along three key dimensions – multi-hop retrieval, set operations, and answer plurality – enabling fine-grained assessment of model performance across controlled difficulty levels. Using this framework, we construct a dataset of 20,139 QA pairs (the largest number among the long-context benchmarks) and open-source a part of it. We evaluate 13 proprietary and open-source LLMs and observe that even the best-performing models are struggling with set-based comparisons and multi-hop logical inference. Our analysis reveals systematic failure modes tied to semantic misinterpretation and inability to handle implicit relations.
zh

[NLP-128] Enhancing Large Language Models with Reward-guided Tree Search for Knowledge Graph Question and Answering

【速读】: 该论文旨在解决知识图谱问答(KGQA)任务中现有基于大语言模型(LLMs)的方法在探索新最优推理路径时忽视历史推理路径利用的问题,以及复杂语义导致检索到不准确推理路径的问题。其解决方案的关键在于提出一种无需训练的框架——基于图的奖励引导树搜索(RTSoG),该框架通过将原始问题分解为一系列更简单且定义明确的子问题来处理复杂语义,随后引入由奖励模型引导的自评蒙特卡洛树搜索(SC-MCTS)迭代检索加权推理路径作为上下文知识,并根据权重堆叠这些路径以生成最终答案。

链接: https://arxiv.org/abs/2505.12476
作者: Xiao Long,Liansheng Zhuang,Chen Shen,Shaotian Yan,Yifei Li,Shafei Wang
机构: University of Science and Technology of China (USTC)(中国科学技术大学); Peng Cheng Laboratory(鹏城实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, large language models (LLMs) have demonstrated impressive performance in Knowledge Graph Question Answering (KGQA) tasks, which aim to find answers based on knowledge graphs (KGs) for natural language questions. Existing LLMs-based KGQA methods typically follow the Graph Retrieval-Augmented Generation (GraphRAG) paradigm, which first retrieves reasoning paths from the large KGs, and then generates the answers based on them. However, these methods emphasize the exploration of new optimal reasoning paths in KGs while ignoring the exploitation of historical reasoning paths, which may lead to sub-optimal reasoning paths. Additionally, the complex semantics contained in questions may lead to the retrieval of inaccurate reasoning paths. To address these issues, this paper proposes a novel and training-free framework for KGQA tasks called Reward-guided Tree Search on Graph (RTSoG). RTSoG decomposes an original question into a series of simpler and well-defined sub-questions to handle the complex semantics. Then, a Self-Critic Monte Carlo Tree Search (SC-MCTS) guided by a reward model is introduced to iteratively retrieve weighted reasoning paths as contextual knowledge. Finally, it stacks the weighted reasoning paths according to their weights to generate the final answers. Extensive experiments on four datasets demonstrate the effectiveness of RTSoG. Notably, it achieves 8.7% and 7.0% performance improvement over the state-of-the-art method on the GrailQA and the WebQSP respectively.
zh

[NLP-129] What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization EMNLP2025

【速读】: 该论文试图解决现有对话摘要系统因仅依赖讨论信息而导致的外部观察者混淆问题(external observer confusion)。其关键解决方案是将任务输出建模为背景和观点摘要,并定义两种标准化的摘要模式,同时引入首个由人类专家一致标注的高质量基准数据集以及一种具有细粒度可解释指标的分层评估框架。

链接: https://arxiv.org/abs/2505.12474
作者: Weixiao Zhou,Junnan Zhu,Gengyao Li,Xianfu Cheng,Xinnian Liang,Feifei Zhai,Zhoujun Li
机构: Beihang University (北京航空航天大学); Institute of Automation, CAS (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); ByteDance (字节跳动); Fanyu AI Laboratory (凡宇人工智能实验室)
类目: Computation and Language (cs.CL)
备注: Submitted to EMNLP 2025

点击查看摘要

Abstract:In this work, we investigate the performance of LLMs on a new task that requires combining discussion with background knowledge for summarization. This aims to address the limitation of outside observer confusion in existing dialogue summarization systems due to their reliance solely on discussion information. To achieve this, we model the task output as background and opinion summaries and define two standardized summarization patterns. To support assessment, we introduce the first benchmark comprising high-quality samples consistently annotated by human experts and propose a novel hierarchical evaluation framework with fine-grained, interpretable metrics. We evaluate 12 LLMs under structured-prompt and self-reflection paradigms. Our findings reveal: (1) LLMs struggle with background summary retrieval, generation, and opinion summary integration. (2) Even top LLMs achieve less than 69% average performance across both patterns. (3) Current LLMs lack adequate self-evaluation and self-correction capabilities for this task.
zh

[NLP-130] UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)微调过程中计算成本过高的问题,主要原因是策略优化和评估需要多样本采样,因此高效的数据选择变得至关重要。解决方案的关键在于引入UFO-RL框架,该框架通过计算高效的单次不确定性估计来识别具有信息量的数据实例,从而实现数据评估速度提升至185倍,并基于此度量选择位于估计的最近发展区(Zone of Proximal Development, ZPD)内的数据进行训练,显著降低了训练时间并提升了模型性能与泛化能力。

链接: https://arxiv.org/abs/2505.12457
作者: Yang Zhao,Kai Xiong,Xiao Ding,Li Du,YangouOuyang,Zhouhao Sun,Jiannan Guan,Wenbin Zhang,Bin Liu,Dong Hu,Bing Qin,Ting Liu
机构: Harbin Institute of Technology, China (哈尔滨工业大学); Beijing Academy of Artificial Intelligence, Beijing, China (北京人工智能研究院); Du Xiaoman Technology (Beijing) Co., Ltd. (杜小满科技(北京)有限公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scaling RL for LLMs is computationally expensive, largely due to multi-sampling for policy optimization and evaluation, making efficient data selection crucial. Inspired by the Zone of Proximal Development (ZPD) theory, we hypothesize LLMs learn best from data within their potential comprehension zone. Addressing the limitation of conventional, computationally intensive multi-sampling methods for data assessment, we introduce UFO-RL. This novel framework uses a computationally efficient single-pass uncertainty estimation to identify informative data instances, achieving up to 185x faster data evaluation. UFO-RL leverages this metric to select data within the estimated ZPD for training. Experiments show that training with just 10% of data selected by UFO-RL yields performance comparable to or surpassing full-data training, reducing overall training time by up to 16x while enhancing stability and generalization. UFO-RL offers a practical and highly efficient strategy for scaling RL fine-tuning of LLMs by focusing learning on valuable data.
zh

[NLP-131] owards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations

【速读】: 该论文旨在解决远距离监督命名实体识别(DS-NER)中不同远距离标注方法之间的潜在噪声分布问题,以及由此带来的模型效果不稳定和性能受限的问题。其解决方案的关键在于从两个方面进行探索:一是改进远距离标注技术,涵盖传统基于规则的方法和创新的大语言模型监督方法;二是提出一种新的噪声评估框架,该框架将挑战细分为未标注实体问题(UEP)和噪声实体问题(NEP),并为每种问题提供针对性的解决策略。

链接: https://arxiv.org/abs/2505.12454
作者: Yuyang Ding,Dan Qiao,Juntao Li,Jiajie Xu,Pingfu Chao,Xiaofang Zhou,Min Zhang
机构: Soochow University (苏州大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation methods, enabling the automatic generation of training data by aligning text with external resources. Despite the many efforts in noise measurement methods, few works focus on the latent noise distribution between different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER by two aspects: (1) distant annotation techniques, which encompasses both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework addresses the challenges by distinctly categorizing them into the unlabeled-entity problem (UEP) and the noisy-entity problem (NEP), subsequently providing specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.
zh

[NLP-132] Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)内部知识虽存在但未被有效激活和利用的问题,尤其是在需要精细语义区分的领域中,模型的理解能力受限。解决方案的关键在于提出“自问自答”(self-questioning)策略,通过引导模型生成并回答自身问题,从而激活其潜在的背景知识,提升对技术概念的理解能力。该方法在包含大量技术术语和复杂写作的计算机科学专利对数据集上表现出显著效果,表明自问自答能够有效增强模型的内部知识利用效率,并为跨模型协作提供了新思路。

链接: https://arxiv.org/abs/2505.12452
作者: Siyang Wu,Honglin Bao,Nadav Kunievsky,James A. Evans
机构: University of Chicago (芝加哥大学); Knowledge Lab (知识实验室); Data Science Institute (数据科学研究所)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: We commit to fully open-source our patent dataset

点击查看摘要

Abstract:Large language models (LLMs) increasingly demonstrate signs of conceptual understanding, yet much of their internal knowledge remains latent, loosely structured, and difficult to access or evaluate. We propose self-questioning as a lightweight and scalable strategy to improve LLMs’ understanding, particularly in domains where success depends on fine-grained semantic distinctions. To evaluate this approach, we introduce a challenging new benchmark of 1.3 million post-2015 computer science patent pairs, characterized by dense technical jargon and strategically complex writing. The benchmark centers on a pairwise differentiation task: can a model distinguish between closely related but substantively different inventions? We show that prompting LLMs to generate and answer their own questions - targeting the background knowledge required for the task - significantly improves performance. These self-generated questions and answers activate otherwise underutilized internal knowledge. Allowing LLMs to retrieve answers from external scientific texts further enhances performance, suggesting that model knowledge is compressed and lacks the full richness of the training data. We also find that chain-of-thought prompting and self-questioning converge, though self-questioning remains more effective for improving understanding of technical concepts. Notably, we uncover an asymmetry in prompting: smaller models often generate more fundamental, more open-ended, better-aligned questions for mid-sized models than large models with better understanding do, revealing a new strategy for cross-model collaboration. Altogether, our findings establish self-questioning as both a practical mechanism for automatically improving LLM comprehension, especially in domains with sparse and underrepresented knowledge, and a diagnostic probe of how internal and external knowledge are organized.
zh

[NLP-133] IP Leakage Attacks Targeting LLM -Based Multi-Agent Systems

【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在协作执行复杂任务过程中面临的知识产权(Intellectual Property, IP)保护问题。其解决方案的关键在于提出一种名为MASLEAK的新型攻击框架,该框架能够在黑盒环境下通过公共API与MAS交互,精心构造对抗性查询以诱导、传播并保留揭示完整专有组件的响应,包括代理数量、系统拓扑、系统提示、任务指令和工具使用情况。

链接: https://arxiv.org/abs/2505.12442
作者: Liwen Wang,Wenxuan Wang,Shuai Wang,Zongjie Li,Zhenlan Ji,Zongyi Lyu,Daoyuan Wu,Shing-Chi Cheung
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has led to the emergence of Multi-Agent Systems (MAS) to perform complex tasks through collaboration. However, the intricate nature of MAS, including their architecture and agent interactions, raises significant concerns regarding intellectual property (IP) protection. In this paper, we introduce MASLEAK, a novel attack framework designed to extract sensitive information from MAS applications. MASLEAK targets a practical, black-box setting, where the adversary has no prior knowledge of the MAS architecture or agent configurations. The adversary can only interact with the MAS through its public API, submitting attack query q and observing outputs from the final agent. Inspired by how computer worms propagate and infect vulnerable network hosts, MASLEAK carefully crafts adversarial query q to elicit, propagate, and retain responses from each MAS agent that reveal a full set of proprietary components, including the number of agents, system topology, system prompts, task instructions, and tool usages. We construct the first synthetic dataset of MAS applications with 810 applications and also evaluate MASLEAK against real-world MAS applications, including Coze and CrewAI. MASLEAK achieves high accuracy in extracting MAS IP, with an average attack success rate of 87% for system prompts and task instructions, and 92% for system architecture in most cases. We conclude by discussing the implications of our findings and the potential defenses.
zh

[NLP-134] Learning to Play Like Humans: A Framework for LLM Adaptation in Interactive Fiction Games

【速读】: 该论文试图解决当前交互式小说(Interactive Fiction, IF)游戏中,基于人工智能的代理在理解叙事背景和游戏逻辑方面存在不足的问题,现有方法过于侧重任务特定性能指标,而缺乏对人类认知层面的模拟。其解决方案的关键在于提出一种受认知科学启发的框架——LPLH(Learning to Play Like Humans),该框架通过结构化地图构建、动作学习以及反馈驱动的经验分析三个核心组件,使大型语言模型(Large Language Models, LLMs)能够系统性地学习并模仿人类玩家的行为,从而实现更符合叙事意图和常识约束的决策过程。

链接: https://arxiv.org/abs/2505.12439
作者: Jinming Zhang,Yunfei Long
机构: University of Essex (埃塞克斯大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Interactive Fiction games (IF games) are where players interact through natural language commands. While recent advances in Artificial Intelligence agents have reignited interest in IF games as a domain for studying decision-making, existing approaches prioritize task-specific performance metrics over human-like comprehension of narrative context and gameplay logic. This work presents a cognitively inspired framework that guides Large Language Models (LLMs) to learn and play IF games systematically. Our proposed Learning to Play Like Humans (LPLH) framework integrates three key components: (1) structured map building to capture spatial and narrative relationships, (2) action learning to identify context-appropriate commands, and (3) feedback-driven experience analysis to refine decision-making over time. By aligning LLMs-based agents’ behavior with narrative intent and commonsense constraints, LPLH moves beyond purely exploratory strategies to deliver more interpretable, human-like performance. Crucially, this approach draws on cognitive science principles to more closely simulate how human players read, interpret, and respond within narrative worlds. As a result, LPLH reframes the IF games challenge as a learning problem for LLMs-based agents, offering a new path toward robust, context-aware gameplay in complex text-based environments.
zh

[NLP-135] PSC: Extending Context Window of Large Language Models via Phase Shift Calibration

【速读】: 该论文旨在解决现有基于Rotary Position Embedding (RoPE)的方法在扩展上下文窗口时难以预先定义最优缩放因子的问题,这一问题源于指数级增长的搜索空间。其解决方案的关键在于引入PSC(Phase Shift Calibration),这是一个用于校准现有方法预定义频率的小型模块,通过PSC的引入,能够显著提升如PI、YaRN和LongRoPE等方法的性能,并在不同模型和任务中展现出广泛的适用性和鲁棒性。

链接: https://arxiv.org/abs/2505.12423
作者: Wenqiao Zhu,Chao Xu,Lulu Wang,Jun Wu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rotary Position Embedding (RoPE) is an efficient position encoding approach and is widely utilized in numerous large language models (LLMs). Recently, a lot of methods have been put forward to further expand the context window based on RoPE. The core concept of those methods is to predefine or search for a set of factors to rescale the base frequencies of RoPE. Nevertheless, it is quite a challenge for existing methods to predefine an optimal factor due to the exponential search space. In view of this, we introduce PSC (Phase Shift Calibration), a small module for calibrating the frequencies predefined by existing methods. With the employment of PSC, we demonstrate that many existing methods can be further enhanced, like PI, YaRN, and LongRoPE. We conducted extensive experiments across multiple models and tasks. The results demonstrate that (1) when PSC is enabled, the comparative reductions in perplexity increase as the context window size is varied from 16k, to 32k, and up to 64k. (2) Our approach is broadly applicable and exhibits robustness across a variety of models and tasks. The code can be found at this https URL.
zh

[NLP-136] able-R1: Region-based Reinforcement Learning for Table Understanding

【速读】: 该论文旨在解决语言模型在表格问答任务中的性能优化问题,特别是针对表格结构化行-列交互带来的挑战。其解决方案的关键在于提出了一种基于区域的强化学习方法——Table-R1,该方法通过引入区域证据增强推理步骤,结合Region-Enhanced Supervised Fine-Tuning (RE-SFT) 和Table-Aware Group Relative Policy Optimization (TARPO),提升了模型对表格区域的识别能力及推理过程的准确性与效率。

链接: https://arxiv.org/abs/2505.12415
作者: Zhenhe Wu,Jian Yang,Jiaheng Liu,Xianjie Wu,Changzai Pan,Jie Zhang,Yu Zhao,Shuangyong Song,Yongxiang Li,Zhoujun Li
机构: Beihang University (北京航空航天大学); TeleAI, China Telecom Corp Ltd (中国电信集团有限公司); Nanjing University (南京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tables present unique challenges for language models due to their structured row-column interactions, necessitating specialized approaches for effective comprehension. While large language models (LLMs) have demonstrated potential in table reasoning through prompting and techniques like chain-of-thought (CoT) and program-of-thought (PoT), optimizing their performance for table question answering remains underexplored. In this paper, we introduce region-based Table-R1, a novel reinforcement learning approach that enhances LLM table understanding by integrating region evidence into reasoning steps. Our method employs Region-Enhanced Supervised Fine-Tuning (RE-SFT) to guide models in identifying relevant table regions before generating answers, incorporating textual, symbolic, and program-based reasoning. Additionally, Table-Aware Group Relative Policy Optimization (TARPO) introduces a mixed reward system to dynamically balance region accuracy and answer correctness, with decaying region rewards and consistency penalties to align reasoning steps. Experiments show that Table-R1 achieves an average performance improvement of 14.36 points across multiple base models on three benchmark datasets, even outperforming baseline models with ten times the parameters, while TARPO reduces response token consumption by 67.5% compared to GRPO, significantly advancing LLM capabilities in efficient tabular reasoning.
zh

[NLP-137] he power of text similarity in identifying AI-LLM paraphrased documents: The case of BBC news articles and ChatGPT

【速读】: 该论文试图解决生成式 AI(Generative AI)生成的改写文本可能引发的版权侵权问题,以及由此导致原创内容创作者收入损失的问题。其解决方案的关键在于提出一种基于模式相似性检测的算法方案,该方案不仅能够识别文章是否为 AI 改写,更重要的是能够确定侵权来源是否为 ChatGPT。该方法不依赖深度学习技术,而是通过分析文本中的模式相似性实现高精度的检测,实验结果表明其在准确率、精确率、灵敏度、特异性和 F1 分数等方面均达到 96% 以上。

链接: https://arxiv.org/abs/2505.12405
作者: Konstantinos Xylogiannopoulos,Petros Xanthopoulos,Panagiotis Karampelas,Georgios Bakamitsos
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI paraphrased text can be used for copyright infringement and the AI paraphrased content can deprive substantial revenue from original content creators. Despite this recent surge of malicious use of generative AI, there are few academic publications that research this threat. In this article, we demonstrate the ability of pattern-based similarity detection for AI paraphrased news recognition. We propose an algorithmic scheme, which is not limited to detect whether an article is an AI paraphrase, but, more importantly, to identify that the source of infringement is the ChatGPT. The proposed method is tested with a benchmark dataset specifically created for this task that incorporates real articles from BBC, incorporating a total of 2,224 articles across five different news categories, as well as 2,224 paraphrased articles created with ChatGPT. Results show that our pattern similarity-based method, that makes no use of deep learning, can detect ChatGPT assisted paraphrased articles at percentages 96.23% for accuracy, 96.25% for precision, 96.21% for sensitivity, 96.25% for specificity and 96.23% for F1 score.
zh

[NLP-138] raversal Verification for Speculative Tree Decoding

【速读】: 该论文旨在解决传统推测解码(speculative decoding)方法中因依赖逐标记验证机制而导致的接受率低和推测候选利用效率不足的问题。现有框架在每个时间步构建包含多个候选标记的标记树,但由于验证过程从根节点开始逐层进行,一旦父节点被拒绝,其所有子节点都会被丢弃,从而导致有效子序列的提前丢失。论文提出的解决方案是引入一种全新的推测解码算法——遍历验证(Traversal Verification),其关键在于通过从叶子节点到根节点的遍历方式,考虑从当前节点到根节点的整个标记序列的接受性,从而保留可能有效的子序列,理论上保证了与目标模型相同的概率分布,实现了无损推理和显著的加速效果。

链接: https://arxiv.org/abs/2505.12398
作者: Yepeng Weng,Qiao Hu,Xujie Chen,Li Liu,Dianwen Mei,Huishi Qiu,Jiang Tian,Zhongchao Shi
机构: Lenovo Advanced AI Technology Center, Lenovo (联想高级人工智能技术中心,联想); National Center for Mathematics and Interdisciplinary Sciences (NCMIS), AMSS, CAS (国家数学与交叉科学中心(NCMIS),中科院数学与系统科学研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in parallel to determine whether the drafted tokens should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token trees containing multiple candidates in each timestep. However, their reliance on token-level verification mechanisms introduces two critical limitations: First, the probability distribution of a sequence differs from that of individual tokens, leading to suboptimal acceptance length. Second, current verification schemes begin from the root node and proceed layer by layer in a top-down manner. Once a parent node is rejected, all its child nodes should be discarded, resulting in inefficient utilization of speculative candidates. This paper introduces Traversal Verification, a novel speculative decoding algorithm that fundamentally rethinks the verification paradigm through leaf-to-root traversal. Our approach considers the acceptance of the entire token sequence from the current node to the root, and preserves potentially valid subsequences that would be prematurely discarded by existing methods. We theoretically prove that the probability distribution obtained through Traversal Verification is identical to that of the target model, guaranteeing lossless inference while achieving substantial acceleration gains. Experimental results across different large language models and multiple tasks show that our method consistently improves acceptance length and throughput over existing methods
zh

[NLP-139] SLOT: Sample-specific Language Model Optimization at Test-time

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理复杂指令时表现不佳的问题,特别是在那些未在通用样本中充分表示的指令上。解决方案的关键在于提出SLOT(Sample-specific Language Model Optimization at Test-time),一种在测试阶段进行少量优化以更新轻量级样本特定参数向量的方法。该方法通过在输出头之前的最终隐藏层添加该参数向量,并在每个样本优化过程中缓存最后一层特征,实现高效的适应性调整,从而提升模型对具体指令的对齐与遵循能力。

链接: https://arxiv.org/abs/2505.12392
作者: Yang Hu,Xingyu Zhang,Xueji Fang,Zhiyang Chen,Xiao Wang,Huatian Zhang,Guojun Qi
机构: Westlake University (西湖大学); University of Washington (华盛顿大学); USTC (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model’s ability to more accurately respond to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performances on those not well represented among general samples. To address this, SLOT conducts few optimization steps at test-time to update a light-weight sample-specific parameter vector. It is added to the final hidden layer before the output head, and enables efficient adaptation by caching the last layer features during per-sample optimization. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better aligned with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at this https URL.
zh

[NLP-140] From n-gram to Attention: How Model Architectures Learn and Propagate Bias in Language Modeling

【速读】: 该论文试图解决语言模型(Language Models, LMs)中偏见传播的根源问题,特别是针对数据质量、模型架构以及数据的时间特性对偏见影响的研究不足。其解决方案的关键在于提出一种基于比较行为理论的方法,用以解释训练数据与模型架构在偏见传播过程中的复杂相互作用,从而系统性地追踪偏见的来源,而非仅关注表面现象。

链接: https://arxiv.org/abs/2505.12381
作者: Mohsinul Kabir,Tasfia Tahsin,Sophia Ananiadou
机构: The University of Manchester (曼彻斯特大学); Islamic University of Technology (伊斯兰理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:Current research on bias in language models (LMs) predominantly focuses on data quality, with significantly less attention paid to model architecture and temporal influences of data. Even more critically, few studies systematically investigate the origins of bias. We propose a methodology grounded in comparative behavioral theory to interpret the complex interaction between training data and model architecture in bias propagation during language modeling. Building on recent work that relates transformers to n-gram LMs, we evaluate how data, model design choices, and temporal dynamics affect bias propagation. Our findings reveal that: (1) n-gram LMs are highly sensitive to context window size in bias propagation, while transformers demonstrate architectural robustness; (2) the temporal provenance of training data significantly affects bias; and (3) different model architectures respond differentially to controlled bias injection, with certain biases (e.g. sexual orientation) being disproportionately amplified. As language models become ubiquitous, our findings highlight the need for a holistic approach – tracing bias to its origins across both data and model dimensions, not just symptoms, to mitigate harm.
zh

[NLP-141] MedAgent Board: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

【速读】: 该论文试图解决多智能体协作在复杂医疗任务中的实际优势尚不明确的问题,现有评估方法缺乏泛化性且未充分对比单模型和传统方法。其解决方案的关键是引入MedAgentBoard,这是一个全面的基准测试平台,用于系统评估多智能体协作、单大型语言模型(LLM)以及传统方法在四个多样化医疗任务类别中的表现,包括医学(视觉)问答、通俗摘要生成、结构化电子健康记录(EHR)预测建模和临床工作流自动化。通过这一基准,研究揭示了多智能体协作在特定场景下的优势,同时也指出其并不总能超越先进单LLM或专业传统方法,强调了在医疗AI解决方案中需采用任务特定、基于证据的选择与开发策略。

链接: https://arxiv.org/abs/2505.12371
作者: Yinghao Zhu,Ziyi He,Haoran Hu,Xiaochen Zheng,Xichen Zhang,Zixiang Wang,Junyi Gao,Liantao Ma,Lequan Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at this https URL.
zh

[NLP-142] CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement ACL

【速读】: 该论文试图解决大型语言模型中提示注入(prompt injection)的安全风险问题,特别是现有防护模型在上下文感知场景下的有效性不足以及过度防御倾向。其解决方案的关键在于提出CAPTURE,这是一个新型的上下文感知基准,能够以最少的领域内示例评估攻击检测能力与过度防御倾向,从而更全面地揭示当前防护模型在对抗性案例中的高假阴性和在良性场景中的过度假阳性问题。

链接: https://arxiv.org/abs/2505.12368
作者: Gauri Kholkar,Ratinder Ahuja
机构: Pure Storage(普瑞存储)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in ACL LLMSec Workshop 2025

点击查看摘要

Abstract:Prompt injection remains a major security risk for large language models. However, the efficacy of existing guardrail models in context-aware settings remains underexplored, as they often rely on static attack benchmarks. Additionally, they have over-defense tendencies. We introduce CAPTURE, a novel context-aware benchmark assessing both attack detection and over-defense tendencies with minimal in-domain examples. Our experiments reveal that current prompt injection guardrail models suffer from high false negatives in adversarial cases and excessive false positives in benign scenarios, highlighting critical limitations.
zh

[NLP-143] owards Visuospatial Cognition via Hierarchical Fusion of Visual Experts

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉空间认知任务中的不足,即在空间布局、关系和动态推理方面的薄弱表现。现有模型通常缺乏必要的架构组件和专门的训练数据以实现细粒度的空间理解。论文提出的解决方案是引入ViCA2(Visuospatial Cognitive Assistant 2),其关键在于采用双视觉编码器架构,融合SigLIP用于语义理解与Hiera用于空间结构提取,并结合令牌比例控制机制以提高效率,同时构建了大规模的ViCA-322K数据集用于针对性的指令微调。

链接: https://arxiv.org/abs/2505.12363
作者: Qi Feng(1),Hidetoshi Shimodaira(1 and 2) ((1) Kyoto University, (2) RIKEN)
机构: Kyoto University (京都大学); RIKEN (理化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 26 pages, 19 figures, 4 tables. Code, models, and dataset are available at our project page: this https URL

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, visuospatial cognition - reasoning about spatial layouts, relations, and dynamics - remains a significant challenge. Existing models often lack the necessary architectural components and specialized training data for fine-grained spatial understanding. We introduce ViCA2 (Visuospatial Cognitive Assistant 2), a novel MLLM designed to enhance spatial reasoning. ViCA2 features a dual vision encoder architecture integrating SigLIP for semantics and Hiera for spatial structure, coupled with a token ratio control mechanism for efficiency. We also developed ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning. On the challenging VSI-Bench benchmark, our ViCA2-7B model achieves a state-of-the-art average score of 56.8, significantly surpassing larger open-source models (e.g., LLaVA-NeXT-Video-72B, 40.9) and leading proprietary models (Gemini-1.5 Pro, 45.4). This demonstrates the effectiveness of our approach in achieving strong visuospatial intelligence with a compact model. We release ViCA2, its codebase, and the ViCA-322K dataset to facilitate further research.
zh

[NLP-144] Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds IJCAI2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练数据中可能继承并放大偏见的问题。研究发现,LLMs对引发偏见的标题作出的响应往往反映人类偏见,而通过群体响应聚合来减轻偏见的策略需要谨慎设计。解决方案的关键在于采用局部加权聚合方法,相较于简单的平均方法,该方法更有效地利用了LLM群体的智慧,从而实现偏见缓解和准确性的提升。此外,研究还表明,结合LLMs(准确性)与人类(多样性)的混合群体能够进一步提高性能并减少种族和性别相关的偏见。

链接: https://arxiv.org/abs/2505.12349
作者: Axel Abels,Tom Lenaerts
机构: Université Libre de Bruxelles (布鲁塞尔自由大学); Vrije Universiteit Brussel (布鲁塞尔自由大学); FARI, AI for the Common-Good Institute, ULB-VUB (FARI,人工智能共益研究所,ULB-VUB); Center for Human-Compatible AI, UC Berkeley (人类兼容人工智能中心,加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted for publication in the Proceedings of the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

点击查看摘要

Abstract:Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the “wisdom of the crowd”, can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.
zh

[NLP-145] UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)编辑数据集在知识领域覆盖范围有限、编辑评估范围狭窄以及忽视编辑需求的广泛性和编辑引发的涟漪效应多样性等问题。其解决方案的关键在于构建一个基于开放域知识的统一基准UniEdit,通过从五个主要类别中的25个常见领域选取实体,并利用开放域知识图谱中的大量三元组知识来构建编辑样本,以确保知识领域的全面覆盖;同时设计了邻域多跳链采样(Neighborhood Multi-hop Chain Sampling, NMCS)算法,以生成涵盖广泛涟漪效应的子图样本,最后通过专有LLMs将采样的知识子图转换为语法准确且句法多样的自然语言文本,从而提升基准的规模、全面性和多样性。

链接: https://arxiv.org/abs/2505.12345
作者: Qizhou Chen,Dakan Wang,Taolin Zhang,Zaoming Yan,Chengsong You,Chengyu Wang,Xiaofeng He
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Model editing aims to enhance the accuracy and reliability of large language models (LLMs) by efficiently adjusting their internal parameters. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce UniEdit, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design an Neighborhood Multi-hop Chain Sampling (NMCS) algorithm to sample subgraphs based on a given knowledge piece to entail comprehensive ripple effects to evaluate. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our UniEdit benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.
zh

[NLP-146] LLM SR@XLLM LLM 25: An Empirical Study of LLM for Structural Reasoning

【速读】: 该论文旨在解决大型语言模型在生成细粒度、可控制且可解释的推理过程中的评估问题(Reasoning Process Evaluation)。其关键解决方案是利用仅有的现成Meta-Llama-3-8B-Instruct模型,设计了一个简洁的少样本、多轮提示(few-shot, multi-turn prompt),该提示首先列举所有问题条件,随后引导模型对每个推理步骤进行标记、引用和裁决。此外,采用基于正则表达式的轻量级后处理模块来规范化文本并强制执行官方JSON模式,从而在无需微调、外部检索或集成的情况下取得了优异的性能表现。

链接: https://arxiv.org/abs/2505.12328
作者: Xinye Li,Mingqi Wan,Dianbo Sui
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Team asdfo123’s submission to the LLMSR@XLLM25 shared task, which evaluates large language models on producing fine-grained, controllable, and interpretable reasoning processes. Systems must extract all problem conditions, decompose a chain of thought into statement-evidence pairs, and verify the logical validity of each pair. Leveraging only the off-the-shelf Meta-Llama-3-8B-Instruct, we craft a concise few-shot, multi-turn prompt that first enumerates all conditions and then guides the model to label, cite, and adjudicate every reasoning step. A lightweight post-processor based on regular expressions normalises spans and enforces the official JSON schema. Without fine-tuning, external retrieval, or ensembling, our method ranks 5th overall, achieving macro F1 scores on par with substantially more complex and resource-consuming pipelines. We conclude by analysing the strengths and limitations of our approach and outlining directions for future research in structural reasoning with LLMs. Our code is available at this https URL.
zh

[NLP-147] ExpertSteer: Intervening in LLM s through Expert Knowledge

【速读】: 该论文试图解决在推理过程中引导大型语言模型(Large Language Models, LLMs)遵循期望行为的挑战,尤其是现有方法依赖于模型自身生成的转向向量,限制了其适用范围并无法利用外部专家模型的优势。解决方案的关键在于提出ExpertSteer,该方法通过任意专业专家模型生成转向向量,实现对任何LLMs的干预,其核心步骤包括使用自编码器对齐表示维度、基于互信息分析确定干预层对、利用递归特征机器从专家模型生成转向向量,并在推理阶段将这些向量应用于指定层以选择性引导目标模型,而无需更新模型参数。

链接: https://arxiv.org/abs/2505.12313
作者: Weixuan Wang,Minghao Wu,Barry Haddow,Alexandra Birch
机构: School of Informatics, University of Edinburgh (信息学院,爱丁堡大学); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit remarkable capabilities across various tasks, yet guiding them to follow desired behaviours during inference remains a significant challenge. Activation steering offers a promising method to control the generation process of LLMs by modifying their internal activations. However, existing methods commonly intervene in the model’s behaviour using steering vectors generated by the model itself, which constrains their effectiveness to that specific model and excludes the possibility of leveraging powerful external expert models for steering. To address these limitations, we propose ExpertSteer, a novel approach that leverages arbitrary specialized expert models to generate steering vectors, enabling intervention in any LLMs. ExpertSteer transfers the knowledge from an expert model to a target LLM through a cohesive four-step process: first aligning representation dimensions with auto-encoders to enable cross-model transfer, then identifying intervention layer pairs based on mutual information analysis, next generating steering vectors from the expert model using Recursive Feature Machines, and finally applying these vectors on the identified layers during inference to selectively guide the target LLM without updating model parameters. We conduct comprehensive experiments using three LLMs on 15 popular benchmarks across four distinct domains. Experiments demonstrate that ExpertSteer significantly outperforms established baselines across diverse tasks at minimal cost.
zh

[NLP-148] Visuospatial Cognitive Assistant

【速读】: 该论文旨在解决视频基础的空间认知问题,这一领域对于机器人技术和具身人工智能至关重要,但当前的视觉-语言模型(Vision-Language Models, VLMs)仍面临挑战。其关键解决方案是构建了一个名为ViCA-322K的大规模数据集,包含来自真实室内视频(如ARKitScenes、ScanNet、ScanNet++)的322,003对问答对,用于支持基于三维元数据的查询和视频复杂推理任务。此外,研究者还开发了在ViCA-322K上微调的ViCA-7B模型,在所有八个VSI-Bench任务中达到了新的最先进性能,同时通过引入ViCA-Thinking-2.68K数据集和对应的ViCA-7B-Thinking模型,增强了模型的空间推理可解释性。

链接: https://arxiv.org/abs/2505.12312
作者: Qi Feng(1),Hidetoshi Shimodaira(1 and 2) ((1) Kyoto University, (2) RIKEN)
机构: Kyoto University (京都大学); RIKEN (理化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 31 pages, 10 figures, 6 tables. The implementation and fine-tuned model (ViCA-7B) are publicly available at this https URL . The ViCA-322K dataset can be found at this https URL , and the ViCA-Thinking-2.68K dataset is at this https URL

点击查看摘要

Abstract:Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.
zh

[NLP-149] LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?

【速读】: 该论文试图解决大型多模态模型(Large Multimodal Models, LMMs)在涉及文本丰富的图像上的复杂逻辑推理任务中的表现不足问题,特别是其在减少对领域特定知识依赖的情况下进行推理的能力尚未得到充分探索。解决方案的关键在于构建一个名为LogicOCR的基准测试集,该测试集包含1,100道选择题,用于评估LMMs在文本-rich图像上的逻辑推理能力,其核心方法是通过从中国国家公务员考试的文本语料库中筛选数据,并开发一个可扩展的自动化流程将其转换为多模态样本,包括设计提示模板生成具有多样背景和布局的图像,并进行人工验证以确保质量。

链接: https://arxiv.org/abs/2505.12307
作者: Maoyuan Ye,Jing Zhang,Juhua Liu,Bo Du,Dacheng Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: GitHub: \url{ this https URL }

点击查看摘要

Abstract:Recent advances in Large Multimodal Models (LMMs) have significantly improved their reasoning and Optical Character Recognition (OCR) capabilities. However, their performance on complex logical reasoning tasks involving text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs’ logical reasoning abilities on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by curating a text corpus from the Chinese National Civil Servant Examination and develop a scalable, automated pipeline to convert it into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. We hope LogicOCR will serve as a valuable resource for advancing multimodal reasoning research. The dataset is available at this https URL.
zh

[NLP-150] Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在知识记忆能力方面研究不足的问题,特别是在缺乏标准化和高质量测试平台的情况下。其解决方案的关键在于提出一个名为WikiDYK的新型、现实世界且大规模的知识注入基准,该基准能够持续演化而无需人工干预。WikiDYK利用维基百科“Did You Know…”条目中由专家编辑精心挑选的可验证和清晰的人工编写事实,并将其转化为涵盖多种任务格式的问答对,从而构建了一个可扩展的知识评估体系。

链接: https://arxiv.org/abs/2505.12306
作者: Yuwei Zhang,Wenhao Yu,Shangbin Feng,Yifan Zhu,Letian Peng,Jayanth Srinivasa,Gaowen Liu,Jingbo Shang
机构: UC, San Diego(加州大学圣地亚哥分校); Tencent AI Lab Seattle(腾讯人工智能实验室西雅图); University of Washington(华盛顿大学); Cisco(思科)
类目: Computation and Language (cs.CL)
备注: Dataset is available at this https URL

点击查看摘要

Abstract:Despite significant advances in large language models (LLMs), their knowledge memorization capabilities remain underexplored, due to the lack of standardized and high-quality test ground. In this paper, we introduce a novel, real-world and large-scale knowledge injection benchmark that evolves continuously over time without requiring human intervention. Specifically, we propose WikiDYK, which leverages recently-added and human-written facts from Wikipedia’s “Did You Know…” entries. These entries are carefully selected by expert Wikipedia editors based on criteria such as verifiability and clarity. Each entry is converted into multiple question-answer pairs spanning diverse task formats from easy cloze prompts to complex multi-hop questions. WikiDYK contains 12,290 facts and 77,180 questions, which is also seamlessly extensible with future updates from Wikipedia editors. Extensive experiments using continued pre-training reveal a surprising insight: despite their prevalence in modern LLMs, Causal Language Models (CLMs) demonstrate significantly weaker knowledge memorization capabilities compared to Bidirectional Language Models (BiLMs), exhibiting a 23% lower accuracy in terms of reliability. To compensate for the smaller scales of current BiLMs, we introduce a modular collaborative framework utilizing ensembles of BiLMs as external knowledge repositories to integrate with LLMs. Experiment shows that our framework further improves the reliability accuracy by up to 29.1%.
zh

[NLP-151] Beyond Single-Point Judgment: Distribution Alignment for LLM -as-a-Judge

【速读】: 该论文试图解决当前基于大语言模型(Large Language Models, LLMs)的评估方法中,由于依赖单点评估而忽视人类评估的内在多样性和不确定性,导致信息丢失和评估可靠性下降的问题。其解决方案的关键在于提出一种新的训练框架,通过显式对齐LLM生成的判断分布与实际人类分布,具体采用基于KL散度的分布对齐目标,并结合辅助交叉熵正则化以稳定训练过程;同时引入对抗训练以增强模型对分布扰动的鲁棒性。

链接: https://arxiv.org/abs/2505.12301
作者: Luyu Chen,Zeyu Zhang,Haoran Tan,Quanyu Dai,Hao Yang,Zhenhua Dong,Xu Chen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 3 tables, 3 figures

点击查看摘要

Abstract:LLMs have emerged as powerful evaluators in the LLM-as-a-Judge paradigm, offering significant efficiency and flexibility compared to human judgments. However, previous methods primarily rely on single-point evaluations, overlooking the inherent diversity and uncertainty in human evaluations. This approach leads to information loss and decreases the reliability of evaluations. To address this limitation, we propose a novel training framework that explicitly aligns the LLM-generated judgment distribution with empirical human distributions. Specifically, we propose a distributional alignment objective based on KL divergence, combined with an auxiliary cross-entropy regularization to stabilize the training process. Furthermore, considering that empirical distributions may derive from limited human annotations, we incorporate adversarial training to enhance model robustness against distribution perturbations. Extensive experiments across various LLM backbones and evaluation tasks demonstrate that our framework significantly outperforms existing closed-source LLMs and conventional single-point alignment methods, with improved alignment quality, evaluation accuracy, and robustness.
zh

[NLP-152] HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models

【速读】: 该论文试图解决在混合多种数据集上微调大语言模型(Large Language Models, LLMs)时面临的数据不平衡和异质性问题。现有方法通常仅从全局角度处理数据分布,而忽略了单个数据集内部的不平衡与异质性,从而限制了微调效果。该论文提出的解决方案是分层平衡优化(Hierarchical Balancing Optimization, HBO),其关键在于采用双层优化策略,包含一个全局策略(Global Actor)用于跨数据集的数据采样平衡,以及多个局部策略(Local Actors)用于根据难度水平优化每个数据集内的数据使用。这些策略通过基于模型训练状态的奖励函数进行指导,从而实现对数据分配的自主调整。

链接: https://arxiv.org/abs/2505.12300
作者: Weixuan Wang,Minghao Wu,Barry Haddow,Alexandra Birch
机构: School of Informatics, University of Edinburgh (信息学院,爱丁堡大学); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimizes data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM’s training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.
zh

[NLP-153] Enhance Mobile Agents Thinking Process Via Iterative Preference Learning

【速读】: 该论文旨在解决基于视觉语言模型(VLM)的移动代理在图形用户界面(GUI)任务中因缺乏多样化的Chain of Action-Planning Thoughts (CoaT)轨迹而导致的表达能力和泛化能力不足的问题。现有方法在应对数据稀缺性时,要么忽略中间推理步骤的正确性,要么依赖昂贵的过程级标注来构建过程奖励模型(PRM)。论文提出的解决方案关键在于一种迭代偏好学习(Iterative Preference Learning, IPL),通过迭代采样构建CoaT树,利用基于规则的奖励对叶节点进行评分,并反向传播反馈以生成Thinking-level Direct Preference Optimization (T-DPO)对,从而提升模型的推理性能。

链接: https://arxiv.org/abs/2505.12299
作者: Kun Huang,Weikai Xu,Yuxuan Liu,Quandong Wang,Pengzhi Gao,Wei Liu,Jian Luan,Bin Wang,Bo An
机构: XiaoMi AI Lab (小米人工智能实验室); Nanyang Technological University (南洋理工大学); Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures, 7 tables

点击查看摘要

Abstract:The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRM). To address the above problems, we propose an Iterative Preference Learning (IPL) that constructs a CoaT-tree through interative sampling, scores leaf nodes using rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction evolution, which leverages GPT-4o to generate diverse Q\A pairs based on real mobile UI screenshots, enhancing both generality and layout understanding. Experiments on three standard Mobile GUI-agent benchmarks demonstrate that our agent MobileIPL outperforms strong baselines, including continual pretraining models such as OS-ATLAS and UI-TARS. It achieves state-of-the-art performance across three standard Mobile GUI-Agents benchmarks and shows strong generalization to out-of-domain scenarios.
zh

[NLP-154] he Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models

【速读】: 该论文旨在解决封闭源代码大型语言模型(LLMs)在多语言攻击场景下的安全漏洞问题,特别是针对对抗性提示注入(adversarial prompt injections)的防御能力。其解决方案的关键在于提出了一种首创的集成对抗框架,该框架结合多种攻击技术,系统地评估了包括GPT-4o、DeepSeek-R1、Gemini-1.5-Pro和Qwen-Max在内的前沿专有模型的安全性,并通过攻击成功率(ASR)从提示设计、模型架构和语言环境三个维度进行量化分析。

链接: https://arxiv.org/abs/2505.12287
作者: Linghan Huang,Haolin Jin,Zhaoge Bi,Pengyue Yang,Peizhou Zhao,Taozhao Chen,Xiongfei Wu,Lei Ma,Huaming Chen
机构: University of Sydney(悉尼大学); The University of Tokyo(东京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have seen widespread applications across various domains, yet remain vulnerable to adversarial prompt injections. While most existing research on jailbreak attacks and hallucination phenomena has focused primarily on open-source models, we investigate the frontier of closed-source LLMs under multilingual attack scenarios. We present a first-of-its-kind integrated adversarial framework that leverages diverse attack techniques to systematically evaluate frontier proprietary solutions, including GPT-4o, DeepSeek-R1, Gemini-1.5-Pro, and Qwen-Max. Our evaluation spans six categories of security contents in both English and Chinese, generating 38,400 responses across 32 types of jailbreak attacks. Attack success rate (ASR) is utilized as the quantitative metric to assess performance from three dimensions: prompt design, model architecture, and language environment. Our findings suggest that Qwen-Max is the most vulnerable, while GPT-4o shows the strongest defense. Notably, prompts in Chinese consistently yield higher ASRs than their English counterparts, and our novel Two-Sides attack technique proves to be the most effective across all models. This work highlights a dire need for language-aware alignment and robust cross-lingual defenses in LLMs, and we hope it will inspire researchers, developers, and policymakers toward more robust and inclusive AI systems.
zh

[NLP-155] Efficient RL Training for Reasoning Models via Length-Aware Optimization

【速读】: 该论文旨在解决大型推理模型在执行推理任务时产生的长推理路径所带来的高内存和时间成本问题。现有方法主要通过引入额外的训练数据和阶段来缩短推理路径,而本文提出的关键解决方案是将三种重要的奖励设计直接集成到大型推理模型的强化学习过程中,从而在不增加额外训练阶段的情况下减少响应长度。实验结果表明,该方法在保持或提升性能的同时显著降低了响应长度。

链接: https://arxiv.org/abs/2505.12284
作者: Danlong Yuan,Tian Xie,Shaohan Huang,Zhuocheng Gong,Huishuai Zhang,Chong Luo,Furu Wei,Dongyan Zhao
机构: Peking University (北京大学); University of Science and Technology of China (中国科学技术大学); Microsoft Research (微软研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Large reasoning models, such as OpenAI o1 or DeepSeek R1, have demonstrated remarkable performance on reasoning tasks but often incur a long reasoning path with significant memory and time costs. Existing methods primarily aim to shorten reasoning paths by introducing additional training data and stages. In this paper, we propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models, which reduce the response length without extra training stages. Experiments on four settings show that our method significantly decreases response length while maintaining or even improving performance. Specifically, in a logic reasoning setting, we achieve a 40% reduction in response length averaged by steps alongside a 14% gain in performance. For math problems, we reduce response length averaged by steps by 33% while preserving performance.
zh

[NLP-156] LLM -Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark

【速读】: 该论文旨在解决低资源语言机器翻译(Machine Translation, MT)评估的挑战,尤其是在存在多种方言的语言中,由于语言多样性与数据稀缺性导致的评估困难。其解决方案的关键在于提出一种基于方言引导的框架,通过扩展ONUBAD数据集、引入方言特定词汇以弥补词汇缺口、添加回归头以实现标量评分预测,并设计方言引导提示策略,从而提升大语言模型(Large Language Models, LLMs)在无参考评估中的性能。

链接: https://arxiv.org/abs/2505.12273
作者: Md. Atiqur Rahman,Sabrina Islam,Mushfiqul Haque Omi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating machine translation (MT) for low-resource languages poses a persistent challenge, primarily due to the limited availability of high quality reference translations. This issue is further exacerbated in languages with multiple dialects, where linguistic diversity and data scarcity hinder robust evaluation. Large Language Models (LLMs) present a promising solution through reference-free evaluation techniques; however, their effectiveness diminishes in the absence of dialect-specific context and tailored guidance. In this work, we propose a comprehensive framework that enhances LLM-based MT evaluation using a dialect guided approach. We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers. To address the vocabulary gap, we augment the tokenizer vocabulary with dialect-specific terms. We further introduce a regression head to enable scalar score prediction and design a dialect-guided (DG) prompting strategy. Our evaluation across multiple LLMs shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation, along with improvements across other evaluation settings. The dataset and the code are available at this https URL.
zh

[NLP-157] K-MSHC: Unmasking Minimally Sufficient Head Circuits in Large Language Models with Experiments on Syntactic Classification Tasks

【速读】: 该论文试图解决中等规模语言模型(≤10B参数)中哪些神经组件驱动特定能力的问题,其解决方案的关键在于提出一种名为 (\bmK, \epsilon) -Minimum Sufficient Head Circuit (K-MSHC) 的方法,用于识别对分类任务至关重要的最小注意力头集合,并引入高效的Search-K-MSHC算法来发现这些电路。通过该方法,研究者分析了Gemma-9B模型在三种句法任务家族中的表现,揭示了任务特异性头电路的分布特征及其非线性重叠模式。

链接: https://arxiv.org/abs/2505.12268
作者: Pratim Chowdhary
机构: Dartmouth College (达特茅斯学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding which neural components drive specific capabilities in mid-sized language models ( \leq 10B parameters) remains a key challenge. We introduce the (\bmK, \epsilon) -Minimum Sufficient Head Circuit ( K -MSHC), a methodology to identify minimal sets of attention heads crucial for classification tasks as well as Search-K-MSHC, an efficient algorithm for discovering these circuits. Applying our Search-K-MSHC algorithm to Gemma-9B, we analyze three syntactic task families: grammar acceptability, arithmetic verification, and arithmetic word problems. Our findings reveal distinct task-specific head circuits, with grammar tasks predominantly utilizing early layers, word problems showing pronounced activity in both shallow and deep regions, and arithmetic verification demonstrating a more distributed pattern across the network. We discover non-linear circuit overlap patterns, where different task pairs share computational components at varying levels of importance. While grammar and arithmetic share many “weak” heads, arithmetic and word problems share more consistently critical “strong” heads. Importantly, we find that each task maintains dedicated “super-heads” with minimal cross-task overlap, suggesting that syntactic and numerical competencies emerge from specialized yet partially reusable head circuits.
zh

[NLP-158] Learning Auxiliary Tasks Improves Reference-Free Hallucination Detection in Open-Domain Long-Form Generation

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在开放领域长文本生成中产生的幻觉(hallucination)检测问题,即模型生成与事实不符的信息。现有方法要么局限于特定领域,要么依赖外部事实核查工具,而这些工具可能无法始终可用。论文的关键解决方案是通过引入一种新的范式——RATE-FT,该方法在微调(fine-tuning)基础上增加了一个辅助任务,使模型能够联合学习主要的幻觉检测任务,从而提升检测准确性。实验结果表明,该方法在LongFact数据集上相比通用微调方法提升了3%。

链接: https://arxiv.org/abs/2505.12265
作者: Chengwei Qin,Wenxuan Zhou,Karthik Abinav Sankararaman,Nanshu Wang,Tengyu Xu,Alexander Radovic,Eryk Helenowski,Arya Talebzadeh,Aditya Tayade,Sinong Wang,Shafiq Joty,Han Fang,Hao Ma
机构: HKUST(GZ); NTU; GenAI; Meta
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hallucination, the generation of factually incorrect information, remains a significant challenge for large language models (LLMs), especially in open-domain long-form generation. Existing approaches for detecting hallucination in long-form tasks either focus on limited domains or rely heavily on external fact-checking tools, which may not always be available. In this work, we systematically investigate reference-free hallucination detection in open-domain long-form responses. Our findings reveal that internal states (e.g., model’s output probability and entropy) alone are insufficient for reliably (i.e., better than random guessing) distinguishing between factual and hallucinated content. To enhance detection, we explore various existing approaches, including prompting-based methods, probing, and fine-tuning, with fine-tuning proving the most effective. To further improve the accuracy, we introduce a new paradigm, named RATE-FT, that augments fine-tuning with an auxiliary task for the model to jointly learn with the main task of hallucination detection. With extensive experiments and analysis using a variety of model families datasets, we demonstrate the effectiveness and generalizability of our method, e.g., +3% over general fine-tuning methods on LongFact. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2505.12265 [cs.CL] (or arXiv:2505.12265v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2505.12265 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-159] LightRetriever: A LLM -based Hybrid Retrieval Architecture with 1000x Faster Query Inference

【速读】: 该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的混合检索系统中查询编码效率低下的问题,特别是在实时查询场景下,深度参数化的LLMs导致查询推理吞吐量下降和在线部署资源需求增加。解决方案的关键在于设计一种轻量级的查询编码器,仅需执行嵌入查找操作即可完成查询编码,而保留全尺寸LLM用于文档编码,从而显著提升查询推理速度并降低资源消耗。

链接: https://arxiv.org/abs/2505.12260
作者: Guangyuan Ma,Yongliang Ma,Xuanrui Gou,Zhenpeng Su,Ming Zhou,Songlin Hu
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院); Langboat Technology(朗博科技)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs)-based hybrid retrieval uses LLMs to encode queries and documents into low-dimensional dense or high-dimensional sparse vectors. It retrieves documents relevant to search queries based on vector similarities. Documents are pre-encoded offline, while queries arrive in real-time, necessitating an efficient online query encoder. Although LLMs significantly enhance retrieval capabilities, serving deeply parameterized LLMs slows down query inference throughput and increases demands for online deployment resources. In this paper, we propose LightRetriever, a novel LLM-based hybrid retriever with extremely lightweight query encoders. Our method retains a full-sized LLM for document encoding, but reduces the workload of query encoding to no more than an embedding lookup. Compared to serving a full-sized LLM on an H800 GPU, our approach achieves over a 1000x speedup for query inference with GPU acceleration, and even a 20x speedup without GPU. Experiments on large-scale retrieval benchmarks demonstrate that our method generalizes well across diverse retrieval tasks, retaining an average of 95% full-sized performance.
zh

[NLP-160] ach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)评估方法发展滞后的问题,传统基准测试依赖于任务特定指标和静态数据集,存在公平性不足、可扩展性有限和数据污染风险。其解决方案的关键在于提出Teach2Eval框架,该框架受费曼技巧启发,通过让模型教授较弱的学生模型来间接评估模型的多种能力,将开放性任务转化为标准化多项选择题,从而实现可扩展、自动化且多维度的评估。该方法有效避免了数据泄露和记忆问题,并捕捉了与现有基准正交的广泛认知能力。

链接: https://arxiv.org/abs/2505.12259
作者: Yuhang Zhou,Xutian Chen,Yixin Cao,Yuchen Ni,Yu He,Siyu Tian,Xiang Liu,Jian Zhang,Chuanjun Ji,Guangnan Ye,Xipeng Qiu
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); NYU Shanghai (纽约大学上海分校); DataGrand Inc. (数据智能公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent progress in large language models (LLMs) has outpaced the development of effective evaluation methods. Traditional benchmarks rely on task-specific metrics and static datasets, which often suffer from fairness issues, limited scalability, and contamination risks. In this paper, we introduce Teach2Eval, an indirect evaluation framework inspired by the Feynman Technique. Instead of directly testing LLMs on predefined tasks, our method evaluates a model’s multiple abilities to teach weaker student models to perform tasks effectively. By converting open-ended tasks into standardized multiple-choice questions (MCQs) through teacher-generated feedback, Teach2Eval enables scalable, automated, and multi-dimensional assessment. Our approach not only avoids data leakage and memorization but also captures a broad range of cognitive abilities that are orthogonal to current benchmarks. Experimental results across 26 leading LLMs show strong alignment with existing human and model-based dynamic rankings, while offering additional interpretability for training guidance.
zh

[NLP-161] Not All Documents Are What You Need for Extracting Instruction Tuning Data

【速读】: 该论文试图解决指令微调(instruction tuning)中训练数据质量与多样性不足的问题,特别是在使用大型语言模型(LLMs)合成指令数据时,生成的指令往往缺乏多样性且与输入种子相似,限制了其在实际场景中的应用。解决方案的关键在于提出EQUAL框架,该框架通过迭代交替文档选择与高质量问答对(QA pair)提取的方式,从包含丰富知识的网络语料库中高效提取适用于指令微调的数据,从而降低计算成本并提升模型性能。

链接: https://arxiv.org/abs/2505.12250
作者: Chi Zhang,Huaping Zhong,Hongtao Li,Chengliang Chai,Jiawei Hong,Yuhao Deng,Jiacheng Wang,Tian Tan,Yizhou Yan,Jiantao Qiu,Ye Yuan,Guoren Wang,Conghui He,Lei Cao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instruction tuning improves the performance of large language models (LLMs), but it heavily relies on high-quality training data. Recently, LLMs have been used to synthesize instruction data using seed question-answer (QA) pairs. However, these synthesized instructions often lack diversity and tend to be similar to the input seeds, limiting their applicability in real-world scenarios. To address this, we propose extracting instruction tuning data from web corpora that contain rich and diverse knowledge. A naive solution is to retrieve domain-specific documents and extract all QA pairs from them, but this faces two key challenges: (1) extracting all QA pairs using LLMs is prohibitively expensive, and (2) many extracted QA pairs may be irrelevant to the downstream tasks, potentially degrading model performance. To tackle these issues, we introduce EQUAL, an effective and scalable data extraction framework that iteratively alternates between document selection and high-quality QA pair extraction to enhance instruction tuning. EQUAL first clusters the document corpus based on embeddings derived from contrastive learning, then uses a multi-armed bandit strategy to efficiently identify clusters that are likely to contain valuable QA pairs. This iterative approach significantly reduces computational cost while boosting model performance. Experiments on AutoMathText and StackOverflow across four downstream tasks show that EQUAL reduces computational costs by 5-10x and improves accuracy by 2.5 percent on LLaMA-3.1-8B and Mistral-7B
zh

[NLP-162] Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce

【速读】: 该论文试图解决如何系统性地理解生成式语言模型(Language Models, LMs)所生成的概率分布,以及不同分布的可诱导性差异问题。其解决方案的关键在于通过软或硬的基于梯度的提示调优方法,寻找能够使LM输出分布尽可能接近目标分布的提示(prompt),从而分析不同特性分布的诱导难度。研究揭示了低熵或高熵分布、包含“异常词”(outlier tokens)的分布以及由LM自身生成的目标分布更容易被近似这一现象,为理解LM的表达能力及作为概率分布提案者的挑战提供了重要见解。

链接: https://arxiv.org/abs/2505.12244
作者: Haojin Wang,Zining Zhu,Freda Shi
机构: University of Waterloo (滑铁卢大学); Stevens Institute of Technology (斯蒂文斯理工学院); Vector Institute (向量研究所); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Autoregressive neural language models (LMs) generate a probability distribution over tokens at each time step given a prompt. In this work, we attempt to systematically understand the probability distributions that LMs can produce, showing that some distributions are significantly harder to elicit than others. Specifically, for any target next-token distribution over the vocabulary, we attempt to find a prompt that induces the LM to output a distribution as close as possible to the target, using either soft or hard gradient-based prompt tuning. We find that (1) in general, distributions with very low or very high entropy are easier to approximate than those with moderate entropy; (2) among distributions with the same entropy, those containing ‘‘outlier tokens’’ are easier to approximate; (3) target distributions generated by LMs – even LMs with different tokenizers – are easier to approximate than randomly chosen targets. These results offer insights into the expressiveness of LMs and the challenges of using them as probability distribution proposers.
zh

[NLP-163] PANORAMA: A synthetic PII-laced dataset for studying sensitive data memorization in LLM s

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在训练过程中对敏感和个人可识别信息(Personally Identifiable Information, PII)的过度记忆所带来的隐私风险问题。现有研究在分析PII数据的记忆现象及开发缓解策略时面临缺乏全面、真实且符合伦理的数据集的挑战。论文提出的解决方案是构建一个名为PANORAMA的大型合成语料库,其关键在于通过约束选择生成内部一致的多属性虚拟人物档案,并利用零样本提示和OpenAI o3-mini模型生成多种类型的在线内容,以模拟网络环境中PII和敏感信息的分布、多样性和上下文特性,从而为隐私风险评估和模型审计提供可靠的数据支持。

链接: https://arxiv.org/abs/2505.12238
作者: Sriram Selvam,Anneswa Ghosh
机构: Microsoft(微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The memorization of sensitive and personally identifiable information (PII) by large language models (LLMs) poses growing privacy risks as models scale and are increasingly deployed in real-world applications. Existing efforts to study sensitive and PII data memorization and develop mitigation strategies are hampered by the absence of comprehensive, realistic, and ethically sourced datasets reflecting the diversity of sensitive information found on the web. We introduce PANORAMA - Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis, a large-scale synthetic corpus of 384,789 samples derived from 9,674 synthetic profiles designed to closely emulate the distribution, variety, and context of PII and sensitive data as it naturally occurs in online environments. Our data generation pipeline begins with the construction of internally consistent, multi-attribute human profiles using constrained selection to reflect real-world demographics such as education, health attributes, financial status, etc. Using a combination of zero-shot prompting and OpenAI o3-mini, we generate diverse content types - including wiki-style articles, social media posts, forum discussions, online reviews, comments, and marketplace listings - each embedding realistic, contextually appropriate PII and other sensitive information. We validate the utility of PANORAMA by fine-tuning the Mistral-7B model on 1x, 5x, 10x, and 25x data replication rates with a subset of data and measure PII memorization rates - revealing not only consistent increases with repetition but also variation across content types, highlighting PANORAMA’s ability to model how memorization risks differ by context. Our dataset and code are publicly available, providing a much-needed resource for privacy risk assessment, model auditing, and the development of privacy-preserving LLMs.
zh

[NLP-164] Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training IJCAI2025

【速读】: 该论文旨在解决少样本关系抽取(Few-Shot Relation Extraction, FSRE)任务中由于标注数据稀缺和现有模型泛化能力有限所带来的挑战。其解决方案的关键在于提出TKRE(Two-Stage Knowledge-Guided Pre-training for Relation Extraction)框架,该框架通过将大型语言模型(LLMs)与传统关系抽取模型相结合,融合生成式与判别式学习范式。TKRE的核心创新包括:利用LLMs生成基于解释的知识和符合模式约束的合成数据以缓解数据稀缺问题,以及采用两阶段预训练策略,结合掩码跨度语言建模(Masked Span Language Modeling, MSLM)和跨度级对比学习(Span-Level Contrastive Learning, SCL),以提升关系推理和泛化能力。

链接: https://arxiv.org/abs/2505.12236
作者: Quanjiang Guo,Jinchuan Zhang,Sijie Wang,Ling Tian,Zhao Kang,Bin Yan,Weidong Xiao
机构: University of Electronic Science and Technology of China (中国电子科技大学); Nanyang Technological University (南洋理工大学); Information Engineering University (信息工程大学); National University of Defense Technology (国防科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 6 figures, Appear on IJCAI 2025

点击查看摘要

Abstract:Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have demonstrated potential in FSRE through in-context learning (ICL), their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively tackle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. \footnoteThe code and data are released on this https URL.
zh

[NLP-165] Reward Inside the Model: A Lightweight Hidden-State Reward Model for LLM s Best-of-N sampling

【速读】: 该论文旨在解决现有奖励模型在计算成本高、参数量大以及难以应用于封闭源代码大语言模型(Large Language Models, LLMs)等问题。其解决方案的关键在于提出一种高效线性隐藏状态奖励(Efficient Linear Hidden State Reward, ELHSR)模型,该模型通过利用LLM隐藏状态中丰富的信息,实现了极低参数量下的优越性能,并且仅需少量样本即可完成训练,同时在计算效率和适用性方面表现出显著优势。

链接: https://arxiv.org/abs/2505.12225
作者: Jizhou Guo,Zhaomin Wu,Philip S. Yu
机构: Zhiyuan College, Shanghai Jiao Tong University (智源学院,上海交通大学); National University of Singapore (新加坡国立大学); University of Illinois at Chicago (伊利诺伊大学芝加哥分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:High-quality reward models are crucial for unlocking the reasoning potential of large language models (LLMs), with best-of-N voting demonstrating significant performance gains. However, current reward models, which typically operate on the textual output of LLMs, are computationally expensive and parameter-heavy, limiting their real-world applications. We introduce the Efficient Linear Hidden State Reward (ELHSR) model - a novel, highly parameter-efficient approach that leverages the rich information embedded in LLM hidden states to address these issues. ELHSR systematically outperform baselines with less than 0.005% of the parameters of baselines, requiring only a few samples for training. ELHSR also achieves orders-of-magnitude efficiency improvement with significantly less time and fewer FLOPs per sample than baseline reward models. Moreover, ELHSR exhibits robust performance even when trained only on logits, extending its applicability to some closed-source LLMs. In addition, ELHSR can also be combined with traditional reward models to achieve additional performance gains.
zh

[NLP-166] Examining Linguistic Shifts in Academic Writing Before and After the Launch of ChatGPT : A Study on Preprint Papers

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)对学术写作语言特征的潜在影响这一研究空白。其解决方案的关键在于通过大规模分析近十年来自arXiv数据集的823,798篇摘要,系统性地考察了LLMs在词汇偏好、词汇复杂度、句法复杂度、连贯性、可读性和情感等方面的影响,从而揭示LLMs在学术写作中的广泛渗透及其对语言特征的多方面改变。

链接: https://arxiv.org/abs/2505.12218
作者: Tong Bao,Yi Zhao,Jin Mao,Chengzhi Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), such as ChatGPT, have prompted academic concerns about their impact on academic writing. Existing studies have primarily examined LLM usage in academic writing through quantitative approaches, such as word frequency statistics and probability-based analyses. However, few have systematically examined the potential impact of LLMs on the linguistic characteristics of academic writing. To address this gap, we conducted a large-scale analysis across 823,798 abstracts published in last decade from arXiv dataset. Through the linguistic analysis of features such as the frequency of LLM-preferred words, lexical complexity, syntactic complexity, cohesion, readability and sentiment, the results indicate a significant increase in the proportion of LLM-preferred words in abstracts, revealing the widespread influence of LLMs on academic writing. Additionally, we observed an increase in lexical complexity and sentiment in the abstracts, but a decrease in syntactic complexity, suggesting that LLMs introduce more new vocabulary and simplify sentence structure. However, the significant decrease in cohesion and readability indicates that abstracts have fewer connecting words and are becoming more difficult to read. Moreover, our analysis reveals that scholars with weaker English proficiency were more likely to use the LLMs for academic writing, and focused on improving the overall logic and fluency of the abstracts. Finally, at discipline level, we found that scholars in Computer Science showed more pronounced changes in writing style, while the changes in Mathematics were minimal.
zh

[NLP-167] One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models ACL

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)剪枝方法在处理多用户同时压缩请求时效率低下的问题,现有方法虽然在单个请求上表现良好,但其处理时间随请求数量线性增加,难以满足实际应用场景的需求。解决方案的关键在于提出一种通用的定制化压缩模型(UniCuCo),其核心是引入StratNet,通过学习将任意压缩请求映射到最优的剪枝策略。为了解决StratNet训练中的计算成本高和剪枝过程不可微的问题,论文采用高斯过程(Gaussian Process)近似评估过程,并利用其可计算的梯度来替代不可微剪枝过程的梯度,从而实现StratNet的有效更新。

链接: https://arxiv.org/abs/2505.12216
作者: Rongguang Ye,Ming Tang
机构: Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL)
备注: ACL Findings

点击查看摘要

Abstract:Existing pruning methods for large language models (LLMs) focus on achieving high compression rates while maintaining model performance. Although these methods have demonstrated satisfactory performance in handling a single user’s compression request, their processing time increases linearly with the number of requests, making them inefficient for real-world scenarios with multiple simultaneous requests. To address this limitation, we propose a Univeral Model for Customized Compression (UniCuCo) for LLMs, which introduces a StratNet that learns to map arbitrary requests to their optimal pruning strategy. The challenge in training StratNet lies in the high computational cost of evaluating pruning strategies and the non-differentiable nature of the pruning process, which hinders gradient backpropagation for StratNet updates. To overcome these challenges, we leverage a Gaussian process to approximate the evaluation process. Since the gradient of the Gaussian process is computable, we can use it to approximate the gradient of the non-differentiable pruning process, thereby enabling StratNet updates. Experimental results show that UniCuCo is 28 times faster than baselines in processing 64 requests, while maintaining comparable accuracy to baselines.
zh

[NLP-168] GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文场景中面临的计算效率低和冗余信息多的问题。其解决方案的关键在于提出一种基于编码器-解码器架构的上下文压缩框架GMSA,该框架通过减少输入序列长度和冗余信息来提升性能。GMSA的核心组件包括Group Merging和Layer Semantic Alignment (LSA),前者用于高效提取上下文的摘要向量,后者则用于对齐高层摘要向量与底层原始语义,从而弥合不同层间的语义差距。

链接: https://arxiv.org/abs/2505.12215
作者: Jiwei Tang,Zhicheng Zhang,Shunlong Wu,Jingheng Ye,Lichen Bai,Zitai Wang,Tingwei Lu,Jiaqi Chen,Lin Hai,Hai-Tao Zheng,Hong-Gee Kim
机构: Tsinghua University (清华大学); Pengcheng Laboratory (鹏城实验室); Seoul National University (首尔大学)
类目: Computation and Language (cs.CL)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:Large language models (LLMs) have achieved impressive performance in a variety of natural language processing (NLP) tasks. However, when applied to long-context scenarios, they face two challenges, i.e., low computational efficiency and much redundant information. This paper introduces GMSA, a context compression framework based on the encoder-decoder architecture, which addresses these challenges by reducing input sequence length and redundant information. Structurally, GMSA has two key components: Group Merging and Layer Semantic Alignment (LSA). Group merging is used to effectively and efficiently extract summary vectors from the original context. Layer semantic alignment, on the other hand, aligns the high-level summary vectors with the low-level primary input semantics, thus bridging the semantic gap between different layers. In the training process, GMSA first learns soft tokens that contain complete semantics through autoencoder training. To furtherly adapt GMSA to downstream tasks, we propose Knowledge Extraction Fine-tuning (KEFT) to extract knowledge from the soft tokens for downstream tasks. We train GMSA by randomly sampling the compression rate for each sample in the dataset. Under this condition, GMSA not only significantly outperforms the traditional compression paradigm in context restoration but also achieves stable and significantly faster convergence with only a few encoder layers. In downstream question-answering (QA) tasks, GMSA can achieve approximately a 2x speedup in end-to-end inference while outperforming both the original input prompts and various state-of-the-art (SOTA) methods by a large margin.
zh

[NLP-169] Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning ACL2025

【速读】: 该论文试图解决在微调大型语言模型(Large Language Models, LLMs)时,如何高效选择最优子集以平衡性能与计算成本的问题。传统数据选择方法通常需要在目标数据集上微调评分模型,这既耗时又资源密集,或者依赖启发式方法,未能充分利用模型的预测能力。解决方案的关键在于提出Data Whisperer,这是一种无需训练、基于注意力机制的方法,通过利用待微调模型的少样本上下文学习能力来实现高效的数据选择。

链接: https://arxiv.org/abs/2505.12212
作者: Shaobo Wang,Ziming Wang,Xiangqi Jin,Jize Wang,Jiajun Zhang,Kaixin Li,Zichen Wen,Zhong Li,Conghui He,Xuming Hu,Linfeng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); EPIC Lab, SJTU (EPIC实验室,上海交通大学); Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学); Microsoft Research Asia (亚洲微软研究院); Shanghai AI Laboratory (上海人工智能实验室); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2025 main, 18 pages, 8 figures, 6 tables

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model’s predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4 \times speedup.
zh

[NLP-170] How Reliable is Multilingual LLM -as-a-Judge?

【速读】: 该论文试图解决生成式 AI (Generative AI) 在多语言评估中的一致性问题,即当前大型语言模型作为评判者(LLM-as-a-Judge)在不同语言间的判断结果缺乏稳定性。研究发现,模型在多语言任务中的平均Fleiss’ Kappa值仅为约0.3,部分模型表现更差,表明其可靠性不足。解决方案的关键在于提出一种集成策略,以提高多语言评判器在实际应用中的判断一致性。

链接: https://arxiv.org/abs/2505.12201
作者: Xiyan Fu,Wei Liu
机构: Heidelberg University (海德堡大学); Heidelberg Institute for Theoretical Studies gGmbH (海德堡理论研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM-as-a-Judge has emerged as a popular evaluation strategy, where advanced large language models assess generation results in alignment with human instructions. While these models serve as a promising alternative to human annotators, their reliability in multilingual evaluation remains uncertain. To bridge this gap, we conduct a comprehensive analysis of multilingual LLM-as-a-Judge. Specifically, we evaluate five models from different model families across five diverse tasks involving 25 languages. Our findings reveal that LLMs struggle to achieve consistent judgment results across languages, with an average Fleiss’ Kappa of approximately 0.3, and some models performing even worse. To investigate the cause of inconsistency, we analyze various influencing factors. We observe that consistency varies significantly across languages, with particularly poor performance in low-resource languages. Additionally, we find that neither training on multilingual data nor increasing model scale directly improves judgment consistency. These findings suggest that LLMs are not yet reliable for evaluating multilingual predictions. We finally propose an ensemble strategy which improves the consistency of the multilingual judge in real-world applications.
zh

[NLP-171] Vectors from Larger Language Models Predict Human Reading Time and fMRI Data More Poorly when Dimensionality Expansion is Controlled

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在预测人类阅读时间和脑成像数据时表现不佳的问题,特别是探讨模型规模扩大是否能持续提升其与人类句法处理机制的匹配度。解决方案的关键在于使用完整的LLM向量进行评估,并控制不同规模模型中预测因子数量的差异,从而更准确地衡量模型性能随规模变化的趋势。研究结果表明,随着模型规模增大,预测效果反而下降,暗示LLMs与人类句法处理机制之间存在显著的不匹配,且这种不匹配在使用更大模型时更加严重。

链接: https://arxiv.org/abs/2505.12196
作者: Yi-Chien Lin,Hongao Zhu,William Schuler
机构: The Ohio State University (俄亥俄州立大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The impressive linguistic abilities of large language models (LLMs) have recommended them as models of human sentence processing, with some conjecturing a positive ‘quality-power’ relationship (Wilcox et al., 2023), in which language models’ (LMs’) fit to psychometric data continues to improve as their ability to predict words in context increases. This is important because it suggests that elements of LLM architecture, such as veridical attention to context and a unique objective of predicting upcoming words, reflect the architecture of the human sentence processing faculty, and that any inadequacies in predicting human reading time and brain imaging data may be attributed to insufficient model complexity, which recedes as larger models become available. Recent studies (Oh and Schuler, 2023) have shown this scaling inverts after a point, as LMs become excessively large and accurate, when word prediction probability (as information-theoretic surprisal) is used as a predictor. Other studies propose the use of entire vectors from differently sized LLMs, still showing positive scaling (Schrimpf et al., 2021), casting doubt on the value of surprisal as a predictor, but do not control for the larger number of predictors in vectors from larger LMs. This study evaluates LLM scaling using entire LLM vectors, while controlling for the larger number of predictors in vectors from larger LLMs. Results show that inverse scaling obtains, suggesting that inadequacies in predicting human reading time and brain imaging data may be due to substantial misalignment between LLMs and human sentence processing, which worsens as larger models are used.
zh

[NLP-172] Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理过程中存在的内容偏见问题,即模型常常将内容合理性(material inference)与逻辑有效性(formal inference)混淆,导致推理结果出现偏差。解决方案的关键在于通过激活控制(activation steering)技术对模型的推理过程进行干预,特别是通过对比激活控制方法实现对内容偏见的线性调控。研究进一步提出基于条件调整的动态控制策略,以提升模型在不同任务中的表现,其中引入的kNN-based方法(K-CAST)在提升逻辑推理准确性方面表现出显著效果。

链接: https://arxiv.org/abs/2505.12189
作者: Marco Valentino,Geonhee Kim,Dhairya Dalal,Zhixue Zhao,André Freitas
机构: University of Sheffield (谢菲尔德大学); Idiap Research Institute (Idiap研究机构); University of Manchester (曼彻斯特大学); University of Galway (高威大学); National Biomarker Centre, CRUK-MI (国家生物标志物中心,CRUK-MI)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Large language models (LLMs) frequently demonstrate reasoning limitations, often conflating content plausibility (i.e., material inference) with logical validity (i.e., formal inference). This can result in biased inferences, where plausible arguments are incorrectly deemed logically valid or vice versa. Mitigating this limitation is critical, as it undermines the trustworthiness and generalizability of LLMs in applications that demand rigorous logical consistency. This paper investigates the problem of mitigating content biases on formal reasoning through activation steering. Specifically, we curate a controlled syllogistic reasoning dataset to disentangle formal validity from content plausibility. After localising the layers responsible for formal and material inference, we investigate contrastive activation steering methods for test-time interventions. An extensive empirical analysis on different LLMs reveals that contrastive steering consistently supports linear control over content biases. However, we observe that a static approach is insufficient for improving all the tested models. We then leverage the possibility to control content effects by dynamically determining the value of the steering parameters via fine-grained conditional methods. We found that conditional steering is effective on unresponsive models, achieving up to 15% absolute improvement in formal reasoning accuracy with a newly introduced kNN-based method (K-CAST). Finally, additional experiments reveal that steering for content effects is robust to prompt variations, incurs minimal side effects on language modeling capabilities, and can partially generalize to out-of-distribution reasoning tasks. Practically, this paper demonstrates that activation-level interventions can offer a scalable strategy for enhancing the robustness of LLMs, contributing towards more systematic and unbiased formal reasoning.
zh

[NLP-173] EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective

【速读】: 该论文试图解决当前评估大型语言模型(Large Language Models, LLMs)编程能力时,过于依赖静态基准测试的准确性而忽视模型在编程任务中的鲁棒性问题。现有对抗攻击方法在评估鲁棒性时效果有限且结果不一致,难以实现跨不同LLMs的统一评估。论文提出的解决方案是EVALOOP框架,其关键在于从自洽性角度评估鲁棒性,即利用软件工程任务中固有的双重性(如代码生成与代码摘要),通过构建一个自包含的反馈循环机制,使LLM生成的输出作为下一轮输入,从而在无需外部攻击设置的情况下,周期性地评估模型的鲁棒性,并提供统一的评估指标。

链接: https://arxiv.org/abs/2505.12185
作者: Sen Fang,Weiyuan Ding,Bowen Xu
机构: NC State University (北卡罗来纳州立大学)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages, 11 figures

点击查看摘要

Abstract:Assessing the programming capabilities of Large Language Models (LLMs) is crucial for their effective use in software engineering. Current evaluations, however, predominantly measure the accuracy of generated code on static benchmarks, neglecting the critical aspect of model robustness during programming tasks. While adversarial attacks offer insights on model robustness, their effectiveness is limited and evaluation could be constrained. Current adversarial attack methods for robustness evaluation yield inconsistent results, struggling to provide a unified evaluation across different LLMs. We introduce EVALOOP, a novel assessment framework that evaluate the robustness from a self-consistency perspective, i.e., leveraging the natural duality inherent in popular software engineering tasks, e.g., code generation and code summarization. EVALOOP initiates a self-contained feedback loop: an LLM generates output (e.g., code) from an input (e.g., natural language specification), and then use the generated output as the input to produce a new output (e.g., summarizes that code into a new specification). EVALOOP repeats the process to assess the effectiveness of EVALOOP in each loop. This cyclical strategy intrinsically evaluates robustness without rely on any external attack setups, providing a unified metric to evaluate LLMs’ robustness in programming. We evaluate 16 prominent LLMs (e.g., GPT-4.1, O4-mini) on EVALOOP and found that EVALOOP typically induces a 5.01%-19.31% absolute drop in pass@1 performance within ten loops. Intriguingly, robustness does not always align with initial performance (i.e., one-time query); for instance, GPT-3.5-Turbo, despite superior initial code generation compared to DeepSeek-V2, demonstrated lower robustness over repeated evaluation loop.
zh

[NLP-174] Decoding the Mind of Large Language Models : A Quantitative Evaluation of Ideology and Biases

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在意识形态偏见和伦理问题上的评估难题,旨在通过实证研究揭示其内在偏见、思维模式及社会影响,以确保其伦理性和有效性。解决方案的关键在于提出一种新颖的评估框架,该框架通过量化分析436个二元选择问题(其中许多问题没有明确答案),揭示LLMs的意识形态偏见,并发现不同模型和语言之间的观点差异。此框架为评估LLMs的行为提供了一种灵活且定量的方法,有助于开发更符合社会价值观的人工智能系统。

链接: https://arxiv.org/abs/2505.12183
作者: Manari Hirose,Masato Uchida
机构: Waseda University (早稻田大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 23 pages, 5 figures, 17 tables

点击查看摘要

Abstract:The widespread integration of Large Language Models (LLMs) across various sectors has highlighted the need for empirical research to understand their biases, thought patterns, and societal implications to ensure ethical and effective use. In this study, we propose a novel framework for evaluating LLMs, focusing on uncovering their ideological biases through a quantitative analysis of 436 binary-choice questions, many of which have no definitive answer. By applying our framework to ChatGPT and Gemini, findings revealed that while LLMs generally maintain consistent opinions on many topics, their ideologies differ across models and languages. Notably, ChatGPT exhibits a tendency to change their opinion to match the questioner’s opinion. Both models also exhibited problematic biases, unethical or unfair claims, which might have negative societal impacts. These results underscore the importance of addressing both ideological and ethical considerations when evaluating LLMs. The proposed framework offers a flexible, quantitative method for assessing LLM behavior, providing valuable insights for the development of more socially aligned AI systems.
zh

[NLP-175] ruth Neurons

【速读】: 该论文试图解决语言模型在生成回答时可能产生不真实内容的问题,这一问题限制了其可靠性和安全性。论文提出的关键解决方案是识别语言模型中编码真实性(truthfulness)的神经元级表示,即“truth neurons”,这些神经元以与主体无关的方式编码真实性信息。通过实验验证了不同规模模型中truth neurons的存在,并证明了其在多个基准测试中的重要性,表明真实性机制并非局限于特定数据集。

链接: https://arxiv.org/abs/2505.12182
作者: Haohang Li,Yupeng Cao,Yangyang Yu,Jordan W. Suchow,Zining Zhu
机构: Stevens Institute of Technology (斯蒂文斯理工学院); Vector Institute (向量研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite their remarkable success and deployment across diverse workflows, language models sometimes produce untruthful responses. Our limited understanding of how truthfulness is mechanistically encoded within these models jeopardizes their reliability and safety. In this paper, we propose a method for identifying representations of truthfulness at the neuron level. We show that language models contain truth neurons, which encode truthfulness in a subject-agnostic manner. Experiments conducted across models of varying scales validate the existence of truth neurons, confirming that the encoding of truthfulness at the neuron level is a property shared by many language models. The distribution patterns of truth neurons over layers align with prior findings on the geometry of truthfulness. Selectively suppressing the activations of truth neurons found through the TruthfulQA dataset degrades performance both on TruthfulQA and on other benchmarks, showing that the truthfulness mechanisms are not tied to a specific dataset. Our results offer novel insights into the mechanisms underlying truthfulness in language models and highlight potential directions toward improving their trustworthiness and reliability.
zh

[NLP-176] Emotion Recognition for Low-Resource Turkish: Fine-Tuning BERTurk on TREMO and Testing on Xenophobic Political Discourse

【速读】: 该论文试图解决在土耳其语社交媒体中识别和分析情绪的问题,特别是在叙利亚难民涌入背景下反难民情绪的上升。解决方案的关键在于开发一种针对土耳其语的先进情感识别模型(Emotion Recognition Model, ERM),利用BERTurk和TREMO数据集,实现了92.62%的分类准确率,能够有效识别包括快乐、恐惧、愤怒、悲伤、厌恶和惊讶在内的多种情绪。该模型为计算社会科学研究提供了支持,尤其是在资源较少的语言中提升情感分析能力,并为市场营销、公共关系和危机管理等实际应用提供了实时情绪分析工具。

链接: https://arxiv.org/abs/2505.12160
作者: Darmawan Wicaksono,Hasri Akbar Awal Rozaq,Nevfel Boz
机构: Social Sciences University of Ankara (ASBU); Gazi University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Social media platforms like X (formerly Twitter) play a crucial role in shaping public discourse and societal norms. This study examines the term Sessiz Istila (Silent Invasion) on Turkish social media, highlighting the rise of anti-refugee sentiment amidst the Syrian refugee influx. Using BERTurk and the TREMO dataset, we developed an advanced Emotion Recognition Model (ERM) tailored for Turkish, achieving 92.62% accuracy in categorizing emotions such as happiness, fear, anger, sadness, disgust, and surprise. By applying this model to large-scale X data, the study uncovers emotional nuances in Turkish discourse, contributing to computational social science by advancing sentiment analysis in underrepresented languages and enhancing our understanding of global digital discourse and the unique linguistic challenges of Turkish. The findings underscore the transformative potential of localized NLP tools, with our ERM model offering practical applications for real-time sentiment analysis in Turkish-language contexts. By addressing critical areas, including marketing, public relations, and crisis management, these models facilitate improved decision-making through timely and accurate sentiment tracking. This highlights the significance of advancing research that accounts for regional and linguistic nuances.
zh

[NLP-177] he AI Gap: How Socioeconomic Status Affects Language Technology Interactions ACL

【速读】: 该论文试图解决社会经济地位(Socioeconomic Status, SES)对语言技术使用差异的影响问题,特别是生成式语言技术在不同SES群体间的使用模式和语言特征差异。其解决方案的关键在于通过实证调查收集了1,000名来自不同SES背景个体的语言技术使用数据,并分析了他们与大型语言模型(Large Language Models, LLMs)交互的6,482条提示,从而揭示了SES在语言使用频率、任务类型、交互风格及话题选择等方面的系统性差异。研究结果表明,SES差异导致了语言技术使用中的结构性不平等,强调了在开发语言技术时需考虑SES因素以减少数字鸿沟和AI差距。

链接: https://arxiv.org/abs/2505.12158
作者: Elisa Bassignana,Amanda Cercas Curry,Dirk Hovy
机构: IT University of Copenhagen (丹麦技术大学); CENTAI Institute; Bocconi University (博科尼大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL Main 2025

点击查看摘要

Abstract:Socioeconomic status (SES) fundamentally influences how people interact with each other and more recently, with digital technologies like Large Language Models (LLMs). While previous research has highlighted the interaction between SES and language technology, it was limited by reliance on proxy metrics and synthetic data. We survey 1,000 individuals from diverse socioeconomic backgrounds about their use of language technologies and generative AI, and collect 6,482 prompts from their previous interactions with LLMs. We find systematic differences across SES groups in language technology usage (i.e., frequency, performed tasks), interaction styles, and topics. Higher SES entails a higher level of abstraction, convey requests more concisely, and topics like ‘inclusivity’ and ‘travel’. Lower SES correlates with higher anthropomorphization of LLMs (using ‘‘hello’’ and ‘‘thank you’’) and more concrete language. Our findings suggest that while generative language technologies are becoming more accessible to everyone, socioeconomic linguistic differences still stratify their use to exacerbate the digital divide. These differences underscore the importance of considering SES in developing language technologies to accommodate varying linguistic needs rooted in socioeconomic factors and limit the AI Gap across SES groups.
zh

[NLP-178] LLM -BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在交互式环境中的规划与推理能力评估问题,以推动具备高效智能的AI代理的发展。其解决方案的关键在于提出一个名为LLM-BabyBench的基准测试套件,该套件基于文本化的程序生成BabyAI网格世界,从三个核心维度评估LLMs的具身智能:预测动作对环境状态的影响(Predict任务)、生成实现目标的低级动作序列(Plan任务)以及将高层指令分解为连贯的子目标序列(Decompose任务)。通过从专家代理在文本环境中的行为中提取结构化信息来构建相应的数据集,并提供标准化的评估工具和指标,以支持多样LLMs的可重复评估。

链接: https://arxiv.org/abs/2505.12135
作者: Omar Choukrani,Idriss Malek,Daniil Orel,Zhuohan Xie,Zangir Iklassov,Martin Takáč,Salem Lahlou
机构: MBZUAI(穆巴达拉科学技术研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce \textbfLLM-BabyBench , a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state ( \textbfPredict task), (2) generating sequences of low-level actions to achieve specified objectives ( \textbfPlan task), and (3) decomposing high-level instructions into coherent subgoal sequences ( \textbfDecompose task). We detail the methodology for generating the three corresponding datasets ( \textttLLM-BabyBench-Predict , \texttt-Plan , \texttt-Decompose ) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available ( \hrefthis https URL\textGitHub , \hrefthis https URL\textHuggingFace ).
zh

[NLP-179] A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings

【速读】: 该论文旨在解决低资源语言中网络内容审核的不足问题,特别是针对提格利尼亚语(Tigrinya)社交媒体中的侮辱性语言检测问题。其关键解决方案是构建了一个大规模的人工标注多任务基准数据集,包含13,717条YouTube评论,涵盖侮辱性、情感和主题分类三个任务,并采用迭代术语聚类方法进行有效数据选择。此外,考虑到约64%的提格利尼亚语社交媒体内容使用罗马化转写而非本地盖兹字母(Ge’ez script),数据集同时支持两种书写系统以反映实际语言使用情况。

链接: https://arxiv.org/abs/2505.12116
作者: Fitsum Gaim,Hoyun Song,Huije Lee,Changgeon Ko,Eui Jun Hwang,Jong C. Park
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Content moderation research has recently made significant advances, but still fails to serve the majority of the world’s languages due to the lack of resources, leaving millions of vulnerable users to online hostility. This work presents a large-scale human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media with joint annotations for three tasks: abusiveness, sentiment, and topic classification. The dataset comprises 13,717 YouTube comments annotated by nine native speakers, collected from 7,373 videos with a total of over 1.2 billion views across 51 channels. We developed an iterative term clustering approach for effective data selection. Recognizing that around 64% of Tigrinya social media content uses Romanized transliterations rather than native Ge’ez script, our dataset accommodates both writing systems to reflect actual language use. We establish strong baselines across the tasks in the benchmark, while leaving significant challenges for future contributions. Our experiments reveal that small, specialized multi-task models outperform the current frontier models in the low-resource setting, achieving up to 86% accuracy (+7 points) in abusiveness detection. We make the resources publicly available to promote research on online safety.
zh

[NLP-180] Improving Fairness in LLM s Through Testing-Time Adversaries

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成结果中存在偏见的问题,这种偏见可能影响其在涉及伦理敏感性和负责任决策任务中的应用。解决方案的关键在于通过生成给定句子的多种变体并比较其预测行为,从而检测和减轻模型中的偏见,该方法仅依赖于前向传递,无需训练、微调或对训练数据分布的先验知识。

链接: https://arxiv.org/abs/2505.12100
作者: Isabela Pereira Gregio,Ian Pons,Anna Helena Reali Costa,Artur Jordão
机构: Escola Politécnica, Universidade de São Paulo (圣保罗大学工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) push the bound-aries in natural language processing and generative AI, driving progress across various aspects of modern society. Unfortunately, the pervasive issue of bias in LLMs responses (i.e., predictions) poses a significant and open challenge, hindering their application in tasks involving ethical sensitivity and responsible decision-making. In this work, we propose a straightforward, user-friendly and practical method to mitigate such biases, enhancing the reliability and trustworthiness of LLMs. Our method creates multiple variations of a given sentence by modifying specific attributes and evaluates the corresponding prediction behavior compared to the original, unaltered, prediction/sentence. The idea behind this process is that critical ethical predictions often exhibit notable inconsistencies, indicating the presence of bias. Unlike previous approaches, our method relies solely on forward passes (i.e., testing-time adversaries), eliminating the need for training, fine-tuning, or prior knowledge of the training data distribution. Through extensive experiments on the popular Llama family, we demonstrate the effectiveness of our method in improving various fairness metrics, focusing on the reduction of disparities in how the model treats individuals from different racial groups. Specifically, using standard metrics, we improve the fairness in Llama3 in up to 27 percentage points. Overall, our approach significantly enhances fairness, equity, and reliability in LLM-generated results without parameter tuning or training data modifications, confirming its effectiveness in practical scenarios. We believe our work establishes an important step toward enabling the use of LLMs in tasks that require ethical considerations and responsible decision-making.
zh

[NLP-181] Personalized Author Obfuscation with Large Language Models

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在通过改写和改变写作风格来隐藏作者身份时的效度问题。研究发现,尽管LLMs总体上有效,但其效度在不同用户间呈现双峰分布,表明性能差异显著。解决方案的关键在于提出一种个性化提示方法,该方法优于标准提示技术,并在一定程度上缓解了双峰分布的问题。

链接: https://arxiv.org/abs/2505.12090
作者: Mohammad Shokri,Sarah Ita Levitan,Rivka Levitan
机构: The Graduate Center (研究生中心); Hunter College (亨特学院); Brooklyn College (布鲁克林学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we investigate the efficacy of large language models (LLMs) in obfuscating authorship by paraphrasing and altering writing styles. Rather than adopting a holistic approach that evaluates performance across the entire dataset, we focus on user-wise performance to analyze how obfuscation effectiveness varies across individual authors. While LLMs are generally effective, we observe a bimodal distribution of efficacy, with performance varying significantly across users. To address this, we propose a personalized prompting method that outperforms standard prompting techniques and partially mitigates the bimodality issue.
zh

[NLP-182] Model Merging in Pre-training of Large Language Models

【速读】: 该论文旨在解决大规模预训练过程中模型合并(model merging)技术的应用与优化问题,尽管模型合并在增强大型语言模型方面展现出潜力,但其在大规模预训练中的应用仍相对未被深入探索。论文提出了一种全面的模型合并方法,在密集架构和专家混合(Mixture-of-Experts, MoE)架构上进行了大量实验,结果表明,使用恒定学习率训练的检查点进行合并不仅显著提升了模型性能,还能够准确预测退火行为,从而提高了模型开发效率并降低了训练成本。解决方案的关键在于通过系统性的消融实验分析合并策略与超参数,揭示了模型合并的潜在机制并发现了新的应用场景。

链接: https://arxiv.org/abs/2505.12082
作者: Yunshui Li,Yiyuan Ma,Shen Yan,Chaoyi Zhang,Jing Liu,Jianqiao Lu,Ziwen Xu,Mengzhao Chen,Minrui Wang,Shiyi Zhan,Jin Ma,Xunhao Lai,Yao Luo,Xingyan Bin,Hongbin Ren,Mingji Han,Wenhao Hao,Bairen Yi,LingJun Liu,Bole Ma,Xiaoying Jia,Zhou Xun,Liang Xiang,Yonghui Wu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.
zh

[NLP-183] Do different prompting methods yield a common task representation in language models?

【速读】: 该论文试图解决的问题是:不同的任务呈现方式(如示范和指令)是否会导致语言模型在上下文学习(ICL)中产生相似的任务表示。解决方案的关键在于利用函数向量(function vectors),这是一种用于提取少样本ICL任务表示的机制,并将其扩展到不同的任务呈现形式,特别是简短的文本指令提示,从而成功提取出能够提升零样本任务准确性的指令函数向量。研究发现,基于示范和指令的函数向量依赖于模型的不同组件,表明不同的任务呈现方式并未引发统一的任务表示,而是激发了部分重叠但不同的机制。

链接: https://arxiv.org/abs/2505.12075
作者: Guy Davidson,Todd M. Gureckis,Brenden M. Lake,Adina Williams
机构: FAIR at Meta (FAIR at Meta); New York University (纽约大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 4 figures; under review

点击查看摘要

Abstract:Demonstrations and instructions are two primary approaches for prompting language models to perform in-context learning (ICL) tasks. Do identical tasks elicited in different ways result in similar representations of the task? An improved understanding of task representation mechanisms would offer interpretability insights and may aid in steering models. We study this through function vectors, recently proposed as a mechanism to extract few-shot ICL task representations. We generalize function vectors to alternative task presentations, focusing on short textual instruction prompts, and successfully extract instruction function vectors that promote zero-shot task accuracy. We find evidence that demonstration- and instruction-based function vectors leverage different model components, and offer several controls to dissociate their contributions to task performance. Our results suggest that different task presentations do not induce a common task representation but elicit different, partly overlapping mechanisms. Our findings offer principled support to the practice of combining textual instructions and task demonstrations, imply challenges in universally monitoring task inference across presentation forms, and encourage further examinations of LLM task inference mechanisms.
zh

[NLP-184] Historical and psycholinguistic perspectives on morphological productivity: A sketch of an integrative approach

【速读】: 该论文试图解决形态学产出性(morphological productivity)的评估问题,具体从认知-计算视角和历时视角进行探讨。在认知-计算视角下,研究利用判别词库模型(Discriminative Lexicon Model, DLM)分析形式空间与语义空间之间的系统性关系,以判断新词是否可被理解和生成;在历时视角下,通过分析作家托马斯·曼(Thomas Mann)的语言输入与输出,探讨其词汇生成能力的变化。解决方案的关键在于通过DLM识别词素单元与词嵌入中心点之间的关联,并基于语义嵌入距离评估词素的产出性,同时指出使用特定说话者嵌入处理低频和新词所面临的挑战。

链接: https://arxiv.org/abs/2505.12071
作者: Harald Baayen,Kristian Berg,Maziyah Mohamed
机构: University of Tübingen (图宾根大学); University of Oldenburg (奥尔登堡大学)
类目: Computation and Language (cs.CL)
备注: 35 pages, 11 figures

点击查看摘要

Abstract:In this study, we approach morphological productivity from two perspectives: a cognitive-computational perspective, and a diachronic perspective zooming in on an actual speaker, Thomas Mann. For developing the first perspective, we make use of a cognitive computational model of the mental lexicon, the discriminative lexicon model. For computational mappings between form and meaning to be productive, in the sense that novel, previously unencountered words, can be understood and produced, there must be systematicities between the form space and the semantic space. If the relation between form and meaning would be truly arbitrary, a model could memorize form and meaning pairings, but there is no way in which the model would be able to generalize to novel test data. For Finnish nominal inflection, Malay derivation, and English compounding, we explore, using the Discriminative Lexicon Model as a computational tool, to trace differences in the degree to which inflectional and word formation patterns are productive. We show that the DLM tends to associate affix-like sublexical units with the centroids of the embeddings of the words with a given affix. For developing the second perspective, we study how the intake and output of one prolific writer, Thomas Mann, changes over time. We show by means of an examination of what Thomas Mann is likely to have read, and what he wrote, that the rate at which Mann produces novel derived words is extremely low. There are far more novel words in his input than in his output. We show that Thomas Mann is less likely to produce a novel derived word with a given suffix the greater the average distance is of the embeddings of all derived words to the corresponding centroid, and discuss the challenges of using speaker-specific embeddings for low-frequency and novel words.
zh

[NLP-185] Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents

【速读】: 该论文旨在解决基于大型语言模型(Large Language Model, LLM)的搜索代理在处理复杂任务时存在的效率瓶颈问题。具体而言,现有方法在精确检索和粗粒度检索中均存在性能下降,同时系统设计中的调度不当和频繁的检索阻塞导致端到端推理延迟增加。为了解决这些问题,作者提出了SearchAgent-X,其关键在于采用高召回率的近似检索,并结合优先级感知调度和非阻塞检索两项核心技术,从而在不牺牲生成质量的前提下显著提升系统吞吐量并降低延迟。

链接: https://arxiv.org/abs/2505.12065
作者: Tiannuo Yang,Zebin Yao,Bowen Jin,Lixiao Cui,Yusen Li,Gang Wang,Xiaoguang Liu
机构: Nankai University (南开大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency – where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4 \times higher throughput and 5 \times lower latency, without compromising generation quality. SearchAgent-X is available at this https URL.
zh

[NLP-186] Why Not Act on What You Know? Unleashing Safety Potential of LLM s via Self-Aware Guard Enhancement ACL2025

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对精心设计的越狱攻击(jailbreak attacks)时,尽管能够检测到攻击性提示,却仍可能生成不安全响应的安全漏洞问题。解决方案的关键在于提出一种无需训练的防御策略SAGE(Self-Aware Guard Enhancement),其核心是通过两个核心组件——判别分析模块和判别响应模块——增强模型对复杂越狱尝试的抵御能力,从而将模型强大的安全判别能力与相对较弱的安全生成能力相匹配。

链接: https://arxiv.org/abs/2505.12060
作者: Peng Ding,Jun Kuang,Zongyu Wang,Xuezhi Cao,Xunliang Cai,Jiajun Chen,Shujian Huang
机构: Nanjing University (南京大学); Meituan Inc. (美团公司)
类目: Computation and Language (cs.CL)
备注: Acccepted by ACL 2025 Findings, 21 pages, 9 figures, 14 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. In this paper, we identify a critical safety gap: while LLMs are adept at detecting jailbreak prompts, they often produce unsafe responses when directly processing these inputs. Inspired by this insight, we propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs’ strong safety discrimination performance with their relatively weaker safety generation ability. SAGE consists of two core components: a Discriminative Analysis Module and a Discriminative Response Module, enhancing resilience against sophisticated jailbreak attempts through flexible safety discrimination instructions. Extensive experiments demonstrate SAGE’s effectiveness and robustness across various open-source and closed-source LLMs of different sizes and architectures, achieving an average 99% defense success rate against numerous complex and covert jailbreak methods while maintaining helpfulness on general benchmarks. We further conduct mechanistic interpretability analysis through hidden states and attention distributions, revealing the underlying mechanisms of this detection-generation discrepancy. Our work thus contributes to developing future LLMs with coherent safety awareness and generation behavior. Our code and datasets are publicly available at this https URL.
zh

[NLP-187] ny QA Benchmark: Ultra-Lightweight Synthetic Multilingual Dataset Generation Smoke-Tests for Continuous LLM Evaluation

【速读】: 该论文试图解决大型语言模型(Large-Language-Model, LLM)开发过程中因依赖重型基准测试而影响开发效率的问题,特别是在Prompt优化工具链中需要快速反馈以保持开发流程的连续性。解决方案的关键在于构建了一个超轻量级、多语言的烟雾测试套件Tiny QA Benchmark++(TQB++),其包含一个52项的英文黄金数据集和一个基于无供应商依赖的LiteLLM构建的微型合成数据生成器,使实践者能够自定义任意语言、领域或难度的小型数据包,并通过Croissant元数据和即插即用文件支持多种评估框架和持续集成工具,从而在不影响GPU预算的前提下实现高效的微基准测试集成。

链接: https://arxiv.org/abs/2505.12058
作者: Vincent Koc
机构: Comet ML, Inc. (Comet ML, 公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 pages, 7 figures, 3 tables. Includes expanded appendix full score matrices. Dataset code: HF Hub + GitHub + Pypi links in abstract. Core data and code Apache-2.0; synthetic packs eval-only

点击查看摘要

Abstract:Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual smoke-test suite designed to give large-language-model (LLM) pipelines a unit-test style safety net dataset that runs in seconds with minimal cost. Born out of the tight feedback-loop demands building the Comet Opik prompt-optimization SDK, where waiting on heavyweight benchmarks breaks developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with a tiny synthetic-data generator pypi package built on provider-agnostic LiteLLM. The generator lets practitioners mint their own tiny packs in any language, domain, or difficulty, while ten ready-made packs already cover Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so teams can drop deterministic micro-benchmarks directly into pull-request gates, prompt-engineering loops, and production dashboards without touching GPU budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet reliably flags prompt-template errors, tokenizer drift, and fine-tuning side-effects long before full-scale suites like MMLU or BIG-Bench would finish configuring. The entire framework is released to accelerate continuous, resource-efficient quality assurance across the generative-AI ecosystem.
zh

[NLP-188] GenderBench: Evaluation Suite for Gender Biases in LLM s

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)中存在性别偏见的问题,特别是评估和量化LLMs在生成文本中表现出的与性别相关的有害行为。解决方案的关键是提出GenderBench——一个全面的评估套件,包含14个探测器,用于量化19种与性别相关的行为偏差,并通过开源和可扩展的库形式发布,以提高基准测试的可重复性和鲁棒性。

链接: https://arxiv.org/abs/2505.12054
作者: Matúš Pikuliak
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present GenderBench – a comprehensive evaluation suite designed to measure gender biases in LLMs. GenderBench includes 14 probes that quantify 19 gender-related harmful behaviors exhibited by LLMs. We release GenderBench as an open-source and extensible library to improve the reproducibility and robustness of benchmarking across the field. We also publish our evaluation of 12 LLMs. Our measurements reveal consistent patterns in their behavior. We show that LLMs struggle with stereotypical reasoning, equitable gender representation in generated texts, and occasionally also with discriminatory behavior in high-stakes scenarios, such as hiring.
zh

[NLP-189] ABoN: Adaptive Best-of-N Alignment

【速读】: 该论文试图解决在测试阶段对语言模型(Language Model, LM)进行对齐时计算资源分配效率低的问题,尤其是在统一应用对齐方法于不同提示(prompt)时未能考虑对齐难度差异导致的计算成本过高。解决方案的关键在于提出一种提示自适应的Best-of-N对齐策略,通过两阶段算法实现推理时计算资源的高效分配:第一阶段使用较小的探索预算估计每个提示的奖励分布,第二阶段基于这些估计结果自适应地分配剩余预算。该方法简单、实用,并且兼容任何LM/RM组合。

链接: https://arxiv.org/abs/2505.12050
作者: Vinod Raman,Hilal Asi,Satyen Kale
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages

点击查看摘要

Abstract:Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RM). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency concerns, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM/RM combination. Empirical results on the AlpacaEval dataset for 12 LM/RM pairs and 50 different batches of prompts show that our adaptive strategy consistently outperforms the uniform allocation with the same inference budget. Moreover, our experiments show that our adaptive strategy remains competitive against uniform allocations with 20% larger inference budgets and even improves in performance as the batch size grows.
zh

[NLP-190] MoL for LLM s: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在领域特定应用中出现的幻觉和准确性限制问题,以及传统课程预训练(Curriculum Pre-Training, CPT)方法中存在的领域数据偏差导致通用语言能力退化和语料混合比例不当影响有效适应的问题。其解决方案的关键在于提出一种新的框架——损失混合(Mixture of Losses, MoL),通过解耦领域数据与通用语料的优化目标,采用交叉熵(Cross-Entropy, CE)损失确保领域知识获取,同时利用Kullback-Leibler(KL)散度对齐通用语料训练与基础模型的通用能力,从而在保持通用技能的同时增强领域专长,避免灾难性遗忘。

链接: https://arxiv.org/abs/2505.12043
作者: Jingxue Chen,Qingkun Tang,Qianchun Lu,Siyuan Fang
机构: ZTE Corporation (中兴通讯); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Although LLMs perform well in general tasks, domain-specific applications suffer from hallucinations and accuracy limitations. CPT approaches encounter two key issues: (1) domain-biased data degrades general language skills, and (2) improper corpus-mixture ratios limit effective adaptation. To address these, we propose a novel framework, Mixture of Losses (MoL), which decouples optimization objectives for domain-specific and general corpora. Specifically, cross-entropy (CE) loss is applied to domain data to ensure knowledge acquisition, while Kullback-Leibler (KL) divergence aligns general-corpus training with the base model’s foundational capabilities. This dual-loss architecture preserves universal skills while enhancing domain expertise, avoiding catastrophic forgetting. Empirically, we validate that a 1:1 domain-to-general corpus ratio optimally balances training and overfitting without the need for extensive tuning or resource-intensive experiments. Furthermore, our experiments demonstrate significant performance gains compared to traditional CPT approaches, which often suffer from degradation in general language capabilities; our model achieves 27.9% higher accuracy on the Math-500 benchmark in the non-think reasoning mode, and an impressive 83.3% improvement on the challenging AIME25 subset in the think mode, underscoring the effectiveness of our approach.
zh

[NLP-191] AI-Driven Automation Can Become the Foundation of Next-Era Science of Science Research

【速读】: 该论文试图解决传统科学学(Science of Science, SoS)方法在捕捉现代科研生态系统复杂性和规模上的不足,这些问题通常由于依赖简化假设和基础统计工具(如线性回归和基于规则的模拟)而难以有效解决。论文提出的解决方案的关键在于利用人工智能(Artificial Intelligence, AI)技术,实现大规模模式发现的自动化,并揭示此前无法获取的见解,从而推动科学效率和创新的提升。AI的优势体现在其处理复杂数据和动态系统的能力,而论文也探讨了相关限制并提出了克服路径。

链接: https://arxiv.org/abs/2505.12039
作者: Renqi Chen,Haoyang Su,Shixiang Tang,Zhenfei Yin,Qi Wu,Hui Li,Ye Sun,Nanqing Dong,Wanli Ouyang,Philip Torr
机构: 1(第一单位); 2(第二单位); 3(第三单位); 4(第四单位); 5(第五单位); 6(第六单位); 11(第十一单位); 222(第二百二十二单位)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:The Science of Science (SoS) explores the mechanisms underlying scientific discovery, and offers valuable insights for enhancing scientific efficiency and fostering innovation. Traditional approaches often rely on simplistic assumptions and basic statistical tools, such as linear regression and rule-based simulations, which struggle to capture the complexity and scale of modern research ecosystems. The advent of artificial intelligence (AI) presents a transformative opportunity for the next generation of SoS, enabling the automation of large-scale pattern discovery and uncovering insights previously unattainable. This paper offers a forward-looking perspective on the integration of Science of Science with AI for automated research pattern discovery and highlights key open challenges that could greatly benefit from AI. We outline the advantages of AI over traditional methods, discuss potential limitations, and propose pathways to overcome them. Additionally, we present a preliminary multi-agent system as an illustrative example to simulate research societies, showcasing AI’s ability to replicate real-world research patterns and accelerate progress in Science of Science research.
zh

[NLP-192] owards Comprehensive Argument Analysis in Education: Dataset Tasks and Method ACL2025

【速读】: 该论文旨在解决现有论点关系过于简单化、难以全面捕捉论点信息的问题,尤其是在复杂现实场景中对论点结构的表示存在不足。其解决方案的关键在于提出14种细粒度的论点关系类型,从垂直和水平两个维度出发,以更全面地捕捉论点成分之间的复杂互动,从而深入理解论点结构。

链接: https://arxiv.org/abs/2505.12028
作者: Yupei Ren,Xinyi Zhou,Ning Zhang,Shangqing Zhao,Man Lan,Xiaopeng Bai
机构: Lab of Artificial Intelligence for Education, East China Normal University (教育人工智能实验室,华东师范大学); Shanghai Institute of Artificial Intelligence for Education, East China Normal University (教育人工智能上海研究院,华东师范大学); School of Computer Science and Technology, East China Normal University (计算机科学与技术学院,华东师范大学); Department of Chinese Language and Literature, East China Normal University (中国语言文学系,华东师范大学); College of Education, Zhejiang University (教育学院,浙江大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025; 13 pages, 3 figures

点击查看摘要

Abstract:Argument mining has garnered increasing attention over the years, with the recent advancement of Large Language Models (LLMs) further propelling this trend. However, current argument relations remain relatively simplistic and foundational, struggling to capture the full scope of argument information, particularly when it comes to representing complex argument structures in real-world scenarios. To address this limitation, we propose 14 fine-grained relation types from both vertical and horizontal dimensions, thereby capturing the intricate interplay between argument components for a thorough understanding of argument structure. On this basis, we conducted extensive experiments on three tasks: argument component detection, relation prediction, and automated essay grading. Additionally, we explored the impact of writing quality on argument component detection and relation prediction, as well as the connections between discourse relations and argumentative features. The findings highlight the importance of fine-grained argumentative annotations for argumentative writing quality assessment and encourage multi-dimensional argument analysis.
zh

[NLP-193] Unveiling Knowledge Utilization Mechanisms in LLM -based Retrieval-Augmented Generation SIGIR2025

【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的检索增强生成(Retrieval-Augmented Generation, RAG)系统中,内部参数化知识与外部检索知识融合机制不明确的问题。其解决方案的关键在于通过知识流分析,将知识利用过程分解为四个阶段:知识精炼、知识提取、知识表达和知识争论,并引入知识激活概率熵(Knowledge Activation Probability Entropy, KAPE)方法,用于识别与内部或外部知识相关的神经元,从而实现对模型知识依赖性的定向调控。此外,研究还揭示了多头注意力和多层感知机在知识构建中的互补作用,为提升RAG系统的可解释性与可靠性提供了理论基础。

链接: https://arxiv.org/abs/2505.11995
作者: Yuhao Wang,Ruiyang Ren,Yucheng Wang,Wayne Xin Zhao,Jing Liu,Hua Wu,Haifeng Wang
机构: Renmin University of China (中国人民大学); Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL)
备注: SIGIR 2025

点击查看摘要

Abstract:Considering the inherent limitations of parametric knowledge in large language models (LLMs), retrieval-augmented generation (RAG) is widely employed to expand their knowledge scope. Since RAG has shown promise in knowledge-intensive tasks like open-domain question answering, its broader application to complex tasks and intelligent assistants has further advanced its utility. Despite this progress, the underlying knowledge utilization mechanisms of LLM-based RAG remain underexplored. In this paper, we present a systematic investigation of the intrinsic mechanisms by which LLMs integrate internal (parametric) and external (retrieved) knowledge in RAG scenarios. Specially, we employ knowledge stream analysis at the macroscopic level, and investigate the function of individual modules at the microscopic level. Drawing on knowledge streaming analyses, we decompose the knowledge utilization process into four distinct stages within LLM layers: knowledge refinement, knowledge elicitation, knowledge expression, and knowledge contestation. We further demonstrate that the relevance of passages guides the streaming of knowledge through these stages. At the module level, we introduce a new method, knowledge activation probability entropy (KAPE) for neuron identification associated with either internal or external knowledge. By selectively deactivating these neurons, we achieve targeted shifts in the LLM’s reliance on one knowledge source over the other. Moreover, we discern complementary roles for multi-head attention and multi-layer perceptron layers during knowledge formation. These insights offer a foundation for improving interpretability and reliability in retrieval-augmented LLMs, paving the way for more robust and transparent generative solutions in knowledge-intensive domains.
zh

[NLP-194] Introduction to Analytical Software Engineering Design Paradigm

【速读】: 该论文试图解决现代软件系统在建模与构造过程中面临的复杂性问题,尤其是传统方法在设计模式检测、代码重构以及系统优化和长期可持续性方面的不足。解决方案的关键在于提出一种新的设计范式——分析型软件工程(Analytical Software Engineering, ASE),该范式通过平衡抽象、工具可访问性、兼容性和可扩展性,实现对复杂软件工程问题的有效建模与解决。ASE的核心在于其两个框架:行为-结构序列(Behavioral-Structural Sequences, BSS)和优化设计重构(Optimized Design Refactoring, ODR),分别用于精确的设计模式检测和通过启发式算法优化代码重构,从而减少计算开销。

链接: https://arxiv.org/abs/2505.11979
作者: Tarik Houichime,Younes El Amrani
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Mathematical Software (cs.MS); Programming Languages (cs.PL)
备注: The Conference’s autorization to submit a preprint was granted

点击查看摘要

Abstract:As modern software systems expand in scale and complexity, the challenges associated with their modeling and formulation grow increasingly intricate. Traditional approaches often fall short in effectively addressing these complexities, particularly in tasks such as design pattern detection for maintenance and assessment, as well as code refactoring for optimization and long-term sustainability. This growing inadequacy underscores the need for a paradigm shift in how such challenges are approached and resolved. This paper presents Analytical Software Engineering (ASE), a novel design paradigm aimed at balancing abstraction, tool accessibility, compatibility, and scalability. ASE enables effective modeling and resolution of complex software engineering problems. The paradigm is evaluated through two frameworks Behavioral-Structural Sequences (BSS) and Optimized Design Refactoring (ODR), both developed in accordance with ASE principles. BSS offers a compact, language-agnostic representation of codebases to facilitate precise design pattern detection. ODR unifies artifact and solution representations to optimize code refactoring via heuristic algorithms while eliminating iterative computational overhead. By providing a structured approach to software design challenges, ASE lays the groundwork for future research in encoding and analyzing complex software metrics.
zh

[NLP-195] An Annotated Corpus of Arabic Tweets for Hate Speech Analysis

【速读】: 该论文试图解决阿拉伯语中仇恨言论内容识别的问题,这一问题由于阿拉伯语方言的丰富性而具有挑战性。解决方案的关键在于构建一个多层次的阿拉伯语仇恨言论数据集,该数据集包含10000条阿拉伯语推文,并对每条推文进行标注,判断其是否包含冒犯性内容,若包含则进一步分类为不同的仇恨言论目标,如宗教、性别、政治、种族、起源等。此外,通过多标注者参与数据标注任务并计算标注者间的一致性,确保了数据的可靠性和质量。最终,采用基于Transformer的模型对数据集进行了评估,其中AraBERTv2在微F1分数和准确率上表现最佳。

链接: https://arxiv.org/abs/2505.11969
作者: Md. Rafiul Biswas,Wajdi Zaghouani
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Identifying hate speech content in the Arabic language is challenging due to the rich quality of dialectal variations. This study introduces a multilabel hate speech dataset in the Arabic language. We have collected 10000 Arabic tweets and annotated each tweet, whether it contains offensive content or not. If a text contains offensive content, we further classify it into different hate speech targets such as religion, gender, politics, ethnicity, origin, and others. A text can contain either single or multiple targets. Multiple annotators are involved in the data annotation task. We calculated the inter-annotator agreement, which was reported to be 0.86 for offensive content and 0.71 for multiple hate speech targets. Finally, we evaluated the data annotation task by employing a different transformers-based model in which AraBERTv2 outperformed with a micro-F1 score of 0.7865 and an accuracy of 0.786.
zh

[NLP-196] CCNU at SemEval-2025 Task 3: Leverag ing Internal and External Knowledge of Large Language Models for Multilingual Hallucination Annotation SEMEVAL-2025

【速读】: 该论文旨在解决问答系统中幻觉(hallucination)识别的问题,特别是在14种不同语言下的检测。其解决方案的关键在于利用多个具有不同专业领域的大型语言模型(Large Language Models, LLMs),并行进行幻觉标注,从而模拟众包标注过程。此外,每个基于LLM的标注器在标注过程中整合了与输入相关的内部和外部知识,提升了标注的准确性和全面性。

链接: https://arxiv.org/abs/2505.11965
作者: Xu Liu,Guanyi Chen
机构: Central China Normal University (中南民族大学)
类目: Computation and Language (cs.CL)
备注: SemEval-2025 Task 3

点击查看摘要

Abstract:We present the system developed by the Central China Normal University (CCNU) team for the Mu-SHROOM shared task, which focuses on identifying hallucinations in question-answering systems across 14 different languages. Our approach leverages multiple Large Language Models (LLMs) with distinct areas of expertise, employing them in parallel to annotate hallucinations, effectively simulating a crowdsourcing annotation process. Furthermore, each LLM-based annotator integrates both internal and external knowledge related to the input during the annotation process. Using the open-source LLM DeepSeek-V3, our system achieves the top ranking (#1) for Hindi data and secures a Top-5 position in seven other languages. In this paper, we also discuss unsuccessful approaches explored during our development process and share key insights gained from participating in this shared task.
zh

[NLP-197] EmoHopeSpeech: An Annotated Dataset of Emotions and Hope Speech in English

【速读】: 该论文旨在解决多情感(Emotion and hope)数据集在阿拉伯语和英语中的稀缺性问题,特别是针对情感和希望言论(hope speech)的标注数据不足。其解决方案的关键在于构建一个双语数据集,包含23,456条阿拉伯语条目和10,036条英语条目,并通过详细的标注涵盖情感强度、复杂性和原因,以及希望言论的分类与子类别。为确保标注的一致性,采用了Fleiss’ Kappa方法,获得了0.75-0.85的标注者间一致性,同时通过基准模型的微F1分数(micro-F1-Score=0.67)验证了数据标注的有效性。

链接: https://arxiv.org/abs/2505.11959
作者: Md. Rafiul Biswas,Wajdi Zaghouani
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This research introduces a bilingual dataset comprising 23,456 entries for Arabic and 10,036 entries for English, annotated for emotions and hope speech, addressing the scarcity of multi-emotion (Emotion and hope) datasets. The dataset provides comprehensive annotations capturing emotion intensity, complexity, and causes, alongside detailed classifications and subcategories for hope speech. To ensure annotation reliability, Fleiss’ Kappa was employed, revealing 0.75-0.85 agreement among annotators both for Arabic and English language. The evaluation metrics (micro-F1-Score=0.67) obtained from the baseline model (i.e., using a machine learning model) validate that the data annotations are worthy. This dataset offers a valuable resource for advancing natural language processing in underrepresented languages, fostering better cross-linguistic analysis of emotions and hope speech.
zh

[NLP-198] Counterspeech the ultimate shield! Multi-Conditioned Counterspeech Generation through Attributed Prefix Learning

【速读】: 该论文旨在解决现有对抗仇恨言论(counterspeech)生成方法仅基于单一属性进行条件生成的问题,从而导致响应不够细致和有效。其解决方案的关键在于提出HiPPrO(Hierarchical Prefix learning with Preference Optimization)框架,该框架通过分阶段优化属性特定的前缀嵌入空间,实现对多属性(如意图与情感)的协同建模,并结合参考与无奖励偏好优化策略,以生成更具建设性和相关性的对抗性言论。

链接: https://arxiv.org/abs/2505.11958
作者: Aswini Kumar Padhi,Anil Bandhakavi,Tanmoy Chakraborty
机构: IIT Delhi (印度理工学院德里分校); Logically.ai (Logically.ai)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Counterspeech has proven to be a powerful tool to combat hate speech online. Previous studies have focused on generating counterspeech conditioned only on specific intents (single attributed). However, a holistic approach considering multiple attributes simultaneously can yield more nuanced and effective responses. Here, we introduce HiPPrO, Hierarchical Prefix learning with Preference Optimization, a novel two-stage framework that utilizes the effectiveness of attribute-specific prefix embedding spaces hierarchically optimized during the counterspeech generation process in the first phase. Thereafter, we incorporate both reference and reward-free preference optimization to generate more constructive counterspeech. Furthermore, we extend IntentCONANv2 by annotating all 13,973 counterspeech instances with emotion labels by five annotators. HiPPrO leverages hierarchical prefix optimization to integrate these dual attributes effectively. An extensive evaluation demonstrates that HiPPrO achieves a ~38 % improvement in intent conformity and a ~3 %, ~2 %, ~3 % improvement in Rouge-1, Rouge-2, and Rouge-L, respectively, compared to several baseline models. Human evaluations further substantiate the superiority of our approach, highlighting the enhanced relevance and appropriateness of the generated counterspeech. This work underscores the potential of multi-attribute conditioning in advancing the efficacy of counterspeech generation systems.
zh

[NLP-199] ChartEdit: How Far Are MLLM s From Automating Chart Analysis? Evaluating MLLM s Capability via Chart Editing ACL2025

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图表编辑任务中的性能评估与能力局限问题。当前的评估方法多依赖于有限的案例研究,缺乏系统性的评价框架,导致对模型实际编辑能力的认知不足。为应对这一问题,作者提出了ChartEdit,一个高质量的图表编辑基准,包含1,405条多样化的编辑指令和233张真实图表,每条指令均经过人工标注与验证。该基准支持在代码和图表两个层面评估模型性能,揭示了现有大型模型在生成符合指令的准确编辑方面仍存在显著挑战,而小型模型则在遵循指令和生成整体图表图像方面表现更差,凸显了该领域仍需进一步研究与改进。

链接: https://arxiv.org/abs/2505.11935
作者: Xuanle Zhao,Xuexin Liu,Haoyue Yang,Xianzhen Luo,Fanhu Zeng,Jianling Li,Qi Shi,Chi Chen
机构: Institute of Automation, Chinese Academy of Sciences(自动化研究所,中国科学院); Harbin Institute of Technology(哈尔滨工业大学); Tianjin University(天津大学); Tsinghua University(清华大学)
类目: Computation and Language (cs.CL)
备注: Accept by ACL2025 Findings, preprint version

点击查看摘要

Abstract:Although multimodal large language models (MLLMs) show promise in generating chart rendering code, chart editing presents a greater challenge. This difficulty stems from its nature as a labor-intensive task for humans that also demands MLLMs to integrate chart understanding, complex reasoning, and precise intent interpretation. While many MLLMs claim such editing capabilities, current assessments typically rely on limited case studies rather than robust evaluation methodologies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a new high-quality benchmark designed for chart editing tasks. This benchmark comprises 1,405 diverse editing instructions applied to 233 real-world charts, with each instruction-chart instance having been manually annotated and validated for accuracy. Utilizing ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments, assessing them at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only 59.96 , highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and generating overall chart images, underscoring the need for further development in this area. Code is available at this https URL.
zh

[NLP-200] Neuro-Symbolic Query Compiler ACL2025

【速读】: 该论文试图解决在检索增强生成(Retrieval-Augmented Generation, RAG)系统中,如何精确识别搜索意图的问题,尤其是在资源受限和面对具有嵌套结构与依赖关系的复杂查询时。解决方案的关键在于提出QCompiler,这是一个受语言学语法规则和编译器设计启发的神经符号框架。其核心是理论设计了一个最小但充分的巴科斯-诺尔范式(Backus-Naur Form, BNF)语法G[q],以形式化复杂查询,该语法在保持完整性的同时最小化冗余。基于此,QCompiler通过查询表达式翻译器、词法语法解析器和递归下降处理器将查询编译为抽象语法树(Abstract Syntax Trees, ASTs),从而提升RAG系统处理复杂查询的能力。

链接: https://arxiv.org/abs/2505.11932
作者: Yuyao Zhang,Zhicheng Dou,Xiaoxi Li,Jiajie Jin,Yongkang Wu,Zhonghua Li,Qi Ye,Ji-Rong Wen
机构: Renmin University of China (中国人民大学); Huawei Poisson Lab (华为泊松实验室)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Findings of ACL2025, codes are available at this url: this https URL

点击查看摘要

Abstract:Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents QCompiler, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically designs a minimal yet sufficient Backus-Naur Form (BNF) grammar G[q] to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system’s ability to address complex queries.
zh

[NLP-201] An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts

【速读】: 该论文试图解决语言模型在无外部反馈的情况下通过内在自校正(intrinsic self-correction)提升性能的机制问题,特别是探讨提示(prompt)如何引起隐藏状态的可解释变化并影响输出分布。解决方案的关键在于假设每个提示引起的特征空间变化位于某些线性表示向量的线性空间内,从而根据个体概念对齐自然分离标记,并基于此提出自校正的数学表述及输出标记的集中性结果。实验表明,在有毒指令下,提示引起的特征变化与前100个最具毒性标记的解嵌入之间的内积显著大于与后100个最不具毒性标记的解嵌入之间的内积,这表明自校正提示增强了语言模型对潜在概念的识别能力。

链接: https://arxiv.org/abs/2505.11924
作者: Yu-Ting Lee,Hui-Ying Shih,Fu-Chieh Chang,Pei-Yuan Wu
机构: National Chengchi University (国立政治大学); National Tsing Hua University (国立清华大学); MediaTek Research (联发科技研究); National Taiwan University (台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We provide an explanation for the performance gains of intrinsic self-correction, a process where a language model iteratively refines its outputs without external feedback. More precisely, we investigate how prompting induces interpretable changes in hidden states and thus affects the output distributions. We hypothesize that each prompt-induced shift lies in a linear span of some linear representation vectors, naturally separating tokens based on individual concept alignment. Building around this idea, we give a mathematical formulation of self-correction and derive a concentration result for output tokens based on alignment magnitudes. Our experiments on text detoxification with zephyr-7b-sft reveal a substantial gap in the inner products of the prompt-induced shifts and the unembeddings of the top-100 most toxic tokens vs. those of the unembeddings of the bottom-100 least toxic tokens, under toxic instructions. This suggests that self-correction prompts enhance a language model’s capability of latent concept recognition. Our analysis offers insights into the underlying mechanism of self-correction by characterizing how prompting works explainably. For reproducibility, our code is available.
zh

[NLP-202] Enhancing Complex Instruction Following for Large Language Models with Mixture-of-Contexts Fine-tuning

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理包含多个约束的复杂指令时表现不佳的问题,尤其是在监督微调(Supervised Fine-Tuning, SFT)过程中由于对关键子上下文关注不足而影响微调效果的问题。其解决方案的关键在于将顺序结构的输入指令转换为包含子上下文的多个并行指令,并引入MISO(Multi-Input Single-Output)架构,该架构通过混合上下文范式同时考虑整体指令-输出对齐和个体子上下文的影响,从而提升SFT的有效性。

链接: https://arxiv.org/abs/2505.11922
作者: Yuheng Lu,ZiMeng Bai,Caixia Yuan,Huixing Jiang,Xiaojie Wang
机构: School of Artificial Intelligence, Beijing University of Posts and Telecommunications (人工智能学院,北京邮电大学); LI Auto Inc. (理想汽车公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit remarkable capabilities in handling natural language tasks; however, they may struggle to consistently follow complex instructions including those involve multiple constraints. Post-training LLMs using supervised fine-tuning (SFT) is a standard approach to improve their ability to follow instructions. In addressing complex instruction following, existing efforts primarily focus on data-driven methods that synthesize complex instruction-output pairs for SFT. However, insufficient attention allocated to crucial sub-contexts may reduce the effectiveness of SFT. In this work, we propose transforming sequentially structured input instruction into multiple parallel instructions containing subcontexts. To support processing this multi-input, we propose MISO (Multi-Input Single-Output), an extension to currently dominant decoder-only transformer-based LLMs. MISO introduces a mixture-of-contexts paradigm that jointly considers the overall instruction-output alignment and the influence of individual sub-contexts to enhance SFT effectiveness. We apply MISO fine-tuning to complex instructionfollowing datasets and evaluate it with standard LLM inference. Empirical results demonstrate the superiority of MISO as a fine-tuning method for LLMs, both in terms of effectiveness in complex instruction-following scenarios and its potential for training efficiency.
zh

[NLP-203] ELITE: Embedding-Less retrieval with Iterative Text Exploration

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理文档级或多轮任务时,由于长期上下文记忆能力有限而导致性能受限的问题。现有检索增强生成(Retrieval-Augmented Generation, RAG)系统通常依赖于基于嵌入的检索方法,该方法在语义相似性层面进行训练,可能导致检索到的内容与问题的真实意图不匹配。此外,一些RAG变体通过构建图结构或层次结构来提升检索准确性,但带来了显著的计算和存储开销。该论文提出了一种无需嵌入的检索框架,其关键在于利用LLMs的逻辑推理能力,通过迭代搜索空间精炼以及新颖的重要性度量进行检索,并在不显式构建图结构的情况下扩展逻辑相关的信息,从而在保持高性能的同时大幅降低存储和运行时间。

链接: https://arxiv.org/abs/2505.11908
作者: Zhangyu Wang,Siyuan Gao,Rong Zhou,Hao Wang,Li Ning
机构: University of Southern California (南加州大学); Jilin University (吉林大学); National Supercomputing Center in Shenzhen (深圳国家超算中心); Wuhan University (武汉大学); Stellaris AI Limited (Stellaris AI 有限公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive progress in natural language processing, but their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks. Retrieval-Augmented Generation (RAG) mitigates this by retrieving relevant information from an external corpus. However, existing RAG systems often rely on embedding-based retrieval trained on corpus-level semantic similarity, which can lead to retrieving content that is semantically similar in form but misaligned with the question’s true intent. Furthermore, recent RAG variants construct graph- or hierarchy-based structures to improve retrieval accuracy, resulting in significant computation and storage overhead. In this paper, we propose an embedding-free retrieval framework. Our method leverages the logical inferencing ability of LLMs in retrieval using iterative search space refinement guided by our novel importance measure and extend our retrieval results with logically related information without explicit graph construction. Experiments on long-context QA benchmarks, including NovelQA and Marathon, show that our approach outperforms strong baselines while reducing storage and runtime by over an order of magnitude.
zh

[NLP-204] Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data ACL2025

【速读】: 该论文试图解决在混合数据源(如文本和表格)上进行问答的问题,特别是在个人信息场景下,如何在保持用户数据本地化的同时提供高效、便捷的访问方式。解决方案的关键在于提出ReQAP方法,该方法通过递归分解生成可执行的操作符树,操作符设计旨在实现结构化与非结构化数据的无缝集成,从而生成可追溯的答案。

链接: https://arxiv.org/abs/2505.11900
作者: Philipp Christmann,Gerhard Weikum
机构: Max Planck Institute for Informatics (马克斯·普朗克信息研究所); Saarland Informatics Campus (萨尔兰信息学园区)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at ACL 2025 (Findings)

点击查看摘要

Abstract:Question answering over mixed sources, like text and tables, has been advanced by verbalizing all contents and encoding it with a language model. A prominent case of such heterogeneous data is personal information: user devices log vast amounts of data every day, such as calendar entries, workout statistics, shopping records, streaming history, and more. Information needs range from simple look-ups to queries of analytical nature. The challenge is to provide humans with convenient access with small footprint, so that all personal data stays on the user devices. We present ReQAP, a novel method that creates an executable operator tree for a given question, via recursive decomposition. Operators are designed to enable seamless integration of structured and unstructured sources, and the execution of the operator tree yields a traceable answer. We further release the PerQA benchmark, with persona-based data and questions, covering a diverse spectrum of realistic user needs.
zh

[NLP-205] RLAP: A Reinforcement Learning Enhanced Adaptive Planning Framework for Multi-step NLP Task Solving

【速读】: 该论文试图解决多步骤自然语言处理(NLP)任务中,现有方法因忽视实例的语言特征并依赖大语言模型(LLM)内在规划能力而导致的次优结果问题。解决方案的关键在于提出一种强化学习增强的自适应规划框架(Reinforcement Learning enhanced Adaptive Planning, RLAP),该框架将NLP任务建模为马尔可夫决策过程(Markov decision process, MDP),并通过强化学习训练一个轻量级的Actor模型来估计包含状态和动作的自然语言序列的Q值,从而在顺序规划过程中考虑每个序列的语言特征,并与LLM交互以确定每个任务实例的最优子任务顺序。

链接: https://arxiv.org/abs/2505.11893
作者: Zepeng Ding,Dixuan Wang,Ziqin Luo,Guochao Jiang,Deqing Yang,Jiaqing Liang
机构: Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-step planning has been widely employed to enhance the performance of large language models (LLMs) on downstream natural language processing (NLP) tasks, which decomposes the original task into multiple subtasks and guide LLMs to solve them sequentially without additional training. When addressing task instances, existing methods either preset the order of steps or attempt multiple paths at each step. However, these methods overlook instances’ linguistic features and rely on the intrinsic planning capabilities of LLMs to evaluate intermediate feedback and then select subtasks, resulting in suboptimal outcomes. To better solve multi-step NLP tasks with LLMs, in this paper we propose a Reinforcement Learning enhanced Adaptive Planning framework (RLAP). In our framework, we model an NLP task as a Markov decision process (MDP) and employ an LLM directly into the environment. In particular, a lightweight Actor model is trained to estimate Q-values for natural language sequences consisting of states and actions through reinforcement learning. Therefore, during sequential planning, the linguistic features of each sequence in the MDP can be taken into account, and the Actor model interacts with the LLM to determine the optimal order of subtasks for each task instance. We apply RLAP on three different types of NLP tasks and conduct extensive experiments on multiple datasets to verify RLAP’s effectiveness and robustness.
zh

[NLP-206] Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

【速读】: 该论文旨在解决现有基准测试在评估基于视觉语言模型(VLM)的移动代理时存在的局限性,包括难以获得稳定的奖励信号、无法体现GUI任务的多解特性以及缺乏对噪声环境和主动交互能力的评估。其解决方案的关键在于构建一个更真实且全面的基准测试——Mobile-Bench-v2,该基准包含通用任务划分、离线多路径评估、基于弹窗和广告应用的噪声划分以及包含预设问答交互的模糊指令划分,以全面评估移动代理的能力。

链接: https://arxiv.org/abs/2505.11891
作者: Weikai Xu,Zhizheng Jiang,Yuxuan Liu,Wei Liu,Jian Luan,Yuanchun Li,Yunxin Liu,Bin Wang,Bo An
机构: Nanyang Technological University (南洋理工大学); University of Electronic Science and Technology of China (中国电子科技大学); Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); XiaoMi AI Lab (小米人工智能实验室); Institute for AI Industry Research (AIR), Tsinghua University (清华大学人工智能产业研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:VLM-based mobile agents are increasingly popular due to their capabilities to interact with smartphone GUIs and XML-structured texts and to complete daily tasks. However, existing online benchmarks struggle with obtaining stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate the agents through single-path trajectories, which stands in contrast to the inherently multi-solution characteristics of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions due to a lack of noisy apps or overly full instructions during the evaluation process. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent’s ability to obtain step rewards during task execution. It contains a noisy split based on pop-ups and ads apps, and a contaminated split named AITZ-Noise to formulate a real noisy environment. Furthermore, an ambiguous instruction split with preset Q\A interactions is released to evaluate the agent’s proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, as well as other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at this https URL.
zh

[NLP-207] AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation

【速读】: 该论文旨在解决医疗领域大型语言模型(Large Language Models, LLMs)评估方法不足的问题,特别是传统评估指标如F1和ROUGE因依赖于词元重叠而忽视了医学术语的重要性,以及人工评估成本高且可能因专家知识和动机限制而产生误差。为了解决这些问题,作者提出了AutoMedEval,这是一个参数量为13B的开源自动评估模型,其关键在于采用分层训练方法,包括课程指导微调和迭代知识内省机制,使模型能够在有限的指令数据下获得专业的医学评估能力。

链接: https://arxiv.org/abs/2505.11887
作者: Xiechi Zhang,Zetian Ouyang,Linlin Wang,Gerard de Melo,Zhu Cao,Xiaoling Wang,Ya Zhang,Yanfeng Wang,Liang He
机构: East China Normal University (华东师范大学); Hasso Plattner Institute/University of Potsdam (哈索·普拉特纳研究所/波茨坦大学); Tongji University (同济大学); Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlaps to measure quality, significantly overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may as well suffer from inaccuracies due to limits in human expertise and motivation. Although there are some evaluation methods based on LLMs, their usability in the medical field is limited due to their proprietary nature or lack of expertise. To tackle these challenges, we present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.
zh

[NLP-208] NAMET: Robust Massive Model Editing via Noise-Aware Memory Optimization

【速读】: 该论文试图解决大规模知识更新场景下现有模型编辑技术效果下降的问题,特别是在实际评估指标或上下文丰富的设置中表现不佳。研究认为,问题的根源在于知识条目之间的嵌入冲突,这会削弱大规模编辑的可靠性。解决方案的关键在于提出NAMET(Noise-aware Model Editing in Transformers),该方法通过在记忆提取过程中引入噪声,对MEMIT进行一行代码的简单修改,从而提升编辑效果。

链接: https://arxiv.org/abs/2505.11876
作者: Yanbo Dai,Zhenlan Ji,Zongjie Li,Shuai Wang
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics or in context-rich settings. We attribute these failures to embedding collisions among knowledge items, which undermine editing reliability at scale. To address this, we propose NAMET (Noise-aware Model Editing in Transformers), a simple yet effective method that introduces noise during memory extraction via a one-line modification to MEMIT. Extensive experiments across six LLMs and three datasets demonstrate that NAMET consistently outperforms existing methods when editing thousands of facts.
zh

[NLP-209] J1: Exploring Simple Test-Time Scaling for LLM -as-a-Judge

【速读】: 该论文旨在解决当前AI系统评估质量不足的问题,特别是在传统评估方法中缺乏可解释性,导致用户难以理解奖励模型对输出评分的原因。其解决方案的关键在于引入一种基于大型语言模型作为评判者(LLM-as-a-Judge)的监督机制,并通过拒绝采样收集增强反思的数据集进行监督微调,随后使用可验证奖励进行强化学习(Reinforcement Learning, RL)训练。此外,在推理阶段应用简单的测试时缩放(Simple Test-Time Scaling, STTS)策略以进一步提升性能,从而实现更高的评估准确性和可解释性。

链接: https://arxiv.org/abs/2505.11875
作者: Chi-Min Chan,Chunpu Xu,Jiaming Ji,Zhen Ye,Pengcheng Wen,Chunyang Jiang,Yaodong Yang,Wei Xue,Sirui Han,Yike Guo
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 33 pages, 27 figures

点击查看摘要

Abstract:The current focus of AI research is shifting from emphasizing model training towards enhancing evaluation quality, a transition that is crucial for driving further advancements in AI systems. Traditional evaluation methods typically rely on reward models assigning scalar preference scores to outputs. Although effective, such approaches lack interpretability, leaving users often uncertain about why a reward model rates a particular response as high or low. The advent of LLM-as-a-Judge provides a more scalable and interpretable method of supervision, offering insights into the decision-making process. Moreover, with the emergence of large reasoning models, which consume more tokens for deeper thinking and answer refinement, scaling test-time computation in the LLM-as-a-Judge paradigm presents an avenue for further boosting performance and providing more interpretability through reasoning traces. In this paper, we introduce \textbfJ1-7B , which is first supervised fine-tuned on reflection-enhanced datasets collected via rejection-sampling and subsequently trained using Reinforcement Learning (RL) with verifiable rewards. At inference time, we apply Simple Test-Time Scaling (STTS) strategies for additional performance improvement. Experimental results demonstrate that \textbfJ1-7B surpasses the previous state-of-the-art LLM-as-a-Judge by \textbf4.8 % and exhibits a \textbf5.1 % stronger scaling trend under STTS. Additionally, we present three key findings: (1) Existing LLM-as-a-Judge does not inherently exhibit such scaling trend. (2) Model simply fine-tuned on reflection-enhanced datasets continues to demonstrate similarly weak scaling behavior. (3) Significant scaling trend emerges primarily during the RL phase, suggesting that effective STTS capability is acquired predominantly through RL training.
zh

[NLP-210] Fair-PP: A Synthetic Dataset for Aligning LLM with Personalized Preferences of Social Equity

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在个性化偏好建模中忽视个体差异与社会公平性的问题,尤其是现有数据集未能充分捕捉个性化与偏好之间的相关性。解决方案的关键在于构建一个名为Fair-PP的合成个性化偏好数据集,该数据集基于真实社会调查数据,涵盖28个社会群体、98个公平性主题及5个个人偏好维度,并通过GPT-4o-mini进行角色扮演生成大量偏好记录。此外,研究还提出了一种自动化生成偏好数据的框架、个性化偏好空间中的区域定位分析方法以及一种用于个性化偏好对齐的样本重加权策略。

链接: https://arxiv.org/abs/2505.11861
作者: Qi Zhou,Jie Zhang,Dongxia Wang,Qiang Liu,Tianlin Li,Jin Song Dong,Wenhai Wang,Qing Guo
机构: Zhejiang University (浙江大学); IHPC and CFAR, ASTAR (IHPC 和 CFAR,ASTAR); Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: under review

点击查看摘要

Abstract:Human preference plays a crucial role in the refinement of large language models (LLMs). However, collecting human preference feedback is costly and most existing datasets neglect the correlation between personalization and preferences. To address this issue, we introduce Fair-PP, a synthetic dataset of personalized preferences targeting social equity, derived from real-world social survey data, which includes 28 social groups, 98 equity topics, and 5 personal preference dimensions. Leveraging GPT-4o-mini, we engage in role-playing based on seven representative persona portrayals guided by existing social survey data, yielding a total of 238,623 preference records. Through Fair-PP, we also contribute (i) An automated framework for generating preference data, along with a more fine-grained dataset of personalized preferences; (ii) analysis of the positioning of the existing mainstream LLMs across five major global regions within the personalized preference space; and (iii) a sample reweighting method for personalized preference alignment, enabling alignment with a target persona while maximizing the divergence from other personas. Empirical experiments show our method outperforms the baselines.
zh

[NLP-211] When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

【速读】: 该论文试图解决如何利用生成式 AI (Generative AI) 作为验证者来自动化科学手稿的学术验证问题,而非传统上将其视为生成性合作者。解决方案的关键在于构建一个名为 SPOT 的数据集,该数据集包含 83 篇已发表论文及其对应的 91 个足以引发更正或撤稿的错误,并通过实际作者和人工标注者进行交叉验证,以评估 LLM 在识别这些错误上的能力。

链接: https://arxiv.org/abs/2505.11855
作者: Guijin Son,Jiwoo Hong,Honglu Fan,Heejeong Nam,Hyunwoo Ko,Seungwon Lim,Jinyeop Song,Jinha Choi,Gonçalo Paulo,Youngjae Yu,Stella Biderman
机构: OneLineAI; EleutherAI; KAIST AI; Boeing Korea; Yonsei University; MIT
类目: Computation and Language (cs.CL)
备注: work in progress

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work casts these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts. In this work, we explore a complementary application: using LLMs as verifiers to automate the \textbfacademic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions derived from misunderstandings. These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.
zh

[NLP-212] Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs

【速读】: 该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在面对视频-文本攻击时的安全性评估问题,现有研究主要关注静态图像输入下的模型脆弱性,而忽略了视频的时序动态特性可能带来的不同安全风险。解决方案的关键在于提出Video-SafetyBench,这是一个首个全面的基准测试平台,用于评估LVLMs在视频-文本攻击下的安全性,其核心包括2,264个涵盖48个细粒度不安全类别的视频-文本对,以及一种可控的视频生成流程,将视频语义分解为主观图像和运动文本以生成语义准确的视频,并引入RJScore这一基于大语言模型的度量标准,以有效评估不确定或边缘化有害输出。

链接: https://arxiv.org/abs/2505.11842
作者: Xuannan Liu,Zekun Li,Zheqi He,Peipei Li,Shuhan Xia,Xing Cui,Huaibo Huang,Xi Yang,Ran He
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Center for Research on Intelligent Perception and Computing, NLPR, CASIA (智能感知与计算研究中心,NLPR,中科院自动化所)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project page: this https URL

点击查看摘要

Abstract:The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
zh

[NLP-213] Multilingual Collaborative Defense for Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多语言场景下的安全性和鲁棒性问题,特别是针对通过将有害查询翻译成罕见或低资源语言来绕过模型防护的“越狱”攻击。其解决方案的关键在于提出一种名为多语言协同防御(Multilingual Collaborative Defense, MCD)的新学习方法,该方法通过自动优化连续的、软性的安全提示,实现对多语言环境下LLMs的有效防护。MCD能够提升多语言场景下的防护性能,保持强泛化能力并降低误拒率,同时缓解因训练语料不平衡导致的语言安全偏差。

链接: https://arxiv.org/abs/2505.11835
作者: Hongliang Li,Jinan Xu,Gengping Cui,Changhao Guan,Fengran Mo,Kaiyu Huang
机构: Beijing Jiaotong University (北京交通大学); Université de Montréal (蒙特利尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 4figures

点击查看摘要

Abstract:The robustness and security of large language models (LLMs) has become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of “jailbreaking” these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that optimizes a continuous, soft safety prompt automatically to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at this https URL.
zh

[NLP-214] Class Distillation with Mahalanobis Contrast: An Efficient Training Paradigm for Prag matic Language Understanding Tasks

【速读】: 该论文旨在解决从在线社交话语中检测异常语言(如性别歧视)或细微语言(如隐喻或讽刺)的问题,此类任务对于提升网络交流的安全性、清晰度和解释性至关重要。现有分类器虽在这些任务上表现良好,但通常计算成本高且数据需求大。该研究提出的解决方案为ClaD(Class Distillation),其关键在于从高度多样且异质的背景中蒸馏出一个小型、明确的目标类别。ClaD通过两个核心创新实现这一目标:一是基于马氏距离的结构属性设计损失函数,二是优化用于类别分离的可解释决策算法。实验结果表明,ClaD在多个基准任务中优于现有基线,并在使用较小的语言模型和参数量的情况下,达到与多个大语言模型相当的性能。

链接: https://arxiv.org/abs/2505.11829
作者: Chenlu Wang,Weimin Lyu,Ritwik Banerjee
机构: Stony Brook University (斯托尼布鲁克大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Detecting deviant language such as sexism, or nuanced language such as metaphors or sarcasm, is crucial for enhancing the safety, clarity, and interpretation of online social discourse. While existing classifiers deliver strong results on these tasks, they often come with significant computational cost and high data demands. In this work, we propose \textbfClass \textbfDistillation (ClaD), a novel training paradigm that targets the core challenge: distilling a small, well-defined target class from a highly diverse and heterogeneous background. ClaD integrates two key innovations: (i) a loss function informed by the structural properties of class distributions, based on Mahalanobis distance, and (ii) an interpretable decision algorithm optimized for class separation. Across three benchmark detection tasks – sexism, metaphor, and sarcasm – ClaD outperforms competitive baselines, and even with smaller language models and orders of magnitude fewer parameters, achieves performance comparable to several large language models (LLMs). These results demonstrate ClaD as an efficient tool for pragmatic language understanding tasks that require gleaning a small target class from a larger heterogeneous background.
zh

[NLP-215] Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning

【速读】: 该论文试图解决长链式思维(long chain-of-thought, CoT)在大型语言模型(large language models, LLMs)中推理效率低下的问题。现有方法对长CoT中的所有思维进行等量压缩,导致推理不够简洁和高效。解决方案的关键在于通过自动长CoT分块和蒙特卡洛回放分析不同思维的重要性,并提出一种理论上受约束的度量标准,以联合评估不同思维的有效性和效率。随后,论文提出Long ⊗ Short框架,利用两个LLMs协同解决问题:一个负责生成重要思维(长思维LLM),另一个负责高效生成剩余思维(短思维LLM)。该框架通过少量冷启动数据微调LLMs以适应不同的推理风格,并引入面向协同的多轮强化学习,促进模型自进化与协作。

链接: https://arxiv.org/abs/2505.11827
作者: Yansong Ning,Wei Li,Jun Fang,Naiqiang Tan,Hao Liu
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Didichuxing Co. Ltd(滴滴出行公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: In progress

点击查看摘要

Abstract:Compressing long chain-of-thought (CoT) from large language models (LLMs) is an emerging strategy to improve the reasoning efficiency of LLMs. Despite its promising benefits, existing studies equally compress all thoughts within a long CoT, hindering more concise and effective reasoning. To this end, we first investigate the importance of different thoughts by examining their effectiveness and efficiency in contributing to reasoning through automatic long CoT chunking and Monte Carlo rollouts. Building upon the insights, we propose a theoretically bounded metric to jointly measure the effectiveness and efficiency of different thoughts. We then propose Long \otimes Short, an efficient reasoning framework that enables two LLMs to collaboratively solve the problem: a long-thought LLM for more effectively generating important thoughts, while a short-thought LLM for efficiently generating remaining thoughts. Specifically, we begin by synthesizing a small amount of cold-start data to fine-tune LLMs for long-thought and short-thought reasoning styles, respectively. Furthermore, we propose a synergizing-oriented multi-turn reinforcement learning, focusing on the model self-evolution and collaboration between long-thought and short-thought LLMs. Experimental results show that our method enables Qwen2.5-7B and Llama3.1-8B to achieve comparable performance compared to DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, while reducing token length by over 80% across the MATH500, AIME24/25, AMC23, and GPQA Diamond benchmarks. Our data and code are available at this https URL.
zh

[NLP-216] Chain-of-Model Learning for Language Model

【速读】: 该论文试图解决传统Transformer架构在模型训练效率和部署灵活性方面的局限性,特别是在模型规模扩展和多尺度推理需求上的不足。其解决方案的关键在于提出一种新的学习范式——Chain-of-Model (CoM),通过将因果关系嵌入每层隐藏状态中并以链式结构组织,实现了模型训练的高效扩展和部署时的灵活推理。此外,论文引入了Chain-of-Representation (CoR)概念,将每层的隐藏状态表示为多个子表示的组合,从而支持通过增加链的数量逐步扩展模型规模,并在推理阶段根据不同的链数量提供多种子模型,增强了系统的弹性。

链接: https://arxiv.org/abs/2505.11820
作者: Kaitao Song,Xiaohua Wang,Xu Tan,Huiqiang Jiang,Chengruidong Zhang,Yongliang Shen,Cen LU,Zihao Li,Zifan Song,Caihua Shan,Yansen Wang,Kan Ren,Xiaoqing Zheng,Tao Qin,Yuqing Yang,Dongsheng Li,Lili Qiu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer as a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain from the output representations can only view all of its preceding chains in the input representations. Consequently, the model built upon CoM framework can progressively scale up the model size by increasing the chains based on the previous models (i.e., chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of Transformer architecture. Based on CoLM, we further introduce CoLM-Air by introducing a KV sharing mechanism, that computes all keys and values within the first chain and then shares across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration and so on. Experimental results demonstrate our CoLM family can achieve comparable performance to the standard Transformer, while simultaneously enabling greater flexiblity, such as progressive scaling to improve training efficiency and offer multiple varying model sizes for elastic inference, paving a a new way toward building language models. Our code will be released in the future at: this https URL.
zh

[NLP-217] VenusX: Unlocking Fine-Grained Functional Understanding of Proteins

【速读】: 该论文旨在解决蛋白质功能机制理解不足以及模型所捕获生物知识评估不充分的问题,特别是在残基、片段和结构域层面的细粒度功能注释与基于功能的蛋白质配对问题。其解决方案的关键在于提出VenusX,这是一个大规模基准数据集,涵盖了三种主要任务类别和六种类型的注释,包括残基级二分类、片段级多分类以及用于识别关键活性位点、结合位点、保守位点、基序、结构域和表位的功能相似性评分,并通过混合家族和跨家族划分在三个序列相似性阈值下评估模型性能,从而全面评估模型在分布内和分布外场景下的表现。

链接: https://arxiv.org/abs/2505.11812
作者: Yang Tan,Wenrui Gou,Bozitao Zhong,Liang Hong,Huiqun Yu,Bingxin Zhou
机构: Shanghai Jiao Tong University (上海交通大学); East China University of Science and Technology (华东理工大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
备注: 29 pages, 3 figures, 17 tables

点击查看摘要

Abstract:Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. To address this demand, we introduce VenusX, the first large-scale benchmark for fine-grained functional annotation and function-based protein pairing at the residue, fragment, and domain levels. VenusX comprises three major task categories across six types of annotations, including residue-level binary classification, fragment-level multi-class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains, and epitopes. The benchmark features over 878,000 samples curated from major open-source databases such as InterPro, BioLiP, and SAbDab. By providing mixed-family and cross-family splits at three sequence identity thresholds, our benchmark enables a comprehensive assessment of model performance on both in-distribution and out-of-distribution scenarios. For baseline evaluation, we assess a diverse set of popular and open-source models, including pre-trained protein language models, sequence-structure hybrids, structure-based methods, and alignment-based techniques. Their performance is reported across all benchmark datasets and evaluation settings using multiple metrics, offering a thorough comparison and a strong foundation for future research. Code and data are publicly available at this https URL.
zh

[NLP-218] BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering ACL2025

【速读】: 该论文旨在解决多跳问答(multi-hop QA)任务中因问题类型差异导致的模型性能不稳定问题,即不同类型的多跳问题对不同方法的敏感度存在差异。解决方案的关键在于提出一种基于问题类型与方法对应关系的双层级多智能体推理框架(Bi-levEL muLti-agEnt reasoning, BELLE),其中每种方法被视作一个“操作符”,通过不同的提示方式引导大语言模型(LLMs)进行协作与辩论,以生成综合性的执行计划,从而提升多跳QA的准确性和鲁棒性。

链接: https://arxiv.org/abs/2505.11811
作者: Taolin Zhang,Dongyang Li,Qizhou Chen,Chengyu Wang,Xiaofeng He
机构: Alibaba Group (阿里巴巴集团); East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ACL2025 main track

点击查看摘要

Abstract:Multi-hop question answering (QA) involves finding multiple relevant passages and performing step-by-step reasoning to answer complex questions. Previous works on multi-hop QA employ specific methods from different modeling perspectives based on large language models (LLMs), regardless of the question types. In this paper, we first conduct an in-depth analysis of public multi-hop QA benchmarks, dividing the questions into four types and evaluating five types of cutting-edge methods for multi-hop QA: Chain-of-Thought (CoT), Single-step, Iterative-step, Sub-step, and Adaptive-step. We find that different types of multi-hop questions have varying degrees of sensitivity to different types of methods. Thus, we propose a Bi-levEL muLti-agEnt reasoning (BELLE) framework to address multi-hop QA by specifically focusing on the correspondence between question types and methods, where each type of method is regarded as an ‘‘operator’’ by prompting LLMs differently. The first level of BELLE includes multiple agents that debate to obtain an executive plan of combined ‘‘operators’’ to address the multi-hop QA task comprehensively. During the debate, in addition to the basic roles of affirmative debater, negative debater, and judge, at the second level, we further leverage fast and slow debaters to monitor whether changes in viewpoints are reasonable. Extensive experiments demonstrate that BELLE significantly outperforms strong baselines in various datasets. Additionally, the model consumption of BELLE is higher cost-effectiveness than that of single models in more complex multi-hop QA scenarios.
zh

[NLP-219] Efficiently Building a Domain-Specific Large Language Model from Scratch: A Case Study of a Classical Chinese Large Language Model

【速读】: 该论文试图解决通用大语言模型在特定领域(如古典中文文本)中表现不佳的问题,以及开源基础模型微调难以有效融入领域知识的挑战。解决方案的关键在于设计一个专门针对古典中文理解与生成的大语言模型——AI Taiyan,并通过合理的模型架构、数据处理、基础训练和微调策略,在仅使用18亿参数的情况下取得了令人满意的性能,尤其在古典中文信息处理的关键任务中表现出显著优势。

链接: https://arxiv.org/abs/2505.11810
作者: Shen Li,Renfen Hu,Lijun Wang
机构: Beijing Normal University (北京师范大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:General-purpose large language models demonstrate notable capabilities in language comprehension and generation, achieving results that are comparable to, or even surpass, human performance in many language information processing tasks. Nevertheless, when general models are applied to some specific domains, e.g., Classical Chinese texts, their effectiveness is often unsatisfactory, and fine-tuning open-source foundational models similarly struggles to adequately incorporate domain-specific knowledge. To address this challenge, this study developed a large language model, AI Taiyan, specifically designed for understanding and generating Classical Chinese. Experiments show that with a reasonable model design, data processing, foundational training, and fine-tuning, satisfactory results can be achieved with only 1.8 billion parameters. In key tasks related to Classical Chinese information processing such as punctuation, identification of allusions, explanation of word meanings, and translation between ancient and modern Chinese, this model exhibits a clear advantage over both general-purpose large models and domain-specific traditional models, achieving levels close to or surpassing human baselines. This research provides a reference for the efficient construction of specialized domain-specific large language models. Furthermore, the paper discusses the application of this model in fields such as the collation of ancient texts, dictionary editing, and language research, combined with case studies.
zh

[NLP-220] Retrospex: Language Agent Meets Offline Reinforcement Learning Critic

【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)代理框架未能充分利用历史经验进行改进的问题。其关键解决方案是提出一种名为Retrospex的新型LLM代理框架,该框架通过深度分析历史经验来提升性能,而非直接将经验整合到LLM的上下文中。Retrospex结合了LLM的动作概率与由强化学习(Reinforcement Learning, RL)评论家估计的动作价值,该评论家通过离线“回顾”过程基于历史经验进行训练,并引入动态动作重评分机制,以增强需要更多环境交互任务中的经验价值重要性。

链接: https://arxiv.org/abs/2505.11807
作者: Yufei Xiang,Yiqun Shen,Yeqin Zhang,Cam-Tu Nguyen
机构: Nanjing University (南京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages

点击查看摘要

Abstract:Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM’s context. Instead, it combines the LLM’s action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline ‘‘retrospection’’ process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.
zh

[NLP-221] Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors ICML2025

【速读】: 该论文试图解决如何利用神经网络内部机制的可解释性技术来预测模型在分布外(out-of-distribution)示例上的行为问题。其解决方案的关键在于识别那些在模型行为中具有独特因果作用的特征,并基于这些因果机制提出两种方法:反事实模拟(counterfactual simulation)和值探测(value probing),二者均能有效提升模型输出正确性的预测性能,尤其在分布外设置中表现优于依赖非因果特征的方法。

链接: https://arxiv.org/abs/2505.11770
作者: Jing Huang,Junyi Tao,Thomas Icard,Diyi Yang,Christopher Potts
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: ICML 2025

点击查看摘要

Abstract:Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks–including symbol manipulation, knowledge retrieval, and instruction following–we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model’s behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
zh

[NLP-222] owards Universal Semantics With Large Language Models

【速读】: 该论文试图解决如何高效生成自然语义元语言(NSM)的解释(explications)问题,这一过程传统上依赖于耗时的人工方法。解决方案的关键在于利用大型语言模型(LLMs)自动生成准确且跨语言可翻译的NSM解释,并通过引入自动评估方法、定制的数据集以及微调模型来提升生成质量。实验表明,所提出的1B和8B模型在生成性能上优于GPT-4o,为实现通用语义表示提供了重要进展。

链接: https://arxiv.org/abs/2505.11764
作者: Raymond Baartmans,Matthew Raffel,Rahul Vikram,Aiden Deringer,Lizhong Chen
机构: Oregon State University (俄勒冈州立大学); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond.
zh

[NLP-223] Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

【速读】: 该论文试图解决稀疏自编码器(Sparse Autoencoder, SAE)在解码语言模型(LLM)激活时出现的特征混杂问题,即特征对冲(feature hedging)现象。该现象导致SAE无法正确分解多义激活为可解释的单一语义方向,从而影响其性能。解决方案的关键在于理解特征对冲是由SAE重构损失引起的,并且在SAE宽度较小时更为严重。通过理论分析和实验验证,作者提出了改进的马特罗什卡(Matryoshka)SAE变体,以缓解这一问题。

链接: https://arxiv.org/abs/2505.11756
作者: David Chanin,Tomáš Dulka,Adrià Garriga-Alonso
机构: University College London (伦敦大学学院); FAR AI (FAR AI)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is more narrow than the number of underlying “true features” on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Our work shows there remain fundamental issues with SAEs, but we are hopeful that that highlighting feature hedging will catalyze future advances that allow SAEs to achieve their full potential of interpreting LLMs at scale.
zh

[NLP-224] Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation ACL2025

【速读】: 该论文旨在解决多跳问答(Multi-hop Question Answering, MHQA)任务中语言模型(Language Models, LMs)在处理复杂上下文和跨信息源推理时表现受限的问题。其解决方案的关键在于通过调整检索文档的顺序以及改进因果掩码以引入双向注意力机制,从而提升模型在MHQA任务中的性能。研究发现,编码器-解码器模型在MHQA任务中优于因果解码器模型,且文档顺序与推理链顺序一致时表现最佳,同时修改因果掩码以增强双向注意力可有效提升解码器模型的性能。

链接: https://arxiv.org/abs/2505.11754
作者: Wenyu Huang,Pavlos Vougiouklis,Mirella Lapata,Jeff Z. Pan
机构: School of Informatics, University of Edinburgh (信息学院,爱丁堡大学); Huawei Edinburgh Research Centre, Poisson Lab, CSI, UK (华为爱丁堡研究中心,泊松实验室,CSI,英国)
类目: Computation and Language (cs.CL)
备注: ACL 2025 main

点击查看摘要

Abstract:Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals interesting findings as follows: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs’ performance on this task. Our code is publicly available at this https URL.
zh

[NLP-225] oken Masking Improves Transformer-Based Text Classification

【速读】: 该论文试图解决的是如何进一步提升基于Transformer的文本分类模型的性能,具体而言是探索通过输入标记遮蔽(token masking)是否能够增强模型的有效性。其解决方案的关键在于提出了一种称为“标记遮蔽正则化(token masking regularization)”的方法,该方法在训练过程中以概率p随机将输入标记替换为特殊标记[MASK],从而引入随机扰动,促使模型捕捉更深层次的词间依赖关系。实验结果表明,该方法在语言识别和情感分析任务中均优于传统正则化技术。

链接: https://arxiv.org/abs/2505.11746
作者: Xianglong Xu,John Bowen,Rojin Taheri
机构: University of Pittsburgh (匹兹堡大学); School of Computing and Information (计算机与信息学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While transformer-based models achieve strong performance on text classification, we explore whether masking input tokens can further enhance their effectiveness. We propose token masking regularization, a simple yet theoretically motivated method that randomly replaces input tokens with a special [MASK] token at probability p. This introduces stochastic perturbations during training, leading to implicit gradient averaging that encourages the model to capture deeper inter-token dependencies. Experiments on language identification and sentiment analysis – across diverse models (mBERT, Qwen2.5-0.5B, TinyLlama-1.1B) – show consistent improvements over standard regularization techniques. We identify task-specific optimal masking rates, with p = 0.1 as a strong general default. We attribute the gains to two key effects: (1) input perturbation reduces overfitting, and (2) gradient-level smoothing acts as implicit ensembling.
zh

[NLP-226] ZeroTuning: Unlocking the Initial Tokens Power to Enhance Large Language Models Without Training

【速读】: 该论文试图解决如何在不进行模型训练的情况下提升大语言模型(Large Language Models, LLMs)性能的问题,特别是针对现有方法依赖辅助机制识别任务相关或无关的token所带来的潜在偏差和适用性限制。论文的核心贡献在于发现语义空的初始token是一个强大且未被充分探索的控制点,通过调整其注意力机制,可以有效优化模型行为。解决方案的关键在于利用该初始token作为注意力汇聚点,通过头特定的注意力调整策略实现对模型行为的优化,从而在多个任务中显著提升模型性能。

链接: https://arxiv.org/abs/2505.11739
作者: Feijiang Han,Xiaodong Yu,Jianheng Tang,Lyle Ungar
机构: University of Pennsylvania(宾夕法尼亚大学); AMD(超微半导体); Peking University(北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, training-free methods for improving large language models (LLMs) have attracted growing interest, with token-level attention tuning emerging as a promising and interpretable direction. However, existing methods typically rely on auxiliary mechanisms to identify important or irrelevant task-specific tokens, introducing potential bias and limiting applicability. In this paper, we uncover a surprising and elegant alternative: the semantically empty initial token is a powerful and underexplored control point for optimizing model behavior. Through theoretical analysis, we show that tuning the initial token’s attention sharpens or flattens the attention distribution over subsequent tokens, and its role as an attention sink amplifies this effect. Empirically, we find that: (1) tuning its attention improves LLM performance more effectively than tuning other task-specific tokens; (2) the effect follows a consistent trend across layers, with earlier layers having greater impact, but varies across attention heads, with different heads showing distinct preferences in how they attend to this token. Based on these findings, we propose ZeroTuning, a training-free approach that improves LLM performance by applying head-specific attention adjustments to this special token. Despite tuning only one token, ZeroTuning achieves higher performance on text classification, multiple-choice, and multi-turn conversation tasks across models such as Llama, Qwen, and DeepSeek. For example, ZeroTuning improves Llama-3.1-8B by 11.71% on classification, 2.64% on QA tasks, and raises its multi-turn score from 7.804 to 7.966. The method is also robust to limited resources, few-shot settings, long contexts, quantization, decoding strategies, and prompt variations. Our work sheds light on a previously overlooked control point in LLMs, offering new insights into both inference-time tuning and model interpretability.
zh

[NLP-227] oken-Level Uncertainty Estimation for Large Language Model Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在不同应用场景中输出质量不一致的问题,特别是在需要多步骤推理的复杂任务中,难以识别可信响应。其解决方案的关键在于提出一种基于标记级别的不确定性估计框架,通过在LLM解码过程中引入低秩随机权重扰动,生成预测分布以估计每个标记的不确定性,并进一步聚合这些不确定性以反映生成序列的语义不确定性。该方法有效提升了模型在数学推理任务中的准确性和鲁棒性。

链接: https://arxiv.org/abs/2505.11737
作者: Tunyu Zhang,Haizhou Shi,Yibin Wang,Hengyi Wang,Xiaoxiao He,Zhuowei Li,Haoxian Chen,Ligong Han,Kai Xu,Huan Zhang,Dimitris Metaxas,Hao Wang
机构: Rutgers University ( Rutgers University); University of Illinois Urbana-Champaign (University of Illinois Urbana-Champaign); Amazon (Amazon); Red Hat AI Innovation (Red Hat AI Innovation); Fordham University (Fordham University)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint; Work in progress

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a token-level uncertainty estimation framework to enable LLMs to self-assess and self-improve their generation quality in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation to LLM decoding, generating predictive distributions that we use to estimate token-level uncertainties. We then aggregate these uncertainties to reflect semantic uncertainty of the generated sequences. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that our token-level uncertainty metrics strongly correlate with answer correctness and model robustness. Additionally, we explore using uncertainty to directly enhance the model’s reasoning performance through multiple generations and the particle filtering algorithm. Our approach consistently outperforms existing uncertainty estimation methods, establishing effective uncertainty estimation as a valuable tool for both evaluating and improving reasoning generation in LLMs.
zh

[NLP-228] MedCaseReasoning : Evaluating and learning diagnostic reasoning from clinical case reports

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在医学诊断任务中仅关注最终答案的准确性,而忽视了临床推理过程的质量与忠实性的问题。现有医疗基准如MedQA和MMLU仅评估最终答案的正确性,未能全面反映模型的临床推理能力。为解决这一问题,作者提出了MedCaseReasoning数据集,这是首个用于评估LLMs与临床医生诊断推理对齐情况的开源数据集,其关键在于通过包含14,489个诊断问答对及其来自开放获取医学病例报告的详细推理陈述,全面评估模型的诊断准确性和推理忠实性。

链接: https://arxiv.org/abs/2505.11733
作者: Kevin Wu,Eric Wu,Rahul Thapa,Kevin Wei,Angela Zhang,Arvind Suresh,Jacqueline J. Tao,Min Woo Sun,Alejandro Lozano,James Zou
机构: Stanford University (斯坦福大学); University of Southern California (南加州大学); University of California, San Francisco (加利福尼亚大学旧金山分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at this https URL.
zh

[NLP-229] Efficient Uncertainty Estimation via Distillation of Bayesian Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在下游任务适应过程中不确定性估计的效率问题,特别是传统贝叶斯方法在推理阶段需要多次采样迭代所导致的计算开销过大问题。其解决方案的关键在于通过知识蒸馏(distillation)技术,将预训练贝叶斯LLM的置信度信息转移到非贝叶斯学生模型中,通过最小化两者预测分布之间的差异,从而在不依赖额外验证数据集的情况下实现高效的不确定性估计。该方法在测试阶段可实现N倍的效率提升,其中N为传统贝叶斯LLM所需的样本数量。

链接: https://arxiv.org/abs/2505.11731
作者: Harshil Vejendla,Haizhou Shi,Yibin Wang,Tunyu Zhang,Huan Zhang,Hao Wang
机构: Rutgers University; University of Illinois Urbana-Champaign (UIUC); Red Hat AI Innovation; Fordham University
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint; work in progress

点击查看摘要

Abstract:Recent advances in uncertainty estimation for Large Language Models (LLMs) during downstream adaptation have addressed key challenges of reliability and simplicity. However, existing Bayesian methods typically require multiple sampling iterations during inference, creating significant efficiency issues that limit practical deployment. In this paper, we investigate the possibility of eliminating the need for test-time sampling for LLM uncertainty estimation. Specifically, when given an off-the-shelf Bayesian LLM, we distill its aligned confidence into a non-Bayesian student LLM by minimizing the divergence between their predictive distributions. Unlike typical calibration methods, our distillation is carried out solely on the training dataset without the need of an additional validation dataset. This simple yet effective approach achieves N-times more efficient uncertainty estimation during testing, where N is the number of samples traditionally required by Bayesian LLMs. Our extensive experiments demonstrate that uncertainty estimation capabilities on training data can successfully generalize to unseen test data through our distillation technique, consistently producing results comparable to (or even better than) state-of-the-art Bayesian LLMs.
zh

[NLP-230] Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures ACL2025

【速读】: 该论文旨在解决多模态指代消解(multimodal reference resolution)中的指代关系理解问题,特别是在对话中处理由代词和省略引起的歧义。其解决方案的关键在于通过将指代项嵌入映射到物体嵌入,并基于相似性选择指代项或物体,从而统一文本和多模态的指代消解任务。实验表明,学习文本层面的指代消解(如共指消解和谓词-论元结构分析)能够提升多模态指代消解的性能,尤其在代词短语的视觉定位任务中优于现有模型MDETR和GLIP。

链接: https://arxiv.org/abs/2505.11726
作者: Shun Inadumi,Nobuhiro Ueda,Koichiro Yoshino
机构: Nara Institute of Science and Technology(奈良先端科学技術大学院大学); Guardian Robot Project, RIKEN(理化学研究所守护机器人项目); Kyoto University(京都大学); Institute of Science Tokyo(东京科学大学)
类目: Computation and Language (cs.CL)
备注: ACL2025 main. Code available at this https URL

点击查看摘要

Abstract:Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with coreference resolution performs better in pronoun phrase grounding than representative models for this task, MDETR and GLIP. Our qualitative analysis demonstrates that incorporating textual reference relations strengthens the confidence scores between mentions, including pronouns and predicates, and objects, which can reduce the ambiguities that arise in visually grounded dialogues.
zh

[NLP-231] EnvInjection: Environmental Prompt Injection Attack to Multi-modal Web Agents

【速读】: 该论文试图解决环境提示注入攻击(environmental prompt injection attacks)在多模态大语言模型(Multi-modal large language model, MLLM)驱动的网络代理中存在效果有限、隐蔽性差或在现实场景中不可行的问题。其解决方案的关键在于通过向渲染后的网页原始像素值添加扰动,并利用神经网络近似像素值与截图之间的非可微映射关系,从而实现对网络代理的目标动作诱导。该方法将扰动寻找问题建模为优化问题,并采用投影梯度下降法进行求解,从而有效提升了攻击的效果和隐蔽性。

链接: https://arxiv.org/abs/2505.11717
作者: Xilong Wang,John Bloch,Zedian Shao,Yuepeng Hu,Shuyan Zhou,Neil Zhenqiang Gong
机构: Duke University (杜克大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal large language model (MLLM)-based web agents interact with webpage environments by generating actions based on screenshots of the webpages. Environmental prompt injection attacks manipulate the environment to induce the web agent to perform a specific, attacker-chosen action–referred to as the target action. However, existing attacks suffer from limited effectiveness or stealthiness, or are impractical in real-world settings. In this work, we propose EnvInjection, a new attack that addresses these limitations. Our attack adds a perturbation to the raw pixel values of the rendered webpage, which can be implemented by modifying the webpage’s source code. After these perturbed pixels are mapped into a screenshot, the perturbation induces the web agent to perform the target action. We formulate the task of finding the perturbation as an optimization problem. A key challenge in solving this problem is that the mapping between raw pixel values and screenshot is non-differentiable, making it difficult to backpropagate gradients to the perturbation. To overcome this, we train a neural network to approximate the mapping and apply projected gradient descent to solve the reformulated optimization problem. Extensive evaluation on multiple webpage datasets shows that EnvInjection is highly effective and significantly outperforms existing baselines.
zh

[NLP-232] Hierarchical Bracketing Encodings for Dependency Parsing as Tagging ACL2025

【速读】: 该论文试图解决序列标注依赖句法分析中的编码效率问题,旨在减少用于编码树结构的标签数量并提升对非投射性的支持能力。解决方案的关键在于提出一种最优的分层括号编码方法,该方法通过最小化符号数量,在仅使用12个不同标签的情况下编码投射树(相比现有的4位投射编码使用的16个标签更为高效),并且能够以更紧凑的方式支持任意非投射性。

链接: https://arxiv.org/abs/2505.11693
作者: Ana Ezquerro,David Vilares,Anssi Yli-Jyrä,Carlos Gómez-Rodríguez
机构: Universidade da Coruña, CITIC (@udc.es); Tampere University, Math. Res. Centre, Computing Sciences (@tuni.fi); University of Helsinki, Faculty of Arts, Digital Humanities (@helsinki.fi)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025. Original submission; camera-ready coming soon

点击查看摘要

Abstract:We present a family of encodings for sequence labeling dependency parsing, based on the concept of hierarchical bracketing. We prove that the existing 4-bit projective encoding belongs to this family, but it is suboptimal in the number of labels used to encode a tree. We derive an optimal hierarchical bracketing, which minimizes the number of symbols used and encodes projective trees using only 12 distinct labels (vs. 16 for the 4-bit encoding). We also extend optimal hierarchical bracketing to support arbitrary non-projectivity in a more compact way than previous encodings. Our new encodings yield competitive accuracy on a diverse set of treebanks.
zh

[NLP-233] Automatic Speech Recognition for African Low-Resource Languages: Challenges and Future Directions

【速读】: 该论文旨在解决非洲低资源语言在自动语音识别(ASR)技术中研究和应用严重不足的问题,其核心挑战包括数据稀缺性、语言复杂性、计算资源有限、声学变异性以及与偏见和隐私相关的伦理问题。论文提出的解决方案关键在于采用社区驱动的数据收集、自监督与多语言学习、轻量级模型架构以及隐私优先的技术,并通过定制化方案如基于词素的建模和特定领域(如医疗和教育)的ASR应用来提升技术可行性与影响力。研究强调了跨学科合作和持续投资的重要性,以应对非洲独特的语言和基础设施挑战。

链接: https://arxiv.org/abs/2505.11690
作者: Sukairaj Hafiz Imam,Babangida Sani,Dawit Ketema Gete,Bedru Yimam Ahamed,Ibrahim Said Ahmad,Idris Abdulmumin,Seid Muhie Yimam,Muhammad Yahuza Bello,Shamsuddeen Hassan Muhammad
机构: Bayero University (贝拉罗大学); Kalinga University (卡林加大学); Debre Birhan University (德布雷比兰大学); Wollo University (沃洛大学); Northeastern University (东北大学); University of Pretoria (普利托利亚大学); University of Hamburg (汉堡大学); Imperial College London (帝国理工学院); EthioNLP (埃塞俄比亚自然语言处理); HausaNLP (豪萨自然语言处理)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) technologies have transformed human-computer interaction; however, low-resource languages in Africa remain significantly underrepresented in both research and practical applications. This study investigates the major challenges hindering the development of ASR systems for these languages, which include data scarcity, linguistic complexity, limited computational resources, acoustic variability, and ethical concerns surrounding bias and privacy. The primary goal is to critically analyze these barriers and identify practical, inclusive strategies to advance ASR technologies within the African context. Recent advances and case studies emphasize promising strategies such as community-driven data collection, self-supervised and multilingual learning, lightweight model architectures, and techniques that prioritize privacy. Evidence from pilot projects involving various African languages showcases the feasibility and impact of customized solutions, which encompass morpheme-based modeling and domain-specific ASR applications in sectors like healthcare and education. The findings highlight the importance of interdisciplinary collaboration and sustained investment to tackle the distinct linguistic and infrastructural challenges faced by the continent. This study offers a progressive roadmap for creating ethical, efficient, and inclusive ASR systems that not only safeguard linguistic diversity but also improve digital accessibility and promote socioeconomic participation for speakers of African languages.
zh

[NLP-234] Evaluating Design Decisions for Dual Encoder-based Entity Disambiguation ACL2025

【速读】: 该论文旨在解决实体消歧(Entity Disambiguation, ED)问题,即在文本中将提及的实体链接到知识库中的正确条目。其解决方案的关键在于评估基于双编码器(Dual Encoder)架构的ED系统中的关键设计决策,包括损失函数、相似性度量、标签表述格式以及负采样策略。研究提出了一种名为VerbalizED的模型,该模型采用上下文感知的标签表述和高效的硬负样本采样方法,并探索了迭代预测变体以提升复杂数据点的消歧效果。实验结果表明,该方法在AIDA-Yago数据集上表现出色,并在ZELDA基准测试中达到了新的最先进水平。

链接: https://arxiv.org/abs/2505.11683
作者: Susanna Rücker,Alan Akbik
机构: Humboldt-Universität zu Berlin (洪堡大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2025 (The 63rd Annual Meeting of the Association for Computational Linguistics)

点击查看摘要

Abstract:Entity disambiguation (ED) is the task of linking mentions in text to corresponding entries in a knowledge base. Dual Encoders address this by embedding mentions and label candidates in a shared embedding space and applying a similarity metric to predict the correct label. In this work, we focus on evaluating key design decisions for Dual Encoder-based ED, such as its loss function, similarity metric, label verbalization format, and negative sampling strategy. We present the resulting model VerbalizED, a document-level Dual Encoder model that includes contextual label verbalizations and efficient hard negative sampling. Additionally, we explore an iterative prediction variant that aims to improve the disambiguation of challenging data points. Comprehensive experiments on AIDA-Yago validate the effectiveness of our approach, offering insights into impactful design choices that result in a new State-of-the-Art system on the ZELDA benchmark.
zh

[NLP-235] Ambiguity Resolution in Text-to-Structured Data Mapping

【速读】: 该论文试图解决自然语言中的歧义问题,这一问题阻碍了大型语言模型(Large Language Models, LLMs)在文本到结构化数据映射任务中的准确性,特别是在文本到代理工具调用和文本到SQL查询等任务中。论文提出的解决方案的关键在于通过分析歧义文本在潜在空间中的表示差异,并利用这种差异在映射到结构化数据之前识别歧义。为此,作者提出了一种新的距离度量方法,该方法基于稀疏自编码器(Sparse Autoencoder, SAE)下每个概念的梯度值积分计算路径核,以捕捉潜在空间中因概念缺失导致的歧义,并据此识别歧义问题。

链接: https://arxiv.org/abs/2505.11679
作者: Zhibo Hu,Chen Wang,Yanfeng Shu,Hye-Young Paik,Liming Zhu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 11 figures

点击查看摘要

Abstract:Ambiguity in natural language is a significant obstacle for achieving accurate text to structured data mapping through large language models (LLMs), which affects the performance of tasks such as mapping text to agentic tool calling and text-to-SQL queries. Existing methods of ambiguity handling either exploit ReACT framework to produce the correct mapping through trial and error, or supervised fine tuning to guide models to produce a biased mapping to improve certain tasks. In this paper, we adopt a different approach that characterizes the representation difference of ambiguous text in the latent space and leverage the difference to identify ambiguity before mapping them to structured data. To detect ambiguity of a sentence, we focused on the relationship between ambiguous questions and their interpretations and what cause the LLM ignore multiple interpretations. Different to the distance calculated by dense embedding vectors, we utilize the observation that ambiguity is caused by concept missing in latent space of LLM to design a new distance measurement, computed through the path kernel by the integral of gradient values for each concepts from sparse-autoencoder (SAE) under each state. We identify patterns to distinguish ambiguous questions with this measurement. Based on our observation, We propose a new framework to improve the performance of LLMs on ambiguous agentic tool calling through missing concepts prediction.
zh

[NLP-236] Multilingual Prompt Engineering in Large Language Models : A Survey Across NLP Tasks

【速读】: 该论文试图解决多语言环境下大型语言模型(Large Language Models, LLMs)有效性不足的问题,特别是如何在不进行大量参数重新训练或微调的情况下提升其跨语言任务的表现。解决方案的关键在于多语言提示工程(Multilingual Prompt Engineering),通过设计结构化的自然语言提示,从LLMs中提取知识,从而增强其在多种语言和任务中的能力。这种方法为非机器学习专业背景的研究者提供了更易访问的途径,以利用LLMs的强大功能。

链接: https://arxiv.org/abs/2505.11665
作者: Shubham Vatsal,Harsh Dubey,Aditi Singh
机构: New York University, CIMS (纽约大学,计算与信息科学研究所); Cleveland State University (克利夫兰州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive performance across a wide range of Natural Language Processing (NLP) tasks. However, ensuring their effectiveness across multiple languages presents unique challenges. Multilingual prompt engineering has emerged as a key approach to enhance LLMs’ capabilities in diverse linguistic settings without requiring extensive parameter re-training or fine-tuning. With growing interest in multilingual prompt engineering over the past two to three years, researchers have explored various strategies to improve LLMs’ performance across languages and NLP tasks. By crafting structured natural language prompts, researchers have successfully extracted knowledge from LLMs across different languages, making these techniques an accessible pathway for a broader audience, including those without deep expertise in machine learning, to harness the capabilities of LLMs. In this paper, we survey and categorize different multilingual prompting techniques based on the NLP tasks they address across a diverse set of datasets that collectively span around 250 languages. We further highlight the LLMs employed, present a taxonomy of approaches and discuss potential state-of-the-art (SoTA) methods for specific multilingual datasets. Additionally, we derive a range of insights across language families and resource levels (high-resource vs. low-resource), including analyses such as the distribution of NLP tasks by language resource type and the frequency of prompting methods across different language families. Our survey reviews 36 research papers covering 39 prompting techniques applied to 30 multilingual NLP tasks, with the majority of these studies published in the last two years.
zh

[NLP-237] Can an Easy-to-Hard Curriculum Make Reasoning Emerge in Small Language Models? Evidence from a Four-Stage Curriculum on GPT -2

【速读】: 该论文试图解决小语言模型(Small Language Models, SLMs)在推理透明性和样本效率方面存在的不足,其解决方案的关键在于采用一种发育有序的课程学习(developmentally ordered curriculum)。通过在四个阶段的课程中逐步提升任务难度,从词汇匹配到多步骤符号推理,Cognivolve模型在无需任何任务特定微调的情况下,显著提升了优化步骤效率、梯度显著推理头的激活数量以及注意力机制的熵值,从而实现了更好的上下文平衡。实验表明,进度顺序而非额外计算量是产生这些改进的核心因素。

链接: https://arxiv.org/abs/2505.11643
作者: Xiang Fu
机构: Boston University (波士顿大学); Modularium Research
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We demonstrate that a developmentally ordered curriculum markedly improves reasoning transparency and sample-efficiency in small language models (SLMs). Concretely, we train Cognivolve, a 124 M-parameter GPT-2 model, on a four-stage syllabus that ascends from lexical matching to multi-step symbolic inference and then evaluate it without any task-specific fine-tuning. Cognivolve reaches target accuracy in half the optimization steps of a single-phase baseline, activates an order-of-magnitude more gradient-salient reasoning heads, and shifts those heads toward deeper layers, yielding higher-entropy attention that balances local and long-range context. The same curriculum applied out of order or with optimizer resets fails to reproduce these gains, confirming that progression–not extra compute–drives the effect. We also identify open challenges: final-answer success still lags a conventional run by about 30%, and our saliency probe under-detects verbal-knowledge heads in the hardest stage, suggesting directions for mixed-stage fine-tuning and probe expansion.
zh

[NLP-238] Critique-Guided Distillation: Improving Supervised Fine-tuning via Better Distillation NEURIPS2025

【速读】: 该论文试图解决监督微调(Supervised Fine-Tuning, SFT)中常见的模仿问题(imitation problem),即模型仅学习复制正确的响应而未理解其背后的逻辑。解决方案的关键在于提出一种名为Critique-Guided Distillation (CGD)的多阶段框架,该框架将教师模型生成的解释性批评(explanatory critiques)和优化后的响应(refined responses)整合到SFT过程中,使学生模型能够同时学习“模仿什么”和“为什么模仿”。通过熵分析,作者证明CGD可以减少优化不确定性,并可被解释为贝叶斯后验更新。实验结果表明,CGD在数学和语言理解任务中均取得了显著提升。

链接: https://arxiv.org/abs/2505.11628
作者: Berkcan Kapusuzoglu,Supriyo Chakraborty,Chia-Hsuan Lee,Sambit Sahu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Submitted to NeurIPS 2025

点击查看摘要

Abstract:Supervised fine-tuning (SFT) using expert demonstrations often suffer from the imitation problem, where the model learns to reproduce the correct responses without \emphunderstanding the underlying rationale. To address this limitation, we propose \textscCritique-Guided Distillation (CGD), a novel multi-stage framework that integrates teacher model generated \emphexplanatory critiques and \emphrefined responses into the SFT process. A student model is then trained to map the triplet of prompt, teacher critique, and its own initial response to the corresponding refined teacher response, thereby learning both \emphwhat to imitate and \emphwhy. Using entropy-based analysis, we show that \textscCGD reduces refinement uncertainty and can be interpreted as a Bayesian posterior update. We perform extensive empirical evaluation of \textscCGD, on variety of benchmark tasks, and demonstrate significant gains on both math (AMC23 +17.5%) and language understanding tasks (MMLU-Pro +6.3%), while successfully mitigating the format drift issues observed in previous critique fine-tuning (CFT) techniques.
zh

[NLP-239] HELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering

【速读】: 该论文试图解决基于检索增强生成(Retrieval Augmented Generation, RAG)的问答(QA)应用在端到端评估中的挑战,特别是缺乏标注数据或参考答案的情况下如何进行全面、细粒度的评估。解决方案的关键在于提出THELMA(Task Based Holistic Evaluation of Large Language Model Applications)框架,该框架包含六个相互关联的指标,旨在对RAG QA系统进行整体评估,并通过分析指标间的相互作用识别需要改进的具体RAG组件。

链接: https://arxiv.org/abs/2505.11626
作者: Udita Patel,Rutu Mulkar,Jay Roberts,Cibi Chakravarthy Senthilkumar,Sujay Gandhi,Xiaofei Zheng,Naumaan Nayyar,Rafael Castrillo
机构: Amazon.com Services Inc. (亚马逊公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference free framework for RAG (Retrieval Augmented generation) based question answering (QA) applications. THELMA consist of six interdependent metrics specifically designed for holistic, fine grained evaluation of RAG QA applications. THELMA framework helps developers and application owners evaluate, monitor and improve end to end RAG QA pipelines without requiring labelled sources or reference this http URL also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.
zh

[NLP-240] Steering Risk Preferences in Large Language Models by Aligning Behavioral and Neural Representations

【速读】: 该论文试图解决如何系统地识别用于改变大型语言模型(Large Language Models, LLMs)行为的“转向向量”(steering vectors)的问题。解决方案的关键在于通过将行为方法(具体为基于LLM的马尔可夫链蒙特卡洛方法)引发的潜在表征与对应的神经表征进行对齐,从而发现有效的转向向量。这种方法无需重新训练或微调模型,即可通过调整内部神经激活来有针对性地影响模型行为。

链接: https://arxiv.org/abs/2505.11615
作者: Jian-Qiao Zhu,Haijiang Yan,Thomas L. Griffiths
机构: Princeton University (普林斯顿大学); University of Warwick (华威大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Changing the behavior of large language models (LLMs) can be as straightforward as editing the Transformer’s residual streams using appropriately constructed “steering vectors.” These modifications to internal neural activations, a form of representation engineering, offer an effective and targeted means of influencing model behavior without retraining or fine-tuning the model. But how can such steering vectors be systematically identified? We propose a principled approach for uncovering steering vectors by aligning latent representations elicited through behavioral methods (specifically, Markov chain Monte Carlo with LLMs) with their neural counterparts. To evaluate this approach, we focus on extracting latent risk preferences from LLMs and steering their risk-related outputs using the aligned representations as steering vectors. We show that the resulting steering vectors successfully and reliably modulate LLM outputs in line with the targeted behavior.
zh

[NLP-241] Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions

【速读】: 该论文试图解决传统神经网络模型在认知建模中预测性能强但缺乏可解释性的问题(the problem of strong predictive performance but lack of interpretability in traditional neural network models)。其解决方案的关键在于利用预训练的大语言模型(pretrained large language models, LLMs)作为双用途的认知模型,通过基于结果的奖励进行强化学习,引导模型生成用于解释人类风险决策的显式推理轨迹,从而实现准确预测与自然语言可解释性相结合。

链接: https://arxiv.org/abs/2505.11614
作者: Jian-Qiao Zhu,Hanbo Xie,Dilip Arumugam,Robert C. Wilson,Thomas L. Griffiths
机构: Princeton University (普林斯顿大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models–capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces for explaining human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.
zh

[NLP-242] MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在遵循结构化临床指南(decision trees)方面可靠性不足的问题,特别是在临床决策一致性方面的表现。解决方案的关键在于构建MedGUIDE基准,该基准基于55个经过筛选的NCCN(国家综合癌症网络)决策树,涵盖17种癌症类型,并通过LLM生成的临床情景创建大量多选诊断问题,结合专家标注的奖励模型和LLM作为评判者的集成方法,筛选出高质量样本,从而系统评估LLMs在遵循结构化指南任务中的表现。

链接: https://arxiv.org/abs/2505.11613
作者: Xiaomin Li,Mingye Gao,Yuexing Hao,Taoran Li,Guangya Wan,Zihan Wang,Yijun Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Clinical guidelines, typically structured as decision trees, are central to evidence-based medical practice and critical for ensuring safe and accurate diagnostic decision-making. However, it remains unclear whether Large Language Models (LLMs) can reliably follow such structured protocols. In this work, we introduce MedGUIDE, a new benchmark for evaluating LLMs on their ability to make guideline-consistent clinical decisions. MedGUIDE is constructed from 55 curated NCCN decision trees across 17 cancer types and uses clinical scenarios generated by LLMs to create a large pool of multiple-choice diagnostic questions. We apply a two-stage quality selection process, combining expert-labeled reward models and LLM-as-a-judge ensembles across ten clinical and linguistic criteria, to select 7,747 high-quality samples. We evaluate 25 LLMs spanning general-purpose, open-source, and medically specialized models, and find that even domain-specific LLMs often underperform on tasks requiring structured guideline adherence. We also test whether performance can be improved via in-context guideline inclusion or continued pretraining. Our findings underscore the importance of MedGUIDE in assessing whether LLMs can operate safely within the procedural frameworks expected in real-world clinical settings.
zh

[NLP-243] Probing the Vulnerability of Large Language Models to Polysemantic Interventions

【速读】: 该论文试图解决大型神经网络中神经元的多义性(polysemanticity)问题,即单个神经元编码多个无关特征的现象,这一特性是语言模型可解释性中的核心挑战,同时其对模型安全性的影响也尚未明确。论文的关键解决方案是利用稀疏自编码器(sparse autoencoders)的最新进展,分析小型模型(Pythia-70M 和 GPT-2-Small)的多义性结构,并评估其在提示、特征、标记和神经元层面受到定向隐蔽干预的脆弱性,进而证明该结构可被用于对更大规模的黑盒指令调优模型(LLaMA3.1-8B-Instruct 和 Gemma-2-9B-Instruct)实施有效干预。

链接: https://arxiv.org/abs/2505.11611
作者: Bofan Gong,Shiyang Lai,Dawn Song
机构: University of Chicago (芝加哥大学); UC Berkeley (加州大学伯克利分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Polysemanticity – where individual neurons encode multiple unrelated features – is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. At the same time, its implications for model safety are also poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings suggest not only the generalizability of the interventions but also point to a stable and transferable polysemantic structure that could potentially persist across architectures and training regimes.
zh

[NLP-244] alk to Your Slides: Efficient Slide Editing Agent with Large Language Models

【速读】: 该论文试图解决现有大型语言模型(Large Language Models, LLMs)在PowerPoint应用中主要关注幻灯片生成,而忽视了日常且繁琐的已有幻灯片编辑任务的问题。解决方案的关键在于提出Talk-to-Your-Slides,一个基于LLM的智能代理,通过COM通信直接在活动的PowerPoint会话中编辑幻灯片。该系统采用两级方法:高层处理由LLM代理解析指令并制定编辑计划,低层执行则通过Python脚本直接操作PowerPoint对象,从而实现更灵活和上下文感知的编辑能力。

链接: https://arxiv.org/abs/2505.11604
作者: Kyudan Jung,Hojun Cho,Jooyeol Yun,Jaehyeok Jang,Jagul Choo
机构: Chung-ang University (中앙大学); KAIST AI (KAIST人工智能)
类目: Computation and Language (cs.CL)
备注: 14 pages, 6 figures

点击查看摘要

Abstract:Existing research on large language models (LLMs) for PowerPoint predominantly focuses on slide generation, overlooking the common yet tedious task of editing existing slides. We introduce Talk-to-Your-Slides, an LLM-powered agent that directly edits slides within active PowerPoint sessions through COM communication. Our system employs a two-level approach: (1) high-level processing where an LLM agent interprets instructions and formulates editing plans, and (2) low-level execution where Python scripts directly manipulate PowerPoint objects. Unlike previous methods relying on predefined operations, our approach enables more flexible and contextually-aware editing. To facilitate evaluation, we present TSBench, a human-annotated dataset of 379 diverse editing instructions with corresponding slide variations. Experimental results demonstrate that Talk-to-Your-Slides significantly outperforms baseline methods in execution success rate, instruction fidelity, and editing efficiency. Our code and benchmark are available at this https URL
zh

[NLP-245] Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO

【速读】: 该论文旨在解决Group Relative Policy Optimization (GRPO)在处理所有采样响应均错误的“全负样本组”(all-negative-sample group)时无法更新策略、导致学习停滞的问题。其解决方案的关键在于引入AI反馈以在全负样本组中增加响应多样性,并通过理论分析证明这种多样化有助于改善学习动态。此外,实验验证表明该方法在不同模型规模和学习设置下均能提升性能。

链接: https://arxiv.org/abs/2505.11595
作者: Peter Chen,Xiaopeng Li,Ziniu Li,Xi Chen,Tianyi Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28 pages

点击查看摘要

Abstract:Reinforcement learning (RL) has demonstrated significant success in enhancing reasoning capabilities in large language models (LLMs). One of the most widely used RL methods is Group Relative Policy Optimization (GRPO)~\citeShao-2024-Deepseekmath, known for its memory efficiency and success in training DeepSeek-R1~\citeGuo-2025-Deepseek. However, GRPO stalls when all sampled responses in a group are incorrect – referred to as an \emphall-negative-sample group – as it fails to update the policy, hindering learning progress. The contributions of this paper are two-fold. First, we propose a simple yet effective framework that introduces response diversity within all-negative-sample groups in GRPO using AI feedback. We also provide a theoretical analysis, via a stylized model, showing how this diversification improves learning dynamics. Second, we empirically validate our approach, showing the improved performance across various model sizes (7B, 14B, 32B) in both offline and online learning settings with 10 benchmarks, including base and distilled variants. Our findings highlight that learning from all-negative-sample groups is not only feasible but beneficial, advancing recent insights from \citetXiong-2025-Minimalist.
zh

[NLP-246] ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems INTERSPEECH2025

【速读】: 该论文试图解决自动语音识别(ASR)系统在不同人口统计群体中表现存在显著差异的问题。其解决方案的关键在于引入ASR-FAIRBENCH排行榜,通过结合Meta的Fair-Speech数据集和混合效应泊松回归模型,计算出一个综合公平性评分,并将其与传统的词错误率(WER)相结合,形成公平调整的ASR评分(FAAS),从而提供一个全面评估ASR模型准确性和公平性的框架。

链接: https://arxiv.org/abs/2505.11572
作者: Anand Rai,Satyam Rahangdale,Utkarsh Anand,Animesh Mukherjee
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Paper accepted at INTERSPEECH 2025

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) systems have become ubiquitous in everyday applications, yet significant disparities in performance across diverse demographic groups persist. In this work, we introduce the ASR-FAIRBENCH leaderboard which is designed to assess both the accuracy and equity of ASR models in real-time. Leveraging the Meta’s Fair-Speech dataset, which captures diverse demographic characteristics, we employ a mixed-effects Poisson regression model to derive an overall fairness score. This score is integrated with traditional metrics like Word Error Rate (WER) to compute the Fairness Adjusted ASR Score (FAAS), providing a comprehensive evaluation framework. Our approach reveals significant performance disparities in SOTA ASR models across demographic groups and offers a benchmark to drive the development of more inclusive ASR technologies.
zh

[NLP-247] Assessing Collective Reasoning in Multi-Agent LLM s via Hidden Profile Tasks

【速读】: 该论文试图解决多智能体系统在基于大语言模型(Large Language Models, LLMs)的分布式信息整合中可能复现人类群体中的集体推理失败问题,且缺乏理论基础的评估基准。其解决方案的关键在于引入社会心理学中的隐性档案(Hidden Profile)范式作为诊断测试平台,通过不对称分配关键信息以揭示智能体间动态对集体推理的支持或阻碍作用,并将其形式化为涵盖多种场景的基准任务,从而系统评估多智能体系统的性能。

链接: https://arxiv.org/abs/2505.11556
作者: Yuxuan Li,Aoi Naito,Hirokazu Shirado
机构: Carnegie Mellon University (卡内基梅隆大学); Institute of Science Tokyo (东京科学大学); Japan Society for the Promotion of Science (日本学术振兴会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving through distributed information integration, but also risk replicating collective reasoning failures observed in human groups. Yet, no theory-grounded benchmark exists to systematically evaluate such failures. In this paper, we introduce the Hidden Profile paradigm from social psychology as a diagnostic testbed for multi-agent LLM systems. By distributing critical information asymmetrically across agents, the paradigm reveals how inter-agent dynamics support or hinder collective reasoning. We first formalize the paradigm for multi-agent decision-making under distributed knowledge and instantiate it as a benchmark with nine tasks spanning diverse scenarios, including adaptations from prior human studies. We then conduct experiments with GPT-4.1 and five other leading LLMs, including reasoning-enhanced variants, showing that multi-agent systems across all models fail to match the accuracy of single agents given complete information. While agents’ collective performance is broadly comparable to that of human groups, nuanced behavioral differences emerge, such as increased sensitivity to social desirability. Finally, we demonstrate the paradigm’s diagnostic utility by exploring a cooperation-contradiction trade-off in multi-agent LLM systems. We find that while cooperative agents are prone to over-coordination in collective settings, increased contradiction impairs group convergence. This work contributes a reproducible framework for evaluating multi-agent LLM systems and motivates future research on artificial collective intelligence and human-AI interaction.
zh

[NLP-248] AI-generated Text Detection: A Multifaceted Approach to Binary and Multiclass Classification

【速读】: 该论文旨在解决AI生成文本的检测与溯源问题,具体包括区分人类撰写文本与AI生成文本(Task A)以及识别文本所属的语言模型(Task B)。其解决方案的关键在于提出两种神经网络架构,分别针对两个子任务进行优化与简化,其中优化模型在Task A中取得了F1分数0.994的第五名成绩,而简化模型在Task B中也获得了F1分数0.627的第五名成绩,展示了其在实际应用中的有效性与可行性。

链接: https://arxiv.org/abs/2505.11550
作者: Harika Abburi,Sanmitra Bhattacharya,Edward Bowen,Nirmala Pudota
机构: Deloitte & Touche Assurance and Enterprise Risk Services India Private Limited, India; Deloitte & Touche LLP, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing across a wide range of styles and genres. However, such capabilities are prone to potential misuse, such as fake news generation, spam email creation, and misuse in academic assignments. As a result, accurate detection of AI-generated text and identification of the model that generated it are crucial for maintaining the responsible use of LLMs. In this work, we addressed two sub-tasks put forward by the Defactify workshop under AI-Generated Text Detection shared task at the Association for the Advancement of Artificial Intelligence (AAAI 2025): Task A involved distinguishing between human-authored or AI-generated text, while Task B focused on attributing text to its originating language model. For each task, we proposed two neural architectures: an optimized model and a simpler variant. For Task A, the optimized neural architecture achieved fifth place with F1 score of 0.994, and for Task B, the simpler neural architecture also ranked fifth place with F1 score of 0.627.
zh

[NLP-249] ARGET: Benchmarking Table Retrieval for Generative Tasks

【速读】: 该论文旨在解决在结构化数据中如何检索出与分析查询或任务相关的正确表的问题。其关键解决方案是引入TARGET:一个用于评估生成任务中TAble Retrieval(表检索)的基准。通过TARGET,作者分析了不同检索器在孤立情况下的性能及其对下游任务的影响,发现基于密集嵌入的检索器显著优于BM25基线,并揭示了检索器在不同元数据和数据集任务中的敏感性差异。

链接: https://arxiv.org/abs/2505.11545
作者: Xingyu Ji,Parker Glenn,Aditya G. Parameswaran,Madelon Hulsebos
机构: UC Berkeley (加州大学伯克利分校); Capital One (资本one); CWI (荷兰数学与计算机科学研究所)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Contextualizing interactions, either through conversational interfaces or agentic components, in structured data through retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline which is less effective than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers across various metadata (e.g., missing table titles), and demonstrate a stark variation of retrieval performance across datasets and tasks. TARGET is available at this https URL.
zh

[NLP-250] A Data Synthesis Method Driven by Large Language Models for Proactive Mining of Implicit User Intentions in Tourism

【速读】: 该论文旨在解决旅游领域中大型语言模型(Large Language Models, LLMs)在挖掘用户隐含意图方面的不足,以及缺乏主动引导用户明确需求的能力。其关键解决方案是提出一种基于LLM的数据合成方法——SynPT(A Data Synthesis Method Driven by LLMs for Proactive Mining of Implicit User Intentions in the Tourism),通过构建LLM驱动的用户代理和助手代理,模拟基于从中文旅游网站收集的种子数据的对话,生成包含显式推理的训练数据集SynPT-Dialog,从而提升模型主动挖掘用户隐含意图的能力。

链接: https://arxiv.org/abs/2505.11533
作者: Jinqiang Wang,Huansheng Ning,Tao Zhu,Jianguo Ding
机构: School of Computer & Communication Engineering, University of Science and Technology Beijing (计算机与通信工程学院,北京科技大学); School of Computer Science, University of South China (计算机科学学院,南华大学); Department of Computer Science, Blekinge institute of Technology (计算机科学系,布莱金厄技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the tourism domain, Large Language Models (LLMs) often struggle to mine implicit user intentions from tourists’ ambiguous inquiries and lack the capacity to proactively guide users toward clarifying their needs. A critical bottleneck is the scarcity of high-quality training datasets that facilitate proactive questioning and implicit intention mining. While recent advances leverage LLM-driven data synthesis to generate such datasets and transfer specialized knowledge to downstream models, existing approaches suffer from several shortcomings: (1) lack of adaptation to the tourism domain, (2) skewed distributions of detail levels in initial inquiries, (3) contextual redundancy in the implicit intention mining module, and (4) lack of explicit thinking about tourists’ emotions and intention values. Therefore, we propose SynPT (A Data Synthesis Method Driven by LLMs for Proactive Mining of Implicit User Intentions in the Tourism), which constructs an LLM-driven user agent and assistant agent to simulate dialogues based on seed data collected from Chinese tourism websites. This approach addresses the aforementioned limitations and generates SynPT-Dialog, a training dataset containing explicit reasoning. The dataset is utilized to fine-tune a general LLM, enabling it to proactively mine implicit user intentions. Experimental evaluations, conducted from both human and LLM perspectives, demonstrate the superiority of SynPT compared to existing methods. Furthermore, we analyze key hyperparameters and present case studies to illustrate the practical applicability of our method, including discussions on its adaptability to English-language scenarios. All code and data are publicly available.
zh

[NLP-251] SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information INTERSPEECH2025

【速读】: 该论文试图解决大型音频语言模型(Large Audio-Language Models, LALMs)在多跳推理能力方面的不足,即模型在回忆和整合多个事实以进行复杂推理任务时表现不佳的问题。现有基准测试主要关注语音和音频处理任务、对话能力及公平性,而忽视了多跳推理这一关键方面。为解决此问题,论文提出了SAKURA基准,用于评估LALMs基于语音和音频信息的多跳推理能力。解决方案的关键在于构建一个系统化的评估框架,以揭示LALMs在融合语音/音频表征进行多跳推理中的根本性挑战。

链接: https://arxiv.org/abs/2505.13237
作者: Chih-Kai Yang,Neo Ho,Yen-Ting Piao,Hung-yi Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to Interspeech 2025

点击查看摘要

Abstract:Large audio-language models (LALMs) extend the large language models with multimodal understanding in speech, audio, etc. While their performances on speech and audio-processing tasks are extensively studied, their reasoning abilities remain underexplored. Particularly, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs’ multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
zh

[NLP-252] Efficient Generation of Parameterised Quantum Circuits from Large Texts

【速读】: 该论文旨在解决如何高效且准确地将大规模文本编码为参数化量子电路(PQCs)的问题,以支持量子自然语言处理(Quantum NLP)中的语义和句法关系建模。其解决方案的关键在于利用预积图(pregroup diagrams)的树状表示,并基于对称单子范畴(symmetric monoidal categories)中语言与量子力学的组合相似性,实现对长而复杂文本(最长达6410词)的忠实编码。该方法不仅提升了编码效率,还增强了模型的可解释性和组合性。

链接: https://arxiv.org/abs/2505.13208
作者: Colin Krawchuk,Nikhil Khatri,Neil John Ortega,Dimitri Kartsaklis
机构: Quantinuum(量子云)
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Quantum approaches to natural language processing (NLP) are redefining how linguistic information is represented and processed. While traditional hybrid quantum-classical models rely heavily on classical neural networks, recent advancements propose a novel framework, DisCoCirc, capable of directly encoding entire documents as parameterised quantum circuits (PQCs), besides enjoying some additional interpretability and compositionality benefits. Following these ideas, this paper introduces an efficient methodology for converting large-scale texts into quantum circuits using tree-like representations of pregroup diagrams. Exploiting the compositional parallels between language and quantum mechanics, grounded in symmetric monoidal categories, our approach enables faithful and efficient encoding of syntactic and discourse relationships in long and complex texts (up to 6410 words in our experiments) to quantum circuits. The developed system is provided to the community as part of the augmented open-source quantum NLP package lambeq Gen II.
zh

[NLP-253] Vague Knowledge: Evidence from Analyst Reports

【速读】: 该论文试图解决现实世界中个体对未来收益的模糊知识难以量化的问题,这种知识的不可量化性使得传统数值预测方法无法充分捕捉其信息价值。论文的解决方案之关键在于强调语言在主观预期形成中的作用,特别是语言在传达模糊信息方面的独特能力。研究发现,分析师在报告中通过语言表达传递了有用的信息,而这些信息并未体现在数值预测中,且文本语气对预测误差和后续数值预测的修正具有预测能力,尤其在语言更模糊、不确定性更高或分析师更忙碌的情况下,这种关系更为显著。

链接: https://arxiv.org/abs/2505.12269
作者: Kerry Xiao,Amy Zang
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic (math.LO); General Finance (q-fin.GN)
备注:

点击查看摘要

Abstract:People in the real world often possess vague knowledge of future payoffs, for which quantification is not feasible or desirable. We argue that language, with differing ability to convey vague information, plays an important but less known-role in subjective expectations. Empirically, we find that in their reports, analysts include useful information in linguistic expressions but not numerical forecasts. Specifically, the textual tone of analyst reports has predictive power for forecast errors and subsequent revisions in numerical forecasts, and this relation becomes stronger when analyst’s language is vaguer, when uncertainty is higher, and when analysts are busier. Overall, our theory and evidence suggest that some useful information is vaguely known and only communicated through language.
zh

计算机视觉

[CV-0] Mean Flows for One-step Generative Modeling

【速读】:该论文旨在解决单步生成建模(one-step generative modeling)中模型性能不足的问题,特别是与多步扩散/流模型相比存在的差距。其解决方案的关键在于引入平均速度(average velocity)概念来表征流场,而非传统Flow Matching方法所采用的瞬时速度(instantaneous velocity),并通过推导平均速度与瞬时速度之间的明确关系来指导神经网络的训练。该方法被称为MeanFlow模型,具有自包含性,无需预训练、蒸馏或课程学习,且在ImageNet 256x256数据集上仅需一次函数评估(1-NFE)即达到3.43的FID分数,显著优于现有单步扩散/流模型。

链接: https://arxiv.org/abs/2505.13447
作者: Zhengyang Geng,Mingyang Deng,Xingjian Bai,J. Zico Kolter,Kaiming He
机构: CMU (卡内基梅隆大学); MIT (麻省理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report

点击查看摘要

Abstract:We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256x256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.
zh

[CV-1] Recollection from Pensieve: Novel View Synthesis via Learning from Uncalibrated Videos

【速读】:该论文试图解决当前大多数新颖视角合成与重建模型依赖校准相机或额外几何先验的问题,这些问题显著限制了其在大规模未校准数据中的适用性。解决方案的关键在于提出一种两阶段策略,第一阶段通过隐式重建场景并预测每帧的潜在相机和场景上下文特征,以自监督方式学习三维一致性;第二阶段则通过显式预测三维高斯基元,并引入显式高斯点云渲染损失和深度投影损失,以对齐潜在表示与物理基础的三维几何。这种分阶段的方法有效提升了模型的三维一致性与泛化能力。

链接: https://arxiv.org/abs/2505.13440
作者: Ruoyu Wang,Yi Ma,Shenghua Gao
机构: Transcengram(跨图灵); The University of Hong Kong(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Currently almost all state-of-the-art novel view synthesis and reconstruction models rely on calibrated cameras or additional geometric priors for training. These prerequisites significantly limit their applicability to massive uncalibrated data. To alleviate this requirement and unlock the potential for self-supervised training on large-scale uncalibrated videos, we propose a novel two-stage strategy to train a view synthesis model from only raw video frames or multi-view images, without providing camera parameters or other priors. In the first stage, we learn to reconstruct the scene implicitly in a latent space without relying on any explicit 3D representation. Specifically, we predict per-frame latent camera and scene context features, and employ a view synthesis model as a proxy for explicit rendering. This pretraining stage substantially reduces the optimization complexity and encourages the network to learn the underlying 3D consistency in a self-supervised manner. The learned latent camera and implicit scene representation have a large gap compared with the real 3D world. To reduce this gap, we introduce the second stage training by explicitly predicting 3D Gaussian primitives. We additionally apply explicit Gaussian Splatting rendering loss and depth projection loss to align the learned latent representations with physically grounded 3D geometry. In this way, Stage 1 provides a strong initialization and Stage 2 enforces 3D consistency - the two stages are complementary and mutually beneficial. Extensive experiments demonstrate the effectiveness of our approach, achieving high-quality novel view synthesis and accurate camera pose estimation, compared to methods that employ supervision with calibration, pose, or depth information. The code is available at this https URL.
zh

[CV-2] VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation

【速读】:该论文试图解决当前离散视觉分词器(VT)在图像生成中的性能不足问题,尤其是其在图像重建、细节保留和文本保留方面的表现远低于连续变分自编码器(VAE),导致图像重构质量下降和细节丢失。解决方案的关键在于提出VTBench,一个全面的基准测试框架,系统评估VT在三个核心任务中的表现,并通过一系列指标评估重构图像的质量,从而揭示连续VAEs在保留空间结构和语义细节方面的优势,为后续研究提供参考。

链接: https://arxiv.org/abs/2505.13439
作者: Huawei Lin,Tong Geng,Zhaozhuo Xu,Weijie Zhao
机构: Rochester Institute of Technology (罗彻斯特理工学院); University of Rochester (罗切斯特大学); Stevens Institute of Technology (斯蒂文斯理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 13 figures, 3 tables

点击查看摘要

Abstract:Autoregressive (AR) models have recently shown strong performance in image generation, where a critical component is the visual tokenizer (VT) that maps continuous pixel inputs to discrete token sequences. The quality of the VT largely defines the upper bound of AR model performance. However, current discrete VTs fall significantly behind continuous variational autoencoders (VAEs), leading to degraded image reconstructions and poor preservation of details and text. Existing benchmarks focus on end-to-end generation quality, without isolating VT performance. To address this gap, we introduce VTBench, a comprehensive benchmark that systematically evaluates VTs across three core tasks: Image Reconstruction, Detail Preservation, and Text Preservation, and covers a diverse range of evaluation scenarios. We systematically assess state-of-the-art VTs using a set of metrics to evaluate the quality of reconstructed images. Our findings reveal that continuous VAEs produce superior visual representations compared to discrete VTs, particularly in retaining spatial structure and semantic detail. In contrast, the degraded representations produced by discrete VTs often lead to distorted reconstructions, loss of fine-grained textures, and failures in preserving text and object integrity. Furthermore, we conduct experiments on GPT-4o image generation and discuss its potential AR nature, offering new insights into the role of visual tokenization. We release our benchmark and codebase publicly to support further research and call on the community to develop strong, general-purpose open-source VTs.
zh

[CV-3] FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance CVPR2025

【速读】:该论文试图解决视频生成中合成物理上合理的精细粒度人体动作仍是一个持续性挑战的问题,特别是在建模细粒度语义和复杂时间动态方面。其解决方案的关键在于提出一种名为FinePhys的细粒度人体动作生成框架,该框架结合物理机制以获得有效的骨骼引导。具体而言,FinePhys首先在线估计2D姿态,然后通过上下文学习实现2D到3D维度提升,并引入基于物理的运动重估计模块,该模块由欧拉-拉格朗日方程控制,通过双向时间更新计算关节加速度,从而提升3D姿态的稳定性与可解释性。

链接: https://arxiv.org/abs/2505.13437
作者: Dian Shao,Mingfei Shi,Shengda Xu,Haodong Chen,Yongle Huang,Binglu Wang
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2025

点击查看摘要

Abstract:Despite significant advances in video generation, synthesizing physically plausible human actions remains a persistent challenge, particularly in modeling fine-grained semantics and complex temporal dynamics. For instance, generating gymnastics routines such as “switch leap with 0.5 turn” poses substantial difficulties for current methods, often yielding unsatisfactory results. To bridge this gap, we propose FinePhys, a Fine-grained human action generation framework that incorporates Physics to obtain effective skeletal guidance. Specifically, FinePhys first estimates 2D poses in an online manner and then performs 2D-to-3D dimension lifting via in-context learning. To mitigate the instability and limited interpretability of purely data-driven 3D poses, we further introduce a physics-based motion re-estimation module governed by Euler-Lagrange equations, calculating joint accelerations via bidirectional temporal updating. The physically predicted 3D poses are then fused with data-driven ones, offering multi-scale 2D heatmap guidance for the diffusion process. Evaluated on three fine-grained action subsets from FineGym (FX-JUMP, FX-TURN, and FX-SALTO), FinePhys significantly outperforms competitive baselines. Comprehensive qualitative results further demonstrate FinePhys’s ability to generate more natural and plausible fine-grained human actions.
zh

[CV-4] KinTwin: Imitation Learning with Torque and Muscle Driven Biomechanical Models Enables Precise Replication of Able-Bodied and Impaired Movement from Markerless Motion Capture

【速读】:该论文试图解决如何通过模仿学习(imitation learning)从运动数据中推断出运动的物理基础问题,例如地面反作用力、关节扭矩或肌肉激活等,从而实现高质量的运动分析。解决方案的关键在于利用一个精确的生物力学模型,并基于来自健全人和功能障碍个体的大规模运动数据训练模仿学习策略,以准确复现多种运动模式并推断出具有临床意义的运动参数。

链接: https://arxiv.org/abs/2505.13436
作者: R. James Cotton
机构: Northwestern University (西北大学); Shirley Ryan AbilityLab (Shirley Ryan能力实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Broader access to high-quality movement analysis could greatly benefit movement science and rehabilitation, such as allowing more detailed characterization of movement impairments and responses to interventions, or even enabling early detection of new neurological conditions or fall risk. While emerging technologies are making it easier to capture kinematics with biomechanical models, or how joint angles change over time, inferring the underlying physics that give rise to these movements, including ground reaction forces, joint torques, or even muscle activations, is still challenging. Here we explore whether imitation learning applied to a biomechanical model from a large dataset of movements from able-bodied and impaired individuals can learn to compute these inverse dynamics. Although imitation learning in human pose estimation has seen great interest in recent years, our work differences in several ways: we focus on using an accurate biomechanical model instead of models adopted for computer vision, we test it on a dataset that contains participants with impaired movements, we reported detailed tracking metrics relevant for the clinical measurement of movement including joint angles and ground contact events, and finally we apply imitation learning to a muscle-driven neuromusculoskeletal model. We show that our imitation learning policy, KinTwin, can accurately replicate the kinematics of a wide range of movements, including those with assistive devices or therapist assistance, and that it can infer clinically meaningful differences in joint torques and muscle activations. Our work demonstrates the potential for using imitation learning to enable high-quality movement analysis in clinical practice.
zh

[CV-5] Understanding Complexity in VideoQA via Visual Program Generation

【速读】:该论文试图解决视频问答(VideoQA)中查询复杂性分析的问题,即如何准确评估问题对机器学习模型的难度。传统方法依赖人类专家设计具有挑战性的问题,但实验表明人类难以准确预测哪些问题对模型而言是困难的。论文提出的解决方案关键在于利用视觉问答中的代码生成技术,将生成代码的复杂度作为问题难度的代理指标,该指标与模型性能的相关性显著优于人类估计。通过从代码中提取细粒度的语义原语,该方法能够有效衡量问题复杂性,并具备良好的可扩展性。

链接: https://arxiv.org/abs/2505.13429
作者: Cristobal Eyzaguirre,Igor Vasiljevic,Achal Dave,Jiajun Wu,Rares Andrei Ambrus,Thomas Kollar,Juan Carlos Niebles,Pavel Tokmakov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
zh

[CV-6] MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂多步骤推理任务中逻辑不一致或部分正确的问题,其核心挑战在于缺乏对中间推理步骤的细粒度监督。解决方案的关键是提出MM-PRM,一个在全自动化、可扩展框架中训练的过程奖励模型(Process Reward Model, PRM),通过利用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)生成大量步骤级标注数据,并结合软标签、较小的学习率和路径多样性优化PRM性能,从而显著提升模型在领域内和领域外基准上的表现。

链接: https://arxiv.org/abs/2505.13427
作者: Lingxiao Du,Fanqing Meng,Zongkai Liu,Zhixiang Zhou,Ping Luo,Qiaosheng Zhang,Wenqi Shao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our codes and data at this https URL.
zh

[CV-7] G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

【速读】:该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在交互式、视觉丰富的环境中(如游戏)难以将自身的多模态能力转化为有效决策的问题,即“知道-做到”差距。其解决方案的关键在于引入VLM-Gym,一个经过精心设计的强化学习(Reinforcement Learning, RL)环境,提供多样化的视觉游戏、统一接口以及可调整的组合难度,以支持可扩展的多游戏并行训练。通过VLM-Gym,研究人员训练了G0模型,这些模型通过纯强化学习驱动的自我进化展现出涌现的感知与推理模式,并进一步通过G1模型优化,引入感知增强的冷启动先验以应对游戏多样性带来的挑战。

链接: https://arxiv.org/abs/2505.13426
作者: Liang Chen,Hongcheng Gao,Tianyu Liu,Zhiqi Huang,Flood Sung,Xinyu Zhou,Yuxin Wu,Baobao Chang
机构: Peking University (北京大学); UCAS (中国科学院大学); Moonshot AI (月神人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 14 figures, code released at this https URL

点击查看摘要

Abstract:Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This ``knowing-doing’’ gap significantly limits their potential as autonomous agents, as leading VLMs often performing badly in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code including VLM-Gym and RL training are released at this https URL to foster future research in advancing VLMs as capable interactive agents.
zh

[CV-8] FEALLM : Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning

【速读】:该论文旨在解决传统面部情感分析(Facial Emotion Analysis, FEA)方法在可解释性、泛化能力和推理能力方面的局限性,以及多模态大语言模型(Multimodal Large Language Models, MLLMs)在FEA任务中因缺乏专用数据集和难以捕捉面部表情(Facial Expressions, FEs)与动作单元(Action Units, AUs)之间复杂关系而面临的挑战。其解决方案的关键在于构建一个全新的FEA指令数据集,提供准确且对齐的FE和AU描述,并建立它们之间的因果推理关系,同时提出一种新的MLLM架构FEALLM,以更精细地捕捉面部信息,从而提升FEA任务的性能与泛化能力。

链接: https://arxiv.org/abs/2505.13419
作者: Zhuozhao Hu,Kaishen Yuan,Xin Liu,Zitong Yu,Yuan Zong,Jingang Shi,Huanjing Yue,Jingyu Yang
机构: Tianjin University (天津大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Lappeenranta-Lahti University of Technology LUT (拉彭兰塔-劳里亚理工大学 LUT); Great Bay University (大湾大学); Southeast University (东南大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Facial Emotion Analysis (FEA) plays a crucial role in visual affective computing, aiming to infer a person’s emotional state based on facial data. Scientifically, facial expressions (FEs) result from the coordinated movement of facial muscles, which can be decomposed into specific action units (AUs) that provide detailed emotional insights. However, traditional methods often struggle with limited interpretability, constrained generalization and reasoning abilities. Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, while they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs. To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench. Moreover, we propose FEALLM, a novel MLLM architecture designed to capture more detailed facial information, enhancing its capability in FEA tasks. Our model demonstrates strong performance on FEABench and impressive generalization capability through zero-shot evaluation on various datasets, including RAF-DB, AffectNet, BP4D, and DISFA, showcasing its robustness and effectiveness in FEA tasks. The dataset and code will be available at this https URL.
zh

[CV-9] Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks IJCAI2025

【速读】:该论文旨在解决抽象视觉推理(Abstract Visual Reasoning, AVR)领域中模型在分布外(out-of-distribution, o.o.d.)设置下的泛化能力不足问题。尽管在独立同分布(i.i.d.)场景下模型性能已有显著提升,但面对新的测试分布时,现有模型仍面临挑战。论文提出的解决方案是构建一种新型神经网络架构——归一化组卷积路径模型(Pathways of Normalized Group Convolution model, PoNG),其关键在于引入组卷积、归一化机制以及并行设计,从而增强模型的泛化能力。实验结果表明,该模型在多个AVR基准测试中表现出色,优于现有方法。

链接: https://arxiv.org/abs/2505.13391
作者: Mikołaj Małkiński,Jacek Mańdziuk
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)

点击查看摘要

Abstract:The abstract visual reasoning (AVR) domain presents a diverse suite of analogy-based tasks devoted to studying model generalization. Recent years have brought dynamic progress in the field, particularly in i.i.d. scenarios, in which models are trained and evaluated on the same data distributions. Nevertheless, o.o.d. setups that assess model generalization to new test distributions remain challenging even for the most recent models. To advance generalization in AVR tasks, we present the Pathways of Normalized Group Convolution model (PoNG), a novel neural architecture that features group convolution, normalization, and a parallel design. We consider a wide set of AVR benchmarks, including Raven’s Progressive Matrices and visual analogy problems with both synthetic and real-world images. The experiments demonstrate strong generalization capabilities of the proposed model, which in several settings outperforms the existing literature methods.
zh

[CV-10] Faster Video Diffusion with Trainable Sparse Attention

【速读】:该论文旨在解决视频扩散变压器(Video Diffusion Transformers, DiTs)在扩展过程中受限于二次3D注意力机制的问题,尽管大部分注意力质量集中在一小部分位置上。其解决方案的关键在于提出一种可训练的、硬件高效的稀疏注意力机制(VSA),该机制在训练和推理阶段均替代全注意力。VSA通过轻量级粗粒度阶段将标记池化为块并识别高权重的关键标记,随后在细粒度阶段仅在这些块内计算标记级别的注意力,以块计算布局确保硬件效率。这一方法实现了单一可微内核,支持端到端训练,无需后期分析,并保持了85%的FlashAttention3 MFU性能。

链接: https://arxiv.org/abs/2505.13389
作者: Peiyuan Zhang,Haofeng Huang,Yongqi Chen,Will Lin,Zhengzhong Liu,Ion Stoica,Eric P. Xing,Hao Zhang
机构: UC San Diego(加州大学圣地亚哥分校); MBZUAI(MBZUAI); UC Berkeley(加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at \emphboth training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight \emphcritical tokens; a fine stage computes token-level attention only inside those tiles subjecting to block computing layout to ensure hard efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53 \times with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6 \times and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.
zh

[CV-11] RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

【速读】:该论文试图解决视频动作迁移(video motion transfer)问题,即如何将参考视频中的运动信息有效地转移到生成视频中,同时保持与文本提示的一致性并避免重复生成。解决方案的关键在于通过修改扩散变换器中的旋转位置编码(RoPE),利用参考视频的密集光流提取运动偏移,并将其应用于RoPE的复数张量,从而在生成过程中编码运动信息;随后在去噪时间步通过流匹配目标进行轨迹对齐优化,最后引入基于参考视频傅里叶变换相位成分的正则化项以提升生成质量。

链接: https://arxiv.org/abs/2505.13344
作者: Ahmet Berke Gokmen,Yigit Ekin,Bahri Batuhan Bilecen,Aysegul Dundar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: this https URL

点击查看摘要

Abstract:We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video’s Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.
zh

[CV-12] Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning

【速读】:该论文旨在解决统一的面部攻击检测(Unified Face Attack Detection, UAD)模型在同时应对物理介质攻击和数字编辑攻击时存在的不足,这些问题主要源于缺乏足够的基准数据集和可靠的分类标准。其解决方案的关键在于提出UniAttackData+数据集,这是目前最全面且复杂的伪造技术集合,包含2,875个身份及其54种伪造样本,共计697,347个视频,以丰富模型的训练数据;同时引入基于视觉-语言模型的分层提示调优框架(HiPTune),通过构建视觉提示树并自适应地剪枝提示,使模型能够在不同语义空间中探索多种分类标准,从而提升对多样攻击的检测能力。

链接: https://arxiv.org/abs/2505.13327
作者: Ajian Liu,Haocheng Yuan,Xiao Guo,Hui Ma,Wanyi Zhuang,Changtao Miao,Yan Hong,Chuanbiao Song,Jun Lan,Qi Chu,Tao Gong,Yanyan Liang,Weiqiang Wang,Jun Wan,Xiaoming Liu,Zhen Lei
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA); Macau University of Science and Technology; Michigan State University; University of Science and Technology of China; Alibaba Group
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Presentation Attack Detection and Face Forgery Detection are designed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes respectively. But separate training of these two models makes them vulnerable to unknown attacks and burdens deployment environments. The lack of a Unified Face Attack Detection model to handle both types of attacks is mainly due to two factors. First, there’s a lack of adequate benchmarks for models to explore. Existing UAD datasets have limited attack types and samples, restricting the model’s ability to address advanced threats. To address this, we propose UniAttackDataPlus (UniAttackData+), the most extensive and sophisticated collection of forgery techniques to date. It includes 2,875 identities and their 54 kinds of falsified samples, totaling 697,347 videos. Second, there’s a lack of a reliable classification criterion. Current methods try to find an arbitrary criterion within the same semantic space, which fails when encountering diverse attacks. So, we present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework (HiPTune) that adaptively explores multiple classification criteria from different semantic spaces. We build a Visual Prompt Tree to explore various classification rules hierarchically. Then, by adaptively pruning the prompts, the model can select the most suitable prompts to guide the encoder to extract discriminative features at different levels in a coarse-to-fine way. Finally, to help the model understand the classification criteria in visual space, we propose a Dynamically Prompt Integration module to project the visual prompts to the text encoder for more accurate semantics. Experiments on 12 datasets have shown the potential to inspire further innovations in the UAD field.
zh

[CV-13] VesselGPT : Autoregressive Modeling of Vascular Geometry

【速读】:该论文旨在解决解剖树(anatomical trees)在临床诊断和治疗规划中的准确表示问题,尤其是其复杂且多样的几何结构带来的挑战。解决方案的关键在于引入一种基于自回归(autoregressive)的方法,通过使用VQ-VAE架构将血管结构嵌入到一个学习的离散词汇表中,随后利用GPT-2模型对其进行自回归生成。该方法能够有效捕捉复杂的几何形态和分支模式,从而实现逼真的血管树合成。

链接: https://arxiv.org/abs/2505.13318
作者: Paula Feldman,Martin Sinnona,Viviana Siless,Claudio Delrieux,Emmanuel Iarussi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Anatomical trees are critical for clinical diagnosis and treatment planning, yet their complex and diverse geometry make accurate representation a significant challenge. Motivated by the latest advances in large language models, we introduce an autoregressive method for synthesizing anatomical trees. Our approach first embeds vessel structures into a learned discrete vocabulary using a VQ-VAE architecture, then models their generation autoregressively with a GPT-2 model. This method effectively captures intricate geometries and branching patterns, enabling realistic vascular tree synthesis. Comprehensive qualitative and quantitative evaluations reveal that our technique achieves high-fidelity tree reconstruction with compact discrete representations. Moreover, our B-spline representation of vessel cross-sections preserves critical morphological details that are often overlooked in previous’ methods parameterizations. To the best of our knowledge, this work is the first to generate blood vessels in an autoregressive manner. Code, data, and trained models will be made available.
zh

[CV-14] Denoising Diffusion Probabilistic Model for Point Cloud Compression at Low Bit-Rates ICME2025

【速读】:该论文旨在解决低比特率下点云高效压缩的问题(low-bit-rate point cloud compression),这一问题在带宽受限的应用中尤为关键。现有方法主要关注高保真重建,导致压缩所需比特数较高。论文提出的解决方案是基于“Denoising Diffusion Probabilistic Model”(DDPM)的点云压缩架构(DDPM-PCC),其关键在于使用PointNet编码器生成生成过程的条件向量,并通过可学习的矢量量化器对其进行量化,从而在保持质量的同时实现低比特率压缩。

链接: https://arxiv.org/abs/2505.13316
作者: Gabriele Spadaro,Alberto Presta,Jhony H. Giraldo,Marco Grangetto,Wei Hu,Giuseppe Valenzise,Attilio Fiandrotti,Enzo Tartaglione
机构: LTCI, Télécom Paris, Institut Polytechnique de Paris, France (LTCI, 法国电信巴黎,巴黎综合理工学院); University of Turin, Italy (都灵大学, 意大利); Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所,北京大学); Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des Signaux et Systèmes (巴黎-萨克雷大学, 法国国家科学研究中心, 中央理工-巴黎高科, 信号与系统实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 5 figures, accepted at ICME 2025

点击查看摘要

Abstract:Efficient compression of low-bit-rate point clouds is critical for bandwidth-constrained applications. However, existing techniques mainly focus on high-fidelity reconstruction, requiring many bits for compression. This paper proposes a “Denoising Diffusion Probabilistic Model” (DDPM) architecture for point cloud compression (DDPM-PCC) at low bit-rates. A PointNet encoder produces the condition vector for the generation, which is then quantized via a learnable vector quantizer. This configuration allows to achieve a low bitrates while preserving quality. Experiments on ShapeNet and ModelNet40 show improved rate-distortion at low rates compared to standardized and state-of-the-art approaches. We publicly released the code at this https URL.
zh

[CV-15] Stonefish-scenes: A synthetically generated dataset for underwater event-based optical flow prediction tasks

【速读】:该论文试图解决事件视觉(event-based vision)与脉冲神经网络(Spiking Neural Networks, SNNs)在水下应用中缺乏标注数据集的问题,这限制了其在自主水下航行器(Autonomous Underwater Vehicles, AUVs)中的集成。解决方案的关键在于提出eStonefish-scenes合成事件光流数据集,并配套数据生成流水线,以创建可定制的水下环境,同时引入eWiz库以简化事件数据的处理与分析。

链接: https://arxiv.org/abs/2505.13309
作者: Jad Mansour,Sebastian Realpe,Hayat Rajani,Michele Grimaldi,Rafael Garcia,Nuno Gracias
机构: Computer Vision and Robotics Research Institute (ViCOROB), Universitat de Girona (UdG), Spain; School of Engineering & Physical Sciences, Heriot-Watt University, Edinburgh, UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IJRR

点击查看摘要

Abstract:The combined use of event-based vision and Spiking Neural Networks (SNNs) is expected to significantly impact robotics, particularly in tasks like visual odometry and obstacle avoidance. While existing real-world event-based datasets for optical flow prediction, typically captured with Unmanned Aerial Vehicles (UAVs), offer valuable insights, they are limited in diversity, scalability, and are challenging to collect. Moreover, there is a notable lack of labelled datasets for underwater applications, which hinders the integration of event-based vision with Autonomous Underwater Vehicles (AUVs). To address this, synthetic datasets could provide a scalable solution while bridging the gap between simulation and reality. In this work, we introduce eStonefish-scenes, a synthetic event-based optical flow dataset based on the Stonefish simulator. Along with the dataset, we present a data generation pipeline that enables the creation of customizable underwater environments. This pipeline allows for simulating dynamic scenarios, such as biologically inspired schools of fish exhibiting realistic motion patterns, including obstacle avoidance and reactive navigation around corals. Additionally, we introduce a scene generator that can build realistic reef seabeds by randomly distributing coral across the terrain. To streamline data accessibility, we present eWiz, a comprehensive library designed for processing event-based data, offering tools for data loading, augmentation, visualization, encoding, and training data generation, along with loss functions and performance metrics.
zh

[CV-16] GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal Retrieval

【速读】:该论文旨在解决少样本跨模态检索中由于训练样本有限导致的潜在语义空间中的内模态偏差和跨模态偏差问题,这些问题限制了检索的准确性。其解决方案的关键在于提出一种名为GCRDP的方法,该方法通过高斯混合模型(GMM)有效建模数据的多峰分布,并引入多正样本对比学习机制进行全面特征建模,同时提出一种新的跨模态语义对齐策略,以约束图像与文本特征分布之间的相对距离,从而提升跨模态表示的准确性。

链接: https://arxiv.org/abs/2505.13306
作者: Chengsong Sun,Weiping Li,Xiang Li,Yuankun Liu,Lianlei Shan
机构: School of Software and Microelectronics, Peking University (北京大学软件与微电子学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples, enabling the model to handle unseen classes during inference. Unlike traditional cross-modal retrieval tasks, which assume that both training and testing data share the same class distribution, few-shot retrieval involves data with sparse representations across modalities. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data, resulting in two main biases in the latent semantic space: intra-modal bias, where sparse samples fail to capture intra-class diversity, and inter-modal bias, where misalignments between image and text distributions exacerbate the semantic gap. These biases hinder retrieval accuracy. To address these issues, we propose a novel method, GCRDP, for few-shot cross-modal retrieval. This approach effectively captures the complex multi-peak distribution of data using a Gaussian Mixture Model (GMM) and incorporates a multi-positive sample contrastive learning mechanism for comprehensive feature modeling. Additionally, we introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions, thereby improving the accuracy of cross-modal representations. We validate our approach through extensive experiments on four benchmark datasets, demonstrating superior performance over six state-of-the-art methods.
zh

[CV-17] DD-Ranking: Rethinking the Evaluation of Dataset Distillation

【速读】:该论文试图解决当前数据集蒸馏(Dataset Distillation, DD)评估体系中准确性(accuracy)作为评价指标的可靠性问题。研究表明,现有方法的性能提升往往源于额外的技术手段而非数据本身的质量,甚至随机采样的图像也能取得较好效果,这表明现有的评估设置存在偏差,阻碍了DD技术的发展。论文提出的解决方案关键在于提出DD-Ranking,一个统一的评估框架及新的通用评估指标,旨在通过关注蒸馏数据集的实际信息增强,提供更全面和公平的评价标准。

链接: https://arxiv.org/abs/2505.13300
作者: Zekai Li,Xinhao Zhong,Samir Khaki,Zhiyuan Liang,Yuhao Zhou,Mingjia Shi,Ziqiao Wang,Xuanlei Zhao,Wangbo Zhao,Ziheng Qin,Mengxuan Wu,Pengfei Zhou,Haonan Wang,David Junhao Zhang,Jia-Wei Liu,Shaobo Wang,Dai Liu,Linfeng Zhang,Guang Li,Kun Wang,Zheng Zhu,Zhiheng Ma,Joey Tianyi Zhou,Jiancheng Lv,Yaochu Jin,Peihao Wang,Kaipeng Zhang,Lingjuan Lyu,Yiran Huang,Zeynep Akata,Zhiwei Deng,Xindi Wu,George Cazenavette,Yuzhang Shang,Justin Cui,Jindong Gu,Qian Zheng,Hao Ye,Shuo Wang,Xiaobo Wang,Yan Yan,Angela Yao,Mike Zheng Shou,Tianlong Chen,Hakan Bilen,Baharan Mirzasoleiman,Manolis Kellis,Konstantinos N. Plataniotis,Zhangyang Wang,Bo Zhao,Yang You,Kai Wang
机构: Google DeepMind(谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:In recent years, dataset distillation has provided a reliable solution for data compression, where models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To further improve the performance of synthetic datasets, various training pipelines and optimization objectives have been proposed, greatly advancing the field of dataset distillation. Recent decoupled dataset distillation methods introduce soft labels and stronger data augmentation during the post-evaluation phase and scale dataset distillation up to larger datasets (e.g., ImageNet-1K). However, this raises a question: Is accuracy still a reliable metric to fairly evaluate dataset distillation methods? Our empirical findings suggest that the performance improvements of these methods often stem from additional techniques rather than the inherent quality of the images themselves, with even randomly sampled images achieving superior results. Such misaligned evaluation settings severely hinder the development of DD. Therefore, we propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics to uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research advancements.
zh

[CV-18] RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization

【速读】:该论文试图解决现实世界数据中存在未知或近似对称性,而现有等变网络必须在训练前指定固定的变换群(如连续的SO(2)旋转)导致的性能下降问题。解决方案的关键在于提出RECON框架,该框架从无标签数据中发现每个输入的内在对称性分布,并通过类别-姿态分解和数据驱动的归一化方法,将任意参考系对齐到一个共同的自然姿态,从而生成可直接比较和解释的对称性描述符。

链接: https://arxiv.org/abs/2505.13289
作者: Alonso Urbano,David W. Romero,Max Zimmer,Sebastian Pokutta
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-world data often exhibits unknown or approximate symmetries, yet existing equivariant networks must commit to a fixed transformation group prior to training, e.g., continuous SO(2) rotations. This mismatch degrades performance when the actual data symmetries differ from those in the transformation group. We introduce RECON, a framework to discover each input’s intrinsic symmetry distribution from unlabeled data. RECON leverages class-pose decompositions and applies a data-driven normalization to align arbitrary reference frames into a common natural pose, yielding directly comparable and interpretable symmetry descriptors. We demonstrate effective symmetry discovery on 2D image benchmarks and – for the first time – extend it to 3D transformation groups, paving the way towards more flexible equivariant modeling.
zh

[CV-19] Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts

【速读】:该论文试图解决计算机视觉模型与人类对几何和拓扑(GT)概念敏感性之间是否具有对齐的问题,即探讨GT概念是先天存在的还是通过日常环境互动“免费学习”的。解决方案的关键在于使用三种类型的计算机视觉模型(卷积神经网络、基于Transformer的模型和视觉语言模型)在一项测试43个GT概念的奇数出列任务中评估其整体性能和与人类表现的对齐程度,结果显示基于Transformer的模型在准确性和与儿童表现的一致性上优于其他模型,而视觉语言模型则表现出较差的性能和与人类表现的偏离,表明多模态整合可能对抽象几何敏感性产生不利影响。

链接: https://arxiv.org/abs/2505.13281
作者: Zekun Wang,Sashank Varma
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, CosSci 2025

点击查看摘要

Abstract:With the rapid improvement of machine learning (ML) models, cognitive scientists are increasingly asking about their alignment with how humans think. Here, we ask this question for computer vision models and human sensitivity to geometric and topological (GT) concepts. Under the core knowledge account, these concepts are innate and supported by dedicated neural circuitry. In this work, we investigate an alternative explanation, that GT concepts are learned ``for free’’ through everyday interaction with the environment. We do so using computer visions models, which are trained on large image datasets. We build on prior studies to investigate the overall performance and human alignment of three classes of models – convolutional neural networks (CNNs), transformer-based models, and vision-language models – on an odd-one-out task testing 43 GT concepts spanning seven classes. Transformer-based models achieve the highest overall accuracy, surpassing that of young children. They also show strong alignment with children’s performance, finding the same classes of concepts easy vs. difficult. By contrast, vision-language models underperform their vision-only counterparts and deviate further from human profiles, indicating that naïve multimodality might compromise abstract geometric sensitivity. These findings support the use of computer vision models to evaluate the sufficiency of the learning account for explaining human sensitivity to GT concepts, while also suggesting that integrating linguistic and visual representations might have unpredicted deleterious consequences.
zh

[CV-20] Event-Driven Dynamic Scene Depth Completion

【速读】:该论文旨在解决动态场景中深度补全(depth completion)的问题,这一问题由于自身运动(ego-motion)和物体运动导致RGB图像和LiDAR测量等输入模态的质量严重退化。解决方案的关键在于提出EventDC框架,其核心组件包括事件调制对齐(Event-Modulated Alignment, EMA)和局部深度滤波(Local Depth Filtering, LDF),这两个模块通过基于运动敏感事件流的偏移量和权重自适应学习,实现了更精确的特征对齐与深度估计优化。

链接: https://arxiv.org/abs/2505.13279
作者: Zhiqiang Yan,Jianhao Jiao,Zhengxue Wang,Gim Hee Lee
机构: NUS(新加坡国立大学); UCL(伦敦大学学院); NJUST(南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:Depth completion in dynamic scenes poses significant challenges due to rapid ego-motion and object motion, which can severely degrade the quality of input modalities such as RGB images and LiDAR measurements. Conventional RGB-D sensors often struggle to align precisely and capture reliable depth under such conditions. In contrast, event cameras with their high temporal resolution and sensitivity to motion at the pixel level provide complementary cues that are %particularly beneficial in dynamic this http URL this end, we propose EventDC, the first event-driven depth completion framework. It consists of two key components: Event-Modulated Alignment (EMA) and Local Depth Filtering (LDF). Both modules adaptively learn the two fundamental components of convolution operations: offsets and weights conditioned on motion-sensitive event streams. In the encoder, EMA leverages events to modulate the sampling positions of RGB-D features to achieve pixel redistribution for improved alignment and fusion. In the decoder, LDF refines depth estimations around moving objects by learning motion-aware masks from events. Additionally, EventDC incorporates two loss terms to further benefit global alignment and enhance local depth recovery. Moreover, we establish the first benchmark for event-based depth completion comprising one real-world and two synthetic datasets to facilitate future research. Extensive experiments on this benchmark demonstrate the superiority of our EventDC.
zh

[CV-21] DB3D-L: Depth-aware BEV Feature Transformation for Accurate 3D Lane Detection

【速读】:该论文试图解决从前视图像(Front-View, FV)中构建精确的鸟瞰图(Birds-Eye-View, BEV)特征的问题,这一问题由于缺乏深度信息而受到限制,导致以往方法通常依赖于平坦地面平面的假设。解决方案的关键在于提出一种基于深度感知的BEV特征变换方法,其中通过集成深度网络获取关键深度信息以简化视图变换的复杂性,并引入特征压缩模块和融合模块,实现FV特征与深度特征的有效融合,从而提升3D车道检测的准确性。

链接: https://arxiv.org/abs/2505.13266
作者: Yehao Liu,Xiaosu Xu,Zijian Wang,Yiqing Yao
机构: Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Lane detection plays an important role in autonomous driving. Recent advances primarily build Birds-Eye-View (BEV) feature from front-view (FV) images to perceive 3D information of Lane more effectively. However, constructing accurate BEV information from FV image is limited due to the lacking of depth information, causing previous works often rely heavily on the assumption of a flat ground plane. Leveraging monocular depth estimation to assist in constructing BEV features is less constrained, but existing methods struggle to effectively integrate the two tasks. To address the above issue, in this paper, an accurate 3D lane detection method based on depth-aware BEV feature transtormation is proposed. In detail, an effective feature extraction module is designed, in which a Depth Net is integrated to obtain the vital depth information for 3D perception, thereby simplifying the complexity of view transformation. Subquently a feature reduce module is proposed to reduce height dimension of FV features and depth features, thereby enables effective fusion of crucial FV features and depth features. Then a fusion module is designed to build BEV feature from prime FV feature and depth information. The proposed method performs comparably with state-of-the-art methods on both synthetic Apollo, realistic OpenLane datasets.
zh

[CV-22] Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)微调在多模态推理任务中因问题难度分布不均而导致的训练效率与效果受限的问题。其解决方案的关键在于通过显式建模问题难度先验信息,从三个层面优化训练过程:首先,通过离线数据筛选去除过于简单或极端困难的提示以提升梯度质量;其次,采用在线优势差异机制,利用群体经验准确率作为难度代理,自适应地重新加权优势估计;最后,在第二阶段训练中引入难度提示作为显式引导,促使模型调整推理深度并进行反思性验证。

链接: https://arxiv.org/abs/2505.13261
作者: Mingrui Chen,Haogeng Liu,Hao Liang,Huaibo Huang,Wentao Zhang,Ran He
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (人工智能学院,中国科学院大学); NLPR&MAIS, Institute of Automation, Chinese Academy of Sciences (模式识别国家重点实验室与媒体智能系统重点实验室,中国科学院自动化研究所); Peking University (北京大学); Zhongguancun Academy (中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we investigate how explicitly modeling problem’s difficulty prior information shapes the effectiveness of reinforcement learning based fine-tuning for multimodal reasoning. Our exploration mainly comprises of following three perspective: First, through offline data curation, we analyze the U-shaped difficulty distribution of two given datasets using the base model by multi-round sampling, and then filter out prompts that are either too simple or extremely difficult to provide meaningful gradients and perform subsequent two-stage training. Second, we implement an online advantage differentiation, computing group-wise empirical accuracy as a difficulty proxy to adaptively reweight advantages estimation, providing stronger learning signals for more challenging problems. Finally, we introduce difficulty hints as explicit prompts for more complex samples in the second training stage, encouraging the model to calibrate its reasoning depth and perform reflective validation checks. Our comprehensive approach demonstrates significant performances across various multi-modal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
zh

[CV-23] Joint Depth and Reflectivity Estimation using Single-Photon LiDAR

【速读】:该论文旨在解决单光子激光雷达(SP-LiDAR)在快速运动场景中同时恢复深度和反射率的问题,传统方法通常分别或顺序地处理这两个信息,且常规的三维直方图构建在动态场景中效率较低。论文提出的解决方案关键在于通过理论分析揭示深度与反射率之间的相互关联,并提出一种名为SPLiDER的新重建方法,该方法利用两者共享的信息以提升信号恢复效果,从而在合成与真实SP-LiDAR数据上实现更优的联合重建质量。

链接: https://arxiv.org/abs/2505.13250
作者: Hashan K. Weerasooriya,Prateek Chennuri,Weijian Zhang,Istvan Gyongy,Stanley H. Chan
机构: Purdue University (普渡大学); The University of Edinburgh (爱丁堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Single-Photon Light Detection and Ranging (SP-LiDAR is emerging as a leading technology for long-range, high-precision 3D vision tasks. In SP-LiDAR, timestamps encode two complementary pieces of information: pulse travel time (depth) and the number of photons reflected by the object (reflectivity). Existing SP-LiDAR reconstruction methods typically recover depth and reflectivity separately or sequentially use one modality to estimate the other. Moreover, the conventional 3D histogram construction is effective mainly for slow-moving or stationary scenes. In dynamic scenes, however, it is more efficient and effective to directly process the timestamps. In this paper, we introduce an estimation method to simultaneously recover both depth and reflectivity in fast-moving scenes. We offer two contributions: (1) A theoretical analysis demonstrating the mutual correlation between depth and reflectivity and the conditions under which joint estimation becomes beneficial. (2) A novel reconstruction method, “SPLiDER”, which exploits the shared information to enhance signal recovery. On both synthetic and real SP-LiDAR data, our method outperforms existing approaches, achieving superior joint reconstruction quality.
zh

[CV-24] WriteViT: Handwritten Text Generation with Vision Transformer

【速读】:该论文试图解决从单个示例中泛化手写风格的问题,特别是在低数据环境下,机器难以捕捉细微的空间和风格线索。其解决方案的关键在于引入WriteViT框架,该框架利用Vision Transformers (ViT)作为核心组件,包括基于ViT的书写者识别器以提取风格嵌入、多尺度生成器结合条件位置编码(CPE)的Transformer编码器-解码器结构,以及轻量级的ViT识别器,从而更有效地捕捉细粒度笔画细节和高层次风格信息。

链接: https://arxiv.org/abs/2505.13235
作者: Dang Hoai Nam,Huynh Tong Dang Khoa,Vo Nguyen Le Duy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Humans can quickly generalize handwriting styles from a single example by intuitively separating content from style. Machines, however, struggle with this task, especially in low-data settings, often missing subtle spatial and stylistic cues. Motivated by this gap, we introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT), a family of models that have shown strong performance across various computer vision tasks. WriteViT integrates a ViT-based Writer Identifier for extracting style embeddings, a multi-scale generator built with Transformer encoder-decoder blocks enhanced by conditional positional encoding (CPE), and a lightweight ViT-based recognizer. While previous methods typically rely on CNNs or CRNNs, our design leverages transformers in key components to better capture both fine-grained stroke details and higher-level style information. Although handwritten text synthesis has been widely explored, its application to Vietnamese – a language rich in diacritics and complex typography – remains limited. Experiments on Vietnamese and English datasets demonstrate that WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios. These results highlight the promise of transformer-based designs for multilingual handwriting generation and efficient style adaptation.
zh

[CV-25] From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection

【速读】:该论文旨在解决预训练视觉-语言模型(VLMs)在零样本任务中因视觉增强技术引入的背景噪声和局部细节过度关注问题,从而影响全局语义理解。解决方案的关键在于提出一种基于注意力的选择(ABS)方法,通过在原始图像和特征空间中应用注意力引导的裁剪,并结合策略性特征选择来补充全局语义信息,同时引入软匹配技术以优化与大语言模型(LLM)生成描述的对齐效果。

链接: https://arxiv.org/abs/2505.13233
作者: Lincan Cai,Jingxuan Kang,Shuang Li,Wenxuan Ma,Binhui Xie,Zhida Qin,Jian Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in alignment with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can inevitably introduce background artifacts and cause models to overly focus on local details, compromising global semantic understanding. To address these issues, we propose an \textbfAttention-\textbfBased \textbfSelection (\textbfABS) method from local details to global context, which applies attention-guided cropping in both raw images and feature space, supplement global semantic information through strategic feature selection. Additionally, we introduce a soft matching technique to effectively filter LLM descriptions for better alignment. \textbfABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, \textbfABS is training-free and even rivals few-shot and test-time adaptation methods. Our code is available at \hrefthis https URL\textcolordarkgreenthis https URL.
zh

[CV-26] StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment

【速读】:该论文旨在解决零样本模型在微调过程中因数据规模较小而出现的鲁棒性下降问题,特别是模型容易学习到对人类而言不相关的特征(如背景或纹理)的问题。其解决方案的关键在于提出一种名为StarFT(Spurious Textual Alignment Regularization)的新框架,通过引入一种正则化方法,将注入了虚假特征的标签输出分布与原始零样本模型进行对齐,从而防止模型进一步提取无关特征,增强模型的鲁棒性。

链接: https://arxiv.org/abs/2505.13232
作者: Younghyun Kim,Jongheon Jeong,Sangkyung Kwak,Kyungmin Lee,Juho Lee,Jinwoo Shin
机构: Samsung(三星); Korea University(韩国高丽大学); General Robotics(通用机器人); KAIST(韩国科学技术院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning robust representations from data often requires scale, which has led to the success of recent zero-shot models such as CLIP. However, the obtained robustness can easily be deteriorated when these models are fine-tuned on other downstream tasks (e.g., of smaller scales). Previous works often interpret this phenomenon in the context of domain shift, developing fine-tuning methods that aim to preserve the original domain as much as possible. However, in a different context, fine-tuned models with limited data are also prone to learning features that are spurious to humans, such as background or texture. In this paper, we propose StarFT (Spurious Textual Alignment Regularization), a novel framework for fine-tuning zero-shot models to enhance robustness by preventing them from learning spuriosity. We introduce a regularization that aligns the output distribution for spuriosity-injected labels with the original zero-shot model, ensuring that the model is not induced to extract irrelevant features further from these this http URL leverage recent language models to get such spuriosity-injected labels by generating alternative textual descriptions that highlight potentially confounding this http URL experiments validate the robust generalization of StarFT and its emerging properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts both worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario, where other robust fine-tuning baselines show even degraded performance.
zh

[CV-27] Automatic Complementary Separation Pruning Toward Lightweight CNNs

【速读】:该论文旨在解决卷积神经网络中高效且自动化剪枝的问题,以在保持网络性能的同时减少计算成本。其解决方案的关键在于提出一种名为自动互补分离剪枝(Automatic Complementary Separation Pruning, ACSP)的新方法,该方法结合了结构化剪枝和基于激活的剪枝的优势,通过构建图空间来编码每个组件对所有类别对的分离能力,并利用互补选择原则和聚类算法确保所选组件具有多样性和互补性,从而降低冗余并维持高性能。此外,ACSP通过膝点查找算法自动确定每层最优组件子集,无需用户定义剪枝比例,实现了完全自动化。

链接: https://arxiv.org/abs/2505.13225
作者: David Levin,Gonen Singer
机构: Bar-Ilan University (巴伊兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we present Automatic Complementary Separation Pruning (ACSP), a novel and fully automated pruning method for convolutional neural networks. ACSP integrates the strengths of both structured pruning and activation-based pruning, enabling the efficient removal of entire components such as neurons and channels while leveraging activations to identify and retain the most relevant components. Our approach is designed specifically for supervised learning tasks, where we construct a graph space that encodes the separation capabilities of each component with respect to all class pairs. By employing complementary selection principles and utilizing a clustering algorithm, ACSP ensures that the selected components maintain diverse and complementary separation capabilities, reducing redundancy and maintaining high network performance. The method automatically determines the optimal subset of components in each layer, utilizing a knee-finding algorithm to select the minimal subset that preserves performance without requiring user-defined pruning volumes. Extensive experiments on multiple architectures, including VGG-16, ResNet-50, and MobileNet-V2, across datasets like CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrate that ACSP achieves competitive accuracy compared to other methods while significantly reducing computational costs. This fully automated approach not only enhances scalability but also makes ACSP especially practical for real-world deployment by eliminating the need for manually defining the pruning volume.
zh

[CV-28] Swin DiT: Diffusion Transformer using Pseudo Shifted Windows

【速读】:该论文旨在解决扩散变换器(Diffusion Transformers, DiTs)在处理高分辨率图像时面临的计算成本过高以及全局信息建模冗余的问题。传统DiTs通过堆叠序列的各向同性全局信息建模变换器来生成图像,但这种结构在高分辨率场景下效率低下。论文提出的关键解决方案是Pseudo Shifted Window Attention (PSWA),该方法通过窗口注意力实现中间层次的全局-局部信息交互,并利用高频桥接分支模拟位移窗口操作,从而有效减少全局模型冗余并补充必要的全局与高频信息。此外,论文还提出了Progressive Coverage Channel Allocation (PCCA)策略,在不增加额外计算成本的情况下捕捉高阶注意力相似性。

链接: https://arxiv.org/abs/2505.13219
作者: Jiafu Wu,Yabiao Wang,Jian Li,Jinlong Peng,Yun Cao,Chengjie Wang,Jiangning Zhang
机构: YouTu Lab (优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) achieve remarkable performance within the domain of image generation through the incorporation of the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global information modeling transformers, which face significant computational cost when processing high-resolution images. We empirically analyze that latent space image generation does not exhibit a strong dependence on global information as traditionally assumed. Most of the layers in the model demonstrate redundancy in global computation. In addition, conventional attention mechanisms exhibit low-frequency inertia issues. To address these issues, we propose \textbfPseudo \textbfShifted \textbfWindow \textbfAttention (PSWA), which fundamentally mitigates global model redundancy. PSWA achieves intermediate global-local information interaction through window attention, while employing a high-frequency bridging branch to simulate shifted window operations, supplementing appropriate global and high-frequency information. Furthermore, we propose the Progressive Coverage Channel Allocation(PCCA) strategy that captures high-order attention similarity without additional computational cost. Building upon all of them, we propose a series of Pseudo \textbfShifted \textbfWindow DiTs (\textbfSwin DiT), accompanied by extensive experiments demonstrating their superior performance. For example, our proposed Swin-DiT-L achieves a 54% \uparrow FID improvement over DiT-XL/2 while requiring less computational. this https URL
zh

[CV-29] Hybrid 3D-4D Gaussian Splatting for Fast Dynamic Scene Representation

【速读】:该论文试图解决动态三维场景重建中由于冗余分配四维高斯(4D Gaussians)到静态区域而导致的计算和内存开销大以及图像质量下降的问题。其解决方案的关键在于提出了一种混合三维-四维高斯点云渲染(3D-4D Gaussian Splatting, 3D-4DGS)框架,该框架通过自适应地用三维高斯(3D Gaussians)表示静态区域,同时保留四维高斯用于动态元素,从而显著减少参数数量并提升计算效率。

链接: https://arxiv.org/abs/2505.13215
作者: Seungjun Oh,Younggeun Lee,Hyejin Jeon,Eunbyung Park
机构: Sungkyunkwan University (成均馆大学); Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Recent advancements in dynamic 3D scene reconstruction have shown promising results, enabling high-fidelity 3D novel view synthesis with improved temporal consistency. Among these, 4D Gaussian Splatting (4DGS) has emerged as an appealing approach due to its ability to model high-fidelity spatial and temporal variations. However, existing methods suffer from substantial computational and memory overhead due to the redundant allocation of 4D Gaussians to static regions, which can also degrade image quality. In this work, we introduce hybrid 3D-4D Gaussian Splatting (3D-4DGS), a novel framework that adaptively represents static regions with 3D Gaussians while reserving 4D Gaussians for dynamic elements. Our method begins with a fully 4D Gaussian representation and iteratively converts temporally invariant Gaussians into 3D, significantly reducing the number of parameters and improving computational efficiency. Meanwhile, dynamic Gaussians retain their full 4D representation, capturing complex motions with high fidelity. Our approach achieves significantly faster training times compared to baseline 4D Gaussian Splatting methods while maintaining or improving the visual quality.
zh

[CV-30] RB-SCD: A New Benchmark for Semantic Change Detection of Roads and Bridges in Traffic Scenes

【速读】:该论文旨在解决道路和桥梁等交通基础设施在建设、改造和拆除等细粒度语义变化检测的问题,现有方法因缺乏高质量标注数据而难以提取精确的语义变化信息。其解决方案的关键在于提出了一种新的多模态频域驱动变化检测框架(Multimodal Frequency-Driven Change Detector, MFDCD),该框架通过动态频域耦合器(Dynamic Frequency Coupler, DFC)融合层次化视觉特征与小波频域成分,并利用文本频域滤波器(Textual Frequency Filter, TFF)将CLIP生成的文本特征转换至频域并进行图基过滤,从而提升变化检测的准确性。

链接: https://arxiv.org/abs/2505.13212
作者: Qingling Shu,Sibao Chen,Zhihui You,Wei Lu,Jin Tang,Bin Luo
机构: Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate detection of changes in roads and bridges, such as construction, renovation, and demolition, is essential for urban planning and traffic management. However, existing methods often struggle to extract fine-grained semantic change information due to the lack of high-quality annotated datasets in traffic scenarios. To address this, we introduce the Road and Bridge Semantic Change Detection (RB-SCD) dataset, a comprehensive benchmark comprising 260 pairs of high-resolution remote sensing images from diverse cities and countries. RB-SCD captures 11 types of semantic changes across varied road and bridge structures, enabling detailed structural and functional analysis. Building on this dataset, we propose a novel framework, Multimodal Frequency-Driven Change Detector (MFDCD), which integrates multimodal features in the frequency domain. MFDCD includes a Dynamic Frequency Coupler (DFC) that fuses hierarchical visual features with wavelet-based frequency components, and a Textual Frequency Filter (TFF) that transforms CLIP-derived textual features into the frequency domain and applies graph-based filtering. Experimental results on RB-SCD and three public benchmarks demonstrate the effectiveness of our approach.
zh

[CV-31] MAGI-1: Autoregressive Video Generation at Scale

【速读】:该论文提出MAGI-1,旨在解决生成高质量、长时间视频的挑战,特别是在图像到视频(I2V)任务中实现高时间一致性和可扩展性。其关键解决方案是通过自回归预测一系列固定长度的视频块(video chunks),并训练模型以消除随时间单调增加的块内噪声,从而实现因果时序建模和流式生成。此外,MAGI-1通过分块提示(chunk-wise prompting)支持可控生成,并通过保持恒定的峰值推理成本实现了实时、内存高效的部署。

链接: https://arxiv.org/abs/2505.13211
作者: Sand.ai,Hansi Teng,Hongyu Jia,Lei Sun,Lingzhi Li,Maolin Li,Mingqiu Tang,Shuai Han,Tianning Zhang,W.Q. Zhang,Weifeng Luo,Xiaoyang Kang,Yuchen Sun,Yue Cao,Yunpeng Huang,Yutong Lin,Yuxin Fang,Zewei Tao,Zheng Zhang,Zhongshu Wang,Zixun Liu,Dai Shi,Guoli Su,Hanwen Sun,Hong Pan,Jie Wang,Jiexin Sheng,Min Cui,Min Hu,Ming Yan,Shucheng Yin,Siran Zhang,Tingting Liu,Xianping Yin,Xiaoyu Yang,Xin Song,Xuan Hu,Yankai Zhang,Yuqiao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present MAGI-1, a world model that generates videos by autoregressively predicting a sequence of video chunks, defined as fixed-length segments of consecutive frames. Trained to denoise per-chunk noise that increases monotonically over time, MAGI-1 enables causal temporal modeling and naturally supports streaming generation. It achieves strong performance on image-to-video (I2V) tasks conditioned on text instructions, providing high temporal consistency and scalability, which are made possible by several algorithmic innovations and a dedicated infrastructure stack. MAGI-1 facilitates controllable generation via chunk-wise prompting and supports real-time, memory-efficient deployment by maintaining constant peak inference cost, regardless of video length. The largest variant of MAGI-1 comprises 24 billion parameters and supports context lengths of up to 4 million tokens, demonstrating the scalability and robustness of our approach. The code and models are available at this https URL and this https URL. The product can be accessed at this https URL.
zh

[CV-32] MatPredict: a dataset and benchmark for learning material properties of diverse indoor objects

【速读】:该论文旨在解决从相机图像中确定材料属性的问题,以增强室内环境中复杂物体的识别能力,这对消费机器人应用具有重要意义。解决方案的关键在于构建MatPredict数据集,该数据集结合了Replica数据集的高质量合成物体与MatSynth数据集的材料属性类别,生成具有多样化材料属性的物体,并通过不同光照和相机位置设置提供场景变化,从而支持基于视觉图像推断材料属性的基准测试。

链接: https://arxiv.org/abs/2505.13201
作者: Yuzhen Chen,Hojun Son,Arpan Kusari
机构: University of Michigan Transportation Research Institute (密歇根大学交通研究所); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Determining material properties from camera images can expand the ability to identify complex objects in indoor environments, which is valuable for consumer robotics applications. To support this, we introduce MatPredict, a dataset that combines the high-quality synthetic objects from Replica dataset with MatSynth dataset’s material properties classes - to create objects with diverse material properties. We select 3D meshes of specific foreground objects and render them with different material properties. In total, we generate \textbf18 commonly occurring objects with \textbf14 different materials. We showcase how we provide variability in terms of lighting and camera placement for these objects. Next, we provide a benchmark for inferring material properties from visual images using these perturbed models in the scene, discussing the specific neural network models involved and their performance based on different image comparison metrics. By accurately simulating light interactions with different materials, we can enhance realism, which is crucial for training models effectively through large-scale simulations. This research aims to revolutionize perception in consumer robotics. The dataset is provided \hrefthis https URLhere and the code is provided \hrefthis https URLhere.
zh

[CV-33] Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

【速读】:该论文旨在解决现有硬注意力模型(如Recurrence Model of Visual Attention, RAM 和 Deep Recurrent Attention Model, DRAM)未能建模人类视觉系统的层次结构,导致注意力机制在固定和快速眼动之间失衡的问题。其解决方案的关键在于提出一种多层级循环注意力模型(Multi-Level Recurrent Attention Model, MRAM),通过在两个循环层中解耦“瞥视位置生成”与“任务执行”的功能,从而实现固定与快速眼动之间的平衡行为,使注意力动态更接近人类眼动模式。

链接: https://arxiv.org/abs/2505.13191
作者: Pengcheng Pan,Yonekura Shogo,Yasuo Kuniyoshi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models like the Recurrent Model of Visual Attention (RAM) and Deep Recurrent Attention Model (DRAM) failed to model the hierarchy of human vision system, that compromise on the visual exploration dynamics. As a result, they tend to produce attention that are either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose a Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling the function of glimpse location generation and task execution in two recurrent layers, MRAM emergent a balanced behavior between fixation and saccadic movement. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM and DRAM baselines on standard image classification benchmarks.
zh

[CV-34] FlowCut: Unsupervised Video Instance Segmentation via Temporal Mask Matching

【速读】:该论文试图解决无监督视频实例分割(unsupervised video instance segmentation)问题,即在没有人工标注的情况下,为视频中的每个对象生成精确的实例分割掩码。解决方案的关键在于提出一种三阶段框架——FlowCut,通过利用图像和光流特征的关联性生成伪实例掩码,在时间上对齐帧间的伪标签以构建高质量的视频片段,并最终利用已有的视频数据集进行模型训练,从而实现高性能的视频分割。

链接: https://arxiv.org/abs/2505.13174
作者: Alp Eren Sari,Paolo Favaro
机构: University of Bern (伯尔尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose FlowCut, a simple and capable method for unsupervised video instance segmentation consisting of a three-stage framework to construct a high-quality video dataset with pseudo labels. To our knowledge, our work is the first attempt to curate a video dataset with pseudo-labels for unsupervised video instance segmentation. In the first stage, we generate pseudo-instance masks by exploiting the affinities of features from both images and optical flows. In the second stage, we construct short video segments containing high-quality, consistent pseudo-instance masks by temporally matching them across the frames. In the third stage, we use the YouTubeVIS-2021 video dataset to extract our training instance segmentation set, and then train a video segmentation model. FlowCut achieves state-of-the-art performance on the YouTubeVIS-2019, YouTubeVIS-2021, DAVIS-2017, and DAVIS-2017 Motion benchmarks.
zh

[CV-35] CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

【速读】:该论文试图解决3D人体运动预测中密度估计计算耗时过长的问题,传统方法的推理时间往往超过预测时间范围。其解决方案的关键在于提出一种基于流的生成方法——CacheFlow,该方法利用无条件流生成模型将高斯混合分布转换为未来运动的密度,并预先计算和缓存结果。在条件预测阶段,通过一个轻量级模型将历史轨迹映射到高斯混合中的样本,从而显著减少计算开销,实现高效的密度估计。

链接: https://arxiv.org/abs/2505.13140
作者: Takahiro Maeda,Jinkun Cao,Norimichi Ukita,Kris Kitani
机构: Toyota Technological Institute (丰田技术研究所); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models will be publicly available.
zh

[CV-36] Learning to Adapt to Position Bias in Vision Transformer Classifiers

【速读】:该论文试图解决位置信息在图像分类任务中的作用及其对Vision Transformers性能的影响问题。研究指出,位置信息在不同数据集中具有不同的偏置水平,而这种位置偏置对模型性能至关重要。解决方案的关键在于提出Position-SHAP,一种通过扩展SHAP方法以适应位置嵌入的直接测量位置偏置的方法,并进一步提出Auto-PE,一种单参数的位置嵌入扩展机制,允许位置嵌入调节其范数,从而实现对位置信息的遗忘。该方法结合现有位置嵌入,在分类数据集上实现了与或优于现有方法的准确率。

链接: https://arxiv.org/abs/2505.13137
作者: Robert-Jan Bruintjes,Jan van Gemert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:How discriminative position information is for image classification depends on the data. On the one hand, the camera position is arbitrary and objects can appear anywhere in the image, arguing for translation invariance. At the same time, position information is key for exploiting capture/center bias, and scene layout, e.g.: the sky is up. We show that position bias, the level to which a dataset is more easily solved when positional information on input features is used, plays a crucial role in the performance of Vision Transformers image classifiers. To investigate, we propose Position-SHAP, a direct measure of position bias by extending SHAP to work with position embeddings. We show various levels of position bias in different datasets, and find that the optimal choice of position embedding depends on the position bias apparent in the dataset. We therefore propose Auto-PE, a single-parameter position embedding extension, which allows the position embedding to modulate its norm, enabling the unlearning of position information. Auto-PE combines with existing PEs to match or improve accuracy on classification datasets.
zh

[CV-37] Adaptive Image Restoration for Video Surveillance: A Real-Time Approach

【速读】:该论文试图解决视频监控中图像退化问题,特别是在检测、分割、识别、监控和自动化解决方案中,由于雨、雾、光照等因素导致的图像质量下降问题。其关键解决方案是利用基于ResNet_50的迁移学习方法,开发一种能够自动识别图像中退化类型并参考相应处理方法的模型,从而实现实时图像恢复。

链接: https://arxiv.org/abs/2505.13130
作者: Muhammad Awais Amin,Adama Ilboudo,Abdul Samad bin Shahid,Amjad Ali,Waqas Haider Khan Bangyal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:One of the major challenges in the field of computer vision especially for detection, segmentation, recognition, monitoring, and automated solutions, is the quality of images. Image degradation, often caused by factors such as rain, fog, lighting, etc., has a negative impact on automated this http URL, several image restoration solutions exist, including restoration models for single degradation and restoration models for multiple degradations. However, these solutions are not suitable for real-time processing. In this study, the aim was to develop a real-time image restoration solution for video surveillance. To achieve this, using transfer learning with ResNet_50, we developed a model for automatically identifying the types of degradation present in an image to reference the necessary treatment(s) for image restoration. Our solution has the advantage of being flexible and scalable.
zh

[CV-38] Just Dance with π! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection

【速读】:该论文旨在解决弱监督视频异常检测(VAD)中由于仅依赖RGB时空特征而导致的可靠性不足问题,特别是在区分视觉相似事件(如盗窃行为)时表现有限。其解决方案的关键在于引入多模态诱导框架“PI-VAD”,通过融合五种额外模态(包括姿态、深度、全景分割、光流和语言线索)来增强RGB表示,从而提升模型对复杂现实场景的适应能力。该框架包含两个插件模块:伪模态生成模块和跨模态诱导模块,通过异常感知的辅助任务生成模态特定的原型表示,并将多模态信息融入RGB特征中,最终在不增加推理阶段计算开销的情况下实现最优性能。

链接: https://arxiv.org/abs/2505.13123
作者: Snehashis Majhi,Giacomo D’Amicantonio,Antitza Dantcheva,Quan Kong,Lorenzo Garattoni,Gianpiero Francesca,Egor Bondarev,Francois Bremond
机构: INRIA; Côte d’Azur University; Woven by Toyota; Toyota Motor Europe; Eindhoven University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Weakly-supervised methods for video anomaly detection (VAD) are conventionally based merely on RGB spatio-temporal features, which continues to limit their reliability in real-world scenarios. This is due to the fact that RGB-features are not sufficiently distinctive in setting apart categories such as shoplifting from visually similar events. Therefore, towards robust complex real-world VAD, it is essential to augment RGB spatio-temporal features by additional modalities. Motivated by this, we introduce the Poly-modal Induced framework for VAD: “PI-VAD”, a novel approach that augments RGB representations by five additional modalities. Specifically, the modalities include sensitivity to fine-grained motion (Pose), three dimensional scene and entity representation (Depth), surrounding objects (Panoptic masks), global motion (optical flow), as well as language cues (VLM). Each modality represents an axis of a polygon, streamlined to add salient cues to RGB. PI-VAD includes two plug-in modules, namely Pseudo-modality Generation module and Cross Modal Induction module, which generate modality-specific prototypical representation and, thereby, induce multi-modal information into RGB cues. These modules operate by performing anomaly-aware auxiliary tasks and necessitate five modality backbones – only during training. Notably, PI-VAD achieves state-of-the-art accuracy on three prominent VAD datasets encompassing real-world scenarios, without requiring the computational overhead of five modality backbones at inference.
zh

[CV-39] ARIW-Framework: Adaptive Robust Iterative Watermarking Framework

【速读】:该论文试图解决生成式图像内容版权保护中的关键安全挑战,特别是在大规模模型迅速发展的背景下,传统深度学习水印技术在视觉质量、鲁棒性和泛化能力方面存在局限。解决方案的关键在于提出一种自适应鲁棒迭代水印框架(ARIW-Framework),通过迭代方法优化编码器以生成鲁棒的残差,并引入噪声层和解码器计算不同噪声攻击下的鲁棒性权重,同时利用图像梯度确定每个像素位置的嵌入强度,从而在保持卓越鲁棒性和泛化能力的同时实现高质量的水印图像。

链接: https://arxiv.org/abs/2505.13101
作者: Shaowu Wu,Liting Zeng,Wei Lu,Xiangyang Luo
机构: Sun Yat-sen University (中山大学); State Key Laboratory of Mathematical Engineering and Advanced Computing (数学工程与先进计算国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:With the rapid rise of large models, copyright protection for generated image content has become a critical security challenge. Although deep learning watermarking techniques offer an effective solution for digital image copyright protection, they still face limitations in terms of visual quality, robustness and generalization. To address these issues, this paper proposes an adaptive robust iterative watermarking framework (ARIW-Framework) that achieves high-quality watermarked images while maintaining exceptional robustness and generalization performance. Specifically, we introduce an iterative approach to optimize the encoder for generating robust residuals. The encoder incorporates noise layers and a decoder to compute robustness weights for residuals under various noise attacks. By employing a parallel optimization strategy, the framework enhances robustness against multiple types of noise attacks. Furthermore, we leverage image gradients to determine the embedding strength at each pixel location, significantly improving the visual quality of the watermarked images. Extensive experiments demonstrate that the proposed method achieves superior visual quality while exhibiting remarkable robustness and generalization against noise attacks.
zh

[CV-40] Industry-focused Synthetic Segmentation Pre-training

【速读】:该论文试图解决工业应用中实例分割模型训练面临的两大挑战:一是法律和伦理限制,例如ImageNet禁止商业使用;二是由于网络图像与工业图像之间的领域差异导致的迁移能力有限。为了解决这些问题,论文提出了一种基于公式驱动监督学习(Formula-Driven Supervised Learning, FDSL)的合成预训练数据集InsCore。其关键在于无需依赖真实图像或人工标注,通过生成具有工业数据关键特征的完全标注实例分割图像,从而构建适用于工业场景的视觉基础模型。

链接: https://arxiv.org/abs/2505.13099
作者: Shinichi Mae,Ryosuke Yamada,Hirokatsu Kataoka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pre-training on real-image datasets has been widely proven effective for improving instance segmentation. However, industrial applications face two key challenges: (1) legal and ethical restrictions, such as ImageNet’s prohibition of commercial use, and (2) limited transferability due to the domain gap between web images and industrial imagery. Even recent vision foundation models, including the segment anything model (SAM), show notable performance degradation in industrial settings. These challenges raise critical questions: Can we build a vision foundation model for industrial applications without relying on real images or manual annotations? And can such models outperform even fine-tuned SAM on industrial datasets? To address these questions, we propose the Instance Core Segmentation Dataset (InsCore), a synthetic pre-training dataset based on formula-driven supervised learning (FDSL). InsCore generates fully annotated instance segmentation images that reflect key characteristics of industrial data, including complex occlusions, dense hierarchical masks, and diverse non-rigid shapes, distinct from typical web imagery. Unlike previous methods, InsCore requires neither real images nor human annotations. Experiments on five industrial datasets show that models pre-trained with InsCore outperform those trained on COCO and ImageNet-21k, as well as fine-tuned SAM, achieving an average improvement of 6.2 points in instance segmentation performance. This result is achieved using only 100k synthetic images, more than 100 times fewer than the 11 million images in SAM’s SA-1B dataset, demonstrating the data efficiency of our approach. These findings position InsCore as a practical and license-free vision foundation model for industrial applications.
zh

[CV-41] ouch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction

【速读】:该论文试图解决现有3D扩散模型在捕捉复杂形状局部细节方面的不足,以及受限于遮挡和光照条件的问题。其解决方案的关键在于引入触觉图像以获取局部3D信息,并提出Touch2Shape模型,该模型通过触觉条件扩散模型实现目标形状的探索与重建。具体而言,模型包含一个触觉嵌入模块用于生成紧凑表示,以及一个触觉形状融合模块用于优化重建结果;同时,结合扩散模型与强化学习,设计了一种新颖的奖励机制来训练触觉探索策略,从而提升重建性能。

链接: https://arxiv.org/abs/2505.13091
作者: Yuanbo Wang,Zhaoxuan Zhang,Jiajin Qiu,Dilong Sun,Zhengyu Meng,Xiaopeng Wei,Xin Yang
机构: Dalian University of Technology (大连理工大学); Nanjing University of Posts and Telecommunications (南京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Diffusion models have made breakthroughs in 3D generation tasks. Current 3D diffusion models focus on reconstructing target shape from images or a set of partial observations. While excelling in global context understanding, they struggle to capture the local details of complex shapes and limited to the occlusion and lighting conditions. To overcome these limitations, we utilize tactile images to capture the local 3D information and propose a Touch2Shape model, which leverages a touch-conditioned diffusion model to explore and reconstruct the target shape from touch. For shape reconstruction, we have developed a touch embedding module to condition the diffusion model in creating a compact representation and a touch shape fusion module to refine the reconstructed shape. For shape exploration, we combine the diffusion model with reinforcement learning to train a policy. This involves using the generated latent vector from the diffusion model to guide the touch exploration policy training through a novel reward design. Experiments validate the reconstruction quality thorough both qualitatively and quantitative analysis, and our touch exploration policy further boosts reconstruction performance.
zh

[CV-42] Cross-modal feature fusion for robust point cloud registration with ambiguous geometry

【速读】:该论文旨在解决点云配准中因仅依赖几何信息而导致的对称相似区域或平面结构等几何模糊场景下配准效果不佳的问题。其解决方案的关键在于提出一种新颖的跨模态特征融合方法(CoFF),通过将点云几何信息与RGB图像的辐射信息进行两阶段融合,提升点云配准的准确性与鲁棒性,具体包括像素级图像特征到3D点云的映射以及图像块特征与超点特征的结合,从而优化粗匹配并实现精确的对应关系建立。

链接: https://arxiv.org/abs/2505.13088
作者: Zhaoyi Wang,Shengyu Huang,Jemil Avers Butt,Yuanzhou Cai,Matej Varga,Andreas Wieser
机构: ETH Zürich, Institute of Geodesy and Photogrammetry (ETH Zurich, 地球测量与摄影测量研究所); Atlas optimization GmbH (Atlas优化有限公司); University of Zürich (苏黎世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear in the ISPRS Journal of Photogrammetry and Remote Sensing. 19 pages, 14 figures

点击查看摘要

Abstract:Point cloud registration has seen significant advancements with the application of deep learning techniques. However, existing approaches often overlook the potential of integrating radiometric information from RGB images. This limitation reduces their effectiveness in aligning point clouds pairs, especially in regions where geometric data alone is insufficient. When used effectively, radiometric information can enhance the registration process by providing context that is missing from purely geometric data. In this paper, we propose CoFF, a novel Cross-modal Feature Fusion method that utilizes both point cloud geometry and RGB images for pairwise point cloud registration. Assuming that the co-registration between point clouds and RGB images is available, CoFF explicitly addresses the challenges where geometric information alone is unclear, such as in regions with symmetric similarity or planar structures, through a two-stage fusion of 3D point cloud features and 2D image features. It incorporates a cross-modal feature fusion module that assigns pixel-wise image features to 3D input point clouds to enhance learned 3D point features, and integrates patch-wise image features with superpoint features to improve the quality of coarse matching. This is followed by a coarse-to-fine matching module that accurately establishes correspondences using the fused features. We extensively evaluate CoFF on four common datasets: 3DMatch, 3DLoMatch, IndoorLRS, and the recently released ScanNet++ datasets. In addition, we assess CoFF on specific subset datasets containing geometrically ambiguous cases. Our experimental results demonstrate that CoFF achieves state-of-the-art registration performance across all benchmarks, including remarkable registration recalls of 95.9% and 81.6% on the widely-used 3DMatch and 3DLoMatch datasets, respectively…(Truncated to fit arXiv abstract length)
zh

[CV-43] Walking the Tightrope: Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在非平稳强化学习微调(Reinforcement Fine-Tuning, RFT)过程中出现的有害概念漂移(Concept Drift)问题,特别是在链式思维(Chain-of-Thought, CoT)推理中,推理标记分布的不可预测演化会导致最终预测中的显著偏差。解决方案的关键在于建立概念漂移理论与RFT过程之间的理论桥梁,将CoT的自回归标记流形式化为经历任意时间偏移的非平稳分布,并提出一种基于反事实感知的RFT方法——反事实偏好优化(Counterfactual Preference Optimization, CPO),通过概念图增强的大型语言模型专家生成反事实推理轨迹,系统地分离有益的分布适应与有害的概念漂移,从而实现非平稳环境下的稳定RFT。

链接: https://arxiv.org/abs/2505.13081
作者: Xiaoyu Yang,Jie Lu,En Yu
机构: Australian Artificial Intelligence Institute (AAII)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 5figures

点击查看摘要

Abstract:This paper uncovers a critical yet overlooked phenomenon in multi-modal large language models (MLLMs): detrimental concept drift within chain-of-thought (CoT) reasoning during non-stationary reinforcement fine-tuning (RFT), where reasoning token distributions evolve unpredictably, thereby introducing significant biases in final predictions. To address this, we are pioneers in establishing the theoretical bridge between concept drift theory and RFT processes by formalizing CoT’s autoregressive token streams as non-stationary distributions undergoing arbitrary temporal shifts. Leveraging this framework, we propose a novel counterfact-aware RFT that systematically decouples beneficial distribution adaptation from harmful concept drift through concept graph-empowered LLM experts generating counterfactual reasoning trajectories. Our solution, Counterfactual Preference Optimization (CPO), enables stable RFT in non-stationary environments, particularly within the medical domain, through custom-tuning of counterfactual-aware preference alignment. Extensive experiments demonstrate our superior performance of robustness, generalization and coordination within RFT. Besides, we also contributed a large-scale dataset CXR-CounterFact (CCF), comprising 320,416 meticulously curated counterfactual reasoning trajectories derived from MIMIC-CXR. Our code and data are public.
zh

[CV-44] 3D Visual Illusion Depth Estimation

【速读】:该论文试图解决机器视觉系统在面对3D视觉错觉(3D visual illusion)时出现的深度估计错误问题,特别是单目和双目深度估计中的误判现象。解决方案的关键在于提出一种鲁棒的深度估计框架,该框架利用视觉-语言模型(vision-language model)中的常识知识,自适应地选择来自双目视差和单目深度的可靠信息,从而提升深度估计的准确性。

链接: https://arxiv.org/abs/2505.13061
作者: CHengtang Yao,Zhidan Liu,Jiaxi Zeng,Lidong Yu,Yuwei Wu,Yunde Jia
机构: Beijing Key Laboratory of Intelligent Information Technology (北京关键信息技术重点实验室); School of Computer Science & Technology, Beijing Institute of Technology (北京理工大学计算机科学与技术学院); Guangdong Laboratory of Machine Perception and Intelligent Computing (广东省机器感知与智能计算实验室); Shenzhen MSU-BIT University (深圳美中大学-比特大学); NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project: this https URL

点击查看摘要

Abstract:3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a robust depth estimation framework that uses common sense from a vision-language model to adaptively select reliable depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
zh

[CV-45] RGB-to-Polarization Estimation: A New Task and Benchmark Study

【速读】:该论文试图解决从标准RGB图像中估计偏振信息的问题(RGB-to-polarization image estimation),旨在克服传统获取偏振图像所需附加光学组件所带来的成本与复杂性问题。解决方案的关键在于建立首个全面的基准测试,通过利用现有的偏振数据集并评估多种先进的深度学习模型,系统分析不同模型家族在该任务中的性能表现与优缺点,从而为未来的研究提供参考与方向。

链接: https://arxiv.org/abs/2505.13050
作者: Beibei Lin,Zifeng Yuan,Tingting Chen
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridge this gap, we introduce a new task: RGB-to-polarization image estimation, which aims to infer polarization information directly from RGB images. In this work, we establish the first comprehensive benchmark for this task by leveraging existing polarization datasets and evaluating a diverse set of state-of-the-art deep learning models, including both restoration-oriented and generative architectures. Through extensive quantitative and qualitative analysis, our benchmark not only establishes the current performance ceiling of RGB-to-polarization estimation, but also systematically reveals the respective strengths and limitations of different model families – such as direct reconstruction versus generative synthesis, and task-specific training versus large-scale pre-training. In addition, we provide some potential directions for future research on polarization estimation. This benchmark is intended to serve as a foundational resource to facilitate the design and evaluation of future methods for polarization estimation from standard RGB inputs.
zh

[CV-46] A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation

【速读】:该论文旨在解决跨域眼动估计(Cross-domain Gaze Estimation, CDGE)中模型泛化能力不足的问题,即如何将已训练好的眼动估计模型有效迁移到新的目标域。现有方法通常通过提取领域不变特征来缓解特征空间中的领域偏移,但这一策略被广义标签偏移(Generalized Label Shift, GLS)理论证明是不充分的。本文从GLS的角度重新建模跨域问题,将其视为标签和条件偏移问题,并提出了一种GLS校正框架。其关键在于引入基于截断高斯分布的重要性重加权策略,以克服标签偏移校正中的连续性挑战,同时通过概率感知的条件算子差异估计,将重加权源分布嵌入到条件不变学习中,从而提升了模型在不同域间的泛化能力和适用性。

链接: https://arxiv.org/abs/2505.13043
作者: Hao-Ran Yang,Xiaohui Chen,Chuan-Xian Ren
机构: Sun Yat-Sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aiming to generalize the well-trained gaze estimation model to new target domains, Cross-domain Gaze Estimation (CDGE) is developed for real-world application scenarios. Existing CDGE methods typically extract the domain-invariant features to mitigate domain shift in feature space, which is proved insufficient by Generalized Label Shift (GLS) theory. In this paper, we introduce a novel GLS perspective to CDGE and modelize the cross-domain problem by label and conditional shift problem. A GLS correction framework is presented and a feasible realization is proposed, in which a importance reweighting strategy based on truncated Gaussian distribution is introduced to overcome the continuity challenges in label shift correction. To embed the reweighted source distribution to conditional invariant learning, we further derive a probability-aware estimation of conditional operator discrepancy. Extensive experiments on standard CDGE tasks with different backbone models validate the superior generalization capability across domain and applicability on various models of proposed method.
zh

[CV-47] Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields in Efficient CNNs for Fair Medical Image Classification

【速读】:该论文旨在解决医学图像分类任务中两个关键问题:一是现有卷积神经网络(CNN)在捕捉多样化的病灶特征(如微小、协调、小而显著的特征)方面存在效率不足的问题,尤其是在处理不平衡数据时表现受限;二是现有CNN模型的预测结果往往存在不公平或偏倚,这在实际医疗诊断中可能带来高风险。其解决方案的关键在于提出一种新的概念——专家式异质金字塔感受野重参数化(ERoHPRF),通过设计异质金字塔感受野包来有效捕获不同病灶特征,并引入类似专家结构的重参数化技术,以多阶段策略合并参数,从而在保持计算成本和推理速度的同时提升分类性能与公平性。

链接: https://arxiv.org/abs/2505.13039
作者: Xiao Wu,Xiaoqing Zhang,Zunjie Xiao,Lingxi Hu,Risa Higashita,Jiang Liu
机构: Southern University of Science and Technology (南方科技大学); Tomey Corporation (富士胶片公司); Changchun University (长春大学); School of computer science, University of Nottingham Ningbo China (诺丁汉大学宁波分校计算机学院); School of Ophthalmology and Optometry, Wenzhou Medical University (温州医科大学眼视光学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Efficient convolutional neural network (CNN) architecture designs have attracted growing research interests. However, they usually apply single receptive field (RF), small asymmetric RFs, or pyramid RFs to learn different feature representations, still encountering two significant challenges in medical image classification tasks: 1) They have limitations in capturing diverse lesion characteristics efficiently, e.g., tiny, coordination, small and salient, which have unique roles on results, especially imbalanced medical image classification. 2) The predictions generated by those CNNs are often unfair/biased, bringing a high risk by employing them to real-world medical diagnosis conditions. To tackle these issues, we develop a new concept, Expert-Like Reparameterization of Heterogeneous Pyramid Receptive Fields (ERoHPRF), to simultaneously boost medical image classification performance and fairness. This concept aims to mimic the multi-expert consultation mode by applying the well-designed heterogeneous pyramid RF bags to capture different lesion characteristics effectively via convolution operations with multiple heterogeneous kernel sizes. Additionally, ERoHPRF introduces an expert-like structural reparameterization technique to merge its parameters with the two-stage strategy, ensuring competitive computation cost and inference speed through comparisons to a single RF. To manifest the effectiveness and generalization ability of ERoHPRF, we incorporate it into mainstream efficient CNN architectures. The extensive experiments show that our method maintains a better trade-off than state-of-the-art methods in terms of medical image classification, fairness, and computation overhead. The codes of this paper will be released soon.
zh

[CV-48] Anti-Inpainting: A Proactive Defense against Malicious Diffusion-based Inpainters under Unknown Conditions

【速读】:该论文旨在解决扩散模型驱动的恶意图像修复(diffusion-based inpainting)在未知篡改条件下无法有效防御的问题。现有主动防御方法通常仅能在已知条件下提供保护,而无法应对由恶意用户设计的未知篡改条件。其解决方案的关键在于提出一种名为Anti-Inpainting的主动防御方法,该方法通过三重机制实现对未知条件下的有效防护:多层级深度特征提取器用于增强扩散去噪过程中的特征表达,多尺度语义保持数据增强策略提升对抗扰动的跨条件迁移能力,以及基于选择的分布偏差优化策略提高对抗扰动在不同随机种子下的鲁棒性。

链接: https://arxiv.org/abs/2505.13023
作者: Yimao Guo,Zuomin Qu,Wei Lu,Xiangyang Luo
机构: Sun Yat-sen University(中山大学); State Key Laboratory of Mathematical Engineering and Advanced Computing(数学工程与先进计算国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:As diffusion-based malicious image manipulation becomes increasingly prevalent, multiple proactive defense methods are developed to safeguard images against unauthorized tampering. However, most proactive defense methods only can safeguard images against manipulation under known conditions, and fail to protect images from manipulations guided by tampering conditions crafted by malicious users. To tackle this issue, we propose Anti-Inpainting, a proactive defense method that achieves adequate protection under unknown conditions through a triple mechanism to address this challenge. Specifically, a multi-level deep feature extractor is presented to obtain intricate features during the diffusion denoising process to improve protective effectiveness. We design multi-scale semantic-preserving data augmentation to enhance the transferability of adversarial perturbations across unknown conditions by multi-scale transformations while preserving semantic integrity. In addition, we propose a selection-based distribution deviation optimization strategy to improve the protection of adversarial perturbation against manipulation under diverse random seeds. Extensive experiments indicate the proactive defensive performance of Anti-Inpainting against diffusion-based inpainters guided by unknown conditions in InpaintGuardBench and CelebA-HQ. At the same time, we also demonstrate the proposed approach’s robustness under various image purification methods and its transferability across different versions of diffusion models.
zh

[CV-49] A Skull-Adaptive Framework for AI-Based 3D Transcranial Focused Ultrasound Simulation

【速读】:该论文旨在解决经颅聚焦超声(tFUS)在非侵入性脑刺激和治疗中的空间精度与深度目标定位问题,特别是由于人类颅骨的异质性和各向异性导致的超声波前显著失真问题。为实现数据驱动的方法,研究者提出了TFUScapes数据集,这是首个基于T1加权MRI图像生成的高分辨率、大规模tFUS仿真数据集,并开发了DeepTFUS模型,该模型通过结合U-Net主干网络与超声探头感知的条件编码,利用傅里叶编码的位置嵌入和MLP层生成全局探头嵌入,进而通过特征调制、动态卷积和交叉注意力机制与编码器特征融合,从而直接从输入的3D CT体积和探头位置估计归一化压力场。解决方案的关键在于构建高质量的仿真数据集以及设计能够有效建模探头位置与压力场关系的深度学习架构。

链接: https://arxiv.org/abs/2505.12998
作者: Vinkle Srivastav,Juliette Puel,Jonathan Vappou,Elijah Van Houten,Paolo Cabras,Nicolas Padoy
机构: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France; Image Guided Therapy, Pessac, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page is available at this https URL

点击查看摘要

Abstract:Transcranial focused ultrasound (tFUS) is an emerging modality for non-invasive brain stimulation and therapeutic intervention, offering millimeter-scale spatial precision and the ability to target deep brain structures. However, the heterogeneous and anisotropic nature of the human skull introduces significant distortions to the propagating ultrasound wavefront, which require time-consuming patient-specific planning and corrections using numerical solvers for accurate targeting. To enable data-driven approaches in this domain, we introduce TFUScapes, the first large-scale, high-resolution dataset of tFUS simulations through anatomically realistic human skulls derived from T1-weighted MRI images. We have developed a scalable simulation engine pipeline using the k-Wave pseudo-spectral solver, where each simulation returns a steady-state pressure field generated by a focused ultrasound transducer placed at realistic scalp locations. In addition to the dataset, we present DeepTFUS, a deep learning model that estimates normalized pressure fields directly from input 3D CT volumes and transducer position. The model extends a U-Net backbone with transducer-aware conditioning, incorporating Fourier-encoded position embeddings and MLP layers to create global transducer embeddings. These embeddings are fused with U-Net encoder features via feature-wise modulation, dynamic convolutions, and cross-attention mechanisms. The model is trained using a combination of spatially weighted and gradient-sensitive loss functions, enabling it to approximate high-fidelity wavefields. The TFUScapes dataset is publicly released to accelerate research at the intersection of computational acoustics, neurotechnology, and deep learning. The project page is available at this https URL.
zh

[CV-50] Multiscale Adaptive Conflict-Balancing Model For Multimedia Deepfake Detection ICMR

【速读】:该论文旨在解决深度伪造(deepfake)与真实媒体之间界限模糊导致的多媒体可信度下降问题,特别是当前多模态检测方法在模态间存在不平衡学习的问题。其解决方案的关键在于提出一种音视频联合学习方法(MACB-DF),通过对比学习辅助多层次和跨模态融合,以更好地缓解模态冲突与信息忽视,同时设计了一个正交化多模态帕累托模块,以保留单模态信息并解决音频视频编码器中因损失函数优化目标差异导致的梯度冲突。

链接: https://arxiv.org/abs/2505.12966
作者: Zihan Xiong,Xiaohua Wu,Lei Chen,Fangqi Lou
机构: University of Electronic Science and Technology of China(电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages,ICMR accepted

点击查看摘要

Abstract:Advances in computer vision and deep learning have blurred the line between deepfakes and authentic media, undermining multimedia credibility through audio-visual forgery. Current multimodal detection methods remain limited by unbalanced learning between modalities. To tackle this issue, we propose an Audio-Visual Joint Learning Method (MACB-DF) to better mitigate modality conflicts and neglect by leveraging contrastive learning to assist in multi-level and cross-modal fusion, thereby fully balancing and exploiting information from each modality. Additionally, we designed an orthogonalization-multimodal pareto module that preserves unimodal information while addressing gradient conflicts in audio-video encoders caused by differing optimization targets of the loss functions. Extensive experiments and ablation studies conducted on mainstream deepfake datasets demonstrate consistent performance gains of our model across key evaluation metrics, achieving an average accuracy of 95.5% across multiple datasets. Notably, our method exhibits superior cross-dataset generalization capabilities, with absolute improvements of 8.0% and 7.7% in ACC scores over the previous best-performing approach when trained on DFDC and tested on DefakeAVMiT and FakeAVCeleb datasets.
zh

[CV-51] CALM-PDE: Continuous and Adaptive Convolutions for Latent Space Modeling of Time-dependent PDEs

【速读】:该论文旨在解决在密集离散化空间域中求解时变偏微分方程(Partial Differential Equations, PDEs)的计算成本过高的问题。现有方法虽然通过在压缩潜在空间中构建神经代理模型来降低计算复杂度,但基于Transformer的注意力机制在处理不规则采样域时会导致内存消耗增加。相比之下,卷积神经网络虽具备内存效率优势,但仅适用于规则离散化。为克服上述限制,该论文提出的CALM-PDE模型采用了一种基于连续卷积的编码器-解码器架构,其关键在于使用ε邻域约束核,并学习在自适应优化的查询点上应用卷积算子,从而高效求解任意离散化的PDE。

链接: https://arxiv.org/abs/2505.12944
作者: Jan Hagnberger,Daniel Musekamp,Mathias Niepert
机构: University of Stuttgart (斯图加特大学); International Max Planck Research School for Intelligent Systems (IMPRS-IS) (国际马克斯·普朗克智能系统研究学校); Stuttgart Center for Simulation Science (SimTech) (斯图加特仿真科学中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Solving time-dependent Partial Differential Equations (PDEs) using a densely discretized spatial domain is a fundamental problem in various scientific and engineering disciplines, including modeling climate phenomena and fluid dynamics. However, performing these computations directly in the physical space often incurs significant computational costs. To address this issue, several neural surrogate models have been developed that operate in a compressed latent space to solve the PDE. While these approaches reduce computational complexity, they often use Transformer-based attention mechanisms to handle irregularly sampled domains, resulting in increased memory consumption. In contrast, convolutional neural networks allow memory-efficient encoding and decoding but are limited to regular discretizations. Motivated by these considerations, we propose CALM-PDE, a model class that efficiently solves arbitrarily discretized PDEs in a compressed latent space. We introduce a novel continuous convolution-based encoder-decoder architecture that uses an epsilon-neighborhood-constrained kernel and learns to apply the convolution operator to adaptive and optimized query points. We demonstrate the effectiveness of CALM-PDE on a diverse set of PDEs with both regularly and irregularly sampled spatial domains. CALM-PDE is competitive with or outperforms existing baseline methods while offering significant improvements in memory and inference time efficiency compared to Transformer-based methods.
zh

[CV-52] LatentINDIGO: An INN-Guided Latent Diffusion Algorithm for Image Restoration

【速读】:该论文旨在解决图像恢复(Image Restoration, IR)任务中基于潜在扩散模型(Latent Diffusion Models, LDMs)的两个关键问题:一是现有方法依赖于预定义的退化算子,难以处理复杂或未知的退化;二是多数方法在潜在空间中难以提供稳定的引导,并且在每次采样迭代中需将潜在表示转换回像素域进行引导,导致计算和内存开销显著增加。解决方案的关键在于引入一种受小波启发的可逆神经网络(Invertible Neural Network, INN),通过正向变换模拟退化并利用反向变换重建丢失细节,同时将其集成到潜在扩散流程中,提出两种方法:LatentINDIGO-PixelINN 在像素域操作,而 LatentINDIGO-LatentINN 完全在潜在空间中运行以降低复杂度,从而有效提升图像恢复性能。

链接: https://arxiv.org/abs/2505.12935
作者: Di You,Daniel Siromani,Pier Luigi Dragotti
机构: Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Image Processing (TIP)

点击查看摘要

Abstract:There is a growing interest in the use of latent diffusion models (LDMs) for image restoration (IR) tasks due to their ability to model effectively the distribution of natural images. While significant progress has been made, there are still key challenges that need to be addressed. First, many approaches depend on a predefined degradation operator, making them ill-suited for complex or unknown degradations that deviate from standard analytical models. Second, many methods struggle to provide a stable guidance in the latent space and finally most methods convert latent representations back to the pixel domain for guidance at every sampling iteration, which significantly increases computational and memory overhead. To overcome these limitations, we introduce a wavelet-inspired invertible neural network (INN) that simulates degradations through a forward transform and reconstructs lost details via the inverse transform. We further integrate this design into a latent diffusion pipeline through two proposed approaches: LatentINDIGO-PixelINN, which operates in the pixel domain, and LatentINDIGO-LatentINN, which stays fully in the latent space to reduce complexity. Both approaches alternate between updating intermediate latent variables under the guidance of our INN and refining the INN forward model to handle unknown degradations. In addition, a regularization step preserves the proximity of latent variables to the natural image manifold. Experiments demonstrate that our algorithm achieves state-of-the-art performance on synthetic and real-world low-quality images, and can be readily adapted to arbitrary output sizes.
zh

[CV-53] Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Image Corruption

【速读】:该论文试图解决预训练视觉-语言模型(如CLIP)在面对与训练数据分布差异较大的测试数据时,性能下降的问题,特别是传感器退化(sensor degradation)这一现实中的分布偏移问题。现有方法在应对细粒度标签空间或输入图像的不同表现形式等分布偏移时表现良好,但无法有效适应由传感器条件引起的分布偏移。解决方案的关键在于提出一种新的测试时自适应(test-time adaptation, TTA)方法,称为统一性感知的信息平衡TTA(UnInfo),通过引入统一性感知的置信度最大化、信息感知的损失平衡以及来自指数移动平均教师的知识蒸馏,以恢复图像嵌入在统一性上的信息完整性,从而提升模型对传感器退化场景的鲁棒性。

链接: https://arxiv.org/abs/2505.12912
作者: Kazuki Adachi,Shin’ya Yamaguchi,Tomoki Hamagami
机构: NTT, Inc.(NTT公司); Kyoto University(京都大学); Yokohama National University(横滨国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:Pre-trained vision-language models such as contrastive language-image pre-training (CLIP) have demonstrated a remarkable generalizability, which has enabled a wide range of applications represented by zero-shot classification. However, vision-language models still suffer when they face datasets with large gaps from training ones, i.e., distribution shifts. We found that CLIP is especially vulnerable to sensor degradation, a type of realistic distribution shift caused by sensor conditions such as weather, light, or noise. Collecting a new dataset from a test distribution for fine-tuning highly costs since sensor degradation occurs unexpectedly and has a range of variety. Thus, we investigate test-time adaptation (TTA) of zero-shot classification, which enables on-the-fly adaptation to the test distribution with unlabeled test data. Existing TTA methods for CLIP mainly focus on modifying image and text embeddings or predictions to address distribution shifts. Although these methods can adapt to domain shifts, such as fine-grained labels spaces or different renditions in input images, they fail to adapt to distribution shifts caused by sensor degradation. We found that this is because image embeddings are “corrupted” in terms of uniformity, a measure related to the amount of information. To make models robust to sensor degradation, we propose a novel method called uniformity-aware information-balanced TTA (UnInfo). To address the corruption of image embeddings, we introduce uniformity-aware confidence maximization, information-aware loss balancing, and knowledge distillation from the exponential moving average (EMA) teacher. Through experiments, we demonstrate that our UnInfo improves accuracy under sensor degradation by retaining information in terms of uniformity.
zh

[CV-54] HiERO: understanding the hierarchy of human behavior enhances reasoning on egocentric videos

【速读】:该论文试图解决深度学习模型在推理人类活动时面临的挑战,即人类活动的复杂性和变异性使得模型难以有效理解其内在结构。解决方案的关键在于利用未脚本化视频中隐含的层次化活动模式,通过弱监督方法HiERO将视频片段的特征与对应的层次化活动线程进行增强,从而提升对视频内容的上下文、语义和时间推理能力。

链接: https://arxiv.org/abs/2505.12911
作者: Simone Alberto Peirone,Francesca Pistilli,Giuseppe Averta
机构: Politecnico di Torino (都灵理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page this https URL

点击查看摘要

Abstract:Human activities are particularly complex and variable, and this makes challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segments features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO infers contextual, semantic and temporal reasoning with an hierarchical architecture. We prove the potential of our enriched features with multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero shot. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
zh

[CV-55] Dynamic Graph Induced Contour-aware Heat Conduction Network for Event-based Object Detection

【速读】:该论文旨在解决基于事件流(event stream)的物体检测中,现有算法在建模目标轮廓信息和利用多尺度特征方面存在的不足。当前方法主要依赖卷积神经网络或Transformer,前者受限于局部特征提取能力,后者则因自注意力机制导致计算成本过高。为解决这些问题,本文提出了一种动态图诱导的轮廓感知热传导网络(CvHeat-DET),其关键在于通过事件流中的清晰轮廓信息预测热扩散系数,并结合分层结构图特征以增强多尺度特征学习能力。

链接: https://arxiv.org/abs/2505.12908
作者: Xiao Wang,Yu Jin,Lan Chen,Bo Jiang,Lin Zhu,Yonghong Tian,Jin Tang,Bin Luo
机构: Anhui University (安徽大学); Beijing Institute of Technology (北京理工大学); Peng Cheng Laboratory (鹏城实验室); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Event-based Vision Sensors (EVS) have demonstrated significant advantages over traditional RGB frame-based cameras in low-light conditions, high-speed motion capture, and low latency. Consequently, object detection based on EVS has attracted increasing attention from researchers. Current event stream object detection algorithms are typically built upon Convolutional Neural Networks (CNNs) or Transformers, which either capture limited local features using convolutional filters or incur high computational costs due to the utilization of self-attention. Recently proposed vision heat conduction backbone networks have shown a good balance between efficiency and accuracy; however, these models are not specifically designed for event stream data. They exhibit weak capability in modeling object contour information and fail to exploit the benefits of multi-scale features. To address these issues, this paper proposes a novel dynamic graph induced contour-aware heat conduction network for event stream based object detection, termed CvHeat-DET. The proposed model effectively leverages the clear contour information inherent in event streams to predict the thermal diffusivity coefficients within the heat conduction model, and integrates hierarchical structural graph features to enhance feature learning across multiple scales. Extensive experiments on three benchmark datasets for event stream-based object detection fully validated the effectiveness of the proposed model. The source code of this paper will be released on this https URL.
zh

[CV-56] owards Low-Latency Event Stream-based Visual Object Tracking: A Slow-Fast Approach

【速读】:该论文旨在解决传统跟踪算法在低延迟和资源受限环境中性能不足的问题,传统方法通常依赖于低帧率的RGB相机和计算密集型深度神经网络架构。其解决方案的关键在于提出了一种名为SFTrack的新型慢-快跟踪范式,该框架通过从高时间分辨率事件流中进行图表示学习,并将其集成到基于FlashAttention的视觉主干网络中,从而生成高精度慢速跟踪器和高效快速跟踪器,其中快速跟踪器通过轻量级网络设计和单次前向传播生成多个边界框输出以实现低延迟。

链接: https://arxiv.org/abs/2505.12903
作者: Shiao Wang,Xiao Wang,Liye Jin,Bo Jiang,Lin Zhu,Lan Chen,Yonghong Tian,Bin Luo
机构: Anhui University(安徽大学); Beijing Institute of Technology(北京理工大学); Peng Cheng Laboratory(鹏城实验室); Peking University(北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing tracking algorithms typically rely on low-frame-rate RGB cameras coupled with computationally intensive deep neural network architectures to achieve effective tracking. However, such frame-based methods inherently face challenges in achieving low-latency performance and often fail in resource-constrained environments. Visual object tracking using bio-inspired event cameras has emerged as a promising research direction in recent years, offering distinct advantages for low-latency applications. In this paper, we propose a novel Slow-Fast Tracking paradigm that flexibly adapts to different operational requirements, termed SFTrack. The proposed framework supports two complementary modes, i.e., a high-precision slow tracker for scenarios with sufficient computational resources, and an efficient fast tracker tailored for latency-aware, resource-constrained environments. Specifically, our framework first performs graph-based representation learning from high-temporal-resolution event streams, and then integrates the learned graph-structured information into two FlashAttention-based vision backbones, yielding the slow and fast trackers, respectively. The fast tracker achieves low latency through a lightweight network design and by producing multiple bounding box outputs in a single forward pass. Finally, we seamlessly combine both trackers via supervised fine-tuning and further enhance the fast tracker’s performance through a knowledge distillation strategy. Extensive experiments on public benchmarks, including FE240, COESOT, and EventVOT, demonstrate the effectiveness and efficiency of our proposed method across different real-world scenarios. The source code has been released on this https URL.
zh

[CV-57] EPIC: Explanation of Pretrained Image Classification Networks via Prototype

【速读】:该论文试图解决传统可解释人工智能(Explainable AI, XAI)方法在解释能力与灵活性之间的平衡问题。具体而言,后置型(post-hoc)方法虽然兼容多种模型架构,但通常只能提供粗略的决策过程理解;而前置型(ante-hoc)方法虽能通过原型(prototype)提供更直观的解释,但受限于特定架构和训练流程,适用性较差。论文提出的EPIC方法的关键在于,它作为一种后置型方法,在不修改预训练模型架构的前提下,结合了前置型方法中基于原型的解释思想,实现了对模型决策的直观、高质量解释,从而弥补了两类方法之间的差距。

链接: https://arxiv.org/abs/2505.12897
作者: Piotr Borycki,Magdalena Trędowicz,Szymon Janusz,Jacek Tabor,Przemysław Spurek,Arkadiusz Lewicki,Łukasz Struski
机构: Jagiellonian University (亚捷隆大学); University of Information Technology and Management in Rzeszów (热舒夫信息科技与管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Explainable AI (XAI) methods generally fall into two categories. Post-hoc approaches generate explanations for pre-trained models and are compatible with various neural network architectures. These methods often use feature importance visualizations, such as saliency maps, to indicate which input regions influenced the model’s prediction. Unfortunately, they typically offer a coarse understanding of the model’s decision-making process. In contrast, ante-hoc (inherently explainable) methods rely on specially designed model architectures trained from scratch. A notable subclass of these methods provides explanations through prototypes, representative patches extracted from the training data. However, prototype-based approaches have limitations: they require dedicated architectures, involve specialized training procedures, and perform well only on specific datasets. In this work, we propose EPIC (Explanation of Pretrained Image Classification), a novel approach that bridges the gap between these two paradigms. Like post-hoc methods, EPIC operates on pre-trained models without architectural modifications. Simultaneously, it delivers intuitive, prototype-based explanations inspired by ante-hoc techniques. To the best of our knowledge, EPIC is the first post-hoc method capable of fully replicating the core explanatory power of inherently interpretable models. We evaluate EPIC on benchmark datasets commonly used in prototype-based explanations, such as CUB-200-2011 and Stanford Cars, alongside large-scale datasets like ImageNet, typically employed by post-hoc methods. EPIC uses prototypes to explain model decisions, providing a flexible and easy-to-understand tool for creating clear, high-quality explanations.
zh

[CV-58] ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling

【速读】:该论文旨在解决手术室(Operating Room, OR)中计算系统缺乏深度和整体理解能力的问题,以实现精准、安全和有效的干预。现有研究多局限于单一任务,如阶段识别或场景图生成,缺乏广度和泛化能力。其解决方案的关键在于提出ORQA,一个面向OR问题回答的基准和基础多模态模型,通过整合四个公开的OR数据集构建综合性基准,并融合视觉、听觉和结构化数据等多源信号,实现对OR环境的全面建模。此外,还引入了一种渐进式知识蒸馏范式,以生成适应不同速度和内存需求的模型家族,从而推动可扩展、统一的OR建模和多模态外科智能的发展。

链接: https://arxiv.org/abs/2505.12890
作者: Ege Özsoy,Chantal Pellegrini,David Bani-Harouni,Kun Yuan,Matthias Keicher,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The real-world complexity of surgeries necessitates surgeons to have deep and holistic comprehension to ensure precision, safety, and effective interventions. Computational systems are required to have a similar level of comprehension within the operating room. Prior works, limited to single-task efforts like phase recognition or scene graph generation, lack scope and generalizability. In this work, we introduce ORQA, a novel OR question answering benchmark and foundational multimodal model to advance OR intelligence. By unifying all four public OR datasets into a comprehensive benchmark, we enable our approach to concurrently address a diverse range of OR challenges. The proposed multimodal large language model fuses diverse OR signals such as visual, auditory, and structured data, for a holistic modeling of the OR. Finally, we propose a novel, progressive knowledge distillation paradigm, to generate a family of models optimized for different speed and memory requirements. We show the strong performance of ORQA on our proposed benchmark, and its zero-shot generalization, paving the way for scalable, unified OR modeling and significantly advancing multimodal surgical intelligence. We will release our code and data upon acceptance.
zh

[CV-59] nyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks

【速读】:该论文试图解决轻量级视觉-语言模型(Lightweight Vision-Language Models, VLMs)在多模态对齐中的瓶颈问题,即由于语言模型的表征能力受限,导致有效互信息(Effective Mutual Information, EMI)不足,从而影响对齐质量。解决方案的关键在于提出一种名为TinyAlign的新框架,该框架受检索增强生成(Retrieval-Augmented Generation)启发,通过从记忆库中检索相关上下文来丰富多模态输入,从而提升其对齐效果。

链接: https://arxiv.org/abs/2505.12884
作者: Yuanze Hu,Zhaoxin Fan,Xinyu Wang,Gen Li,Ye Qiu,Zhichao Yang,Wenjun Wu,Kejian Wu,Yifan Sun,Xiaotie Deng,Jin Dong
机构: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University (北京航空航天大学未来区块链与隐私计算高精尖中心); Hangzhou International Innovation Institute, Beihang University (杭州国际创新研究院,北京航空航天大学); Xreal; Renmin University (中国人民大学); Peking University (北京大学); Beijing Academy of Blockchain and Edge Computing (BABEC) (北京区块链与边缘计算研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. However, this strategy heavily depends on the intrinsic capabilities of the language model, which can be suboptimal for lightweight models with limited representational capacity. In this work, we investigate this alignment bottleneck through the lens of mutual information, demonstrating that the constrained capacity of the language model inherently limits the Effective Mutual Information (EMI) between multimodal inputs and outputs, thereby compromising alignment quality. To address this challenge, we propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant context from a memory bank to enrich multimodal inputs and enhance their alignment. Extensive empirical evaluations reveal that TinyAlign significantly reduces training loss, accelerates convergence, and enhances task performance. Remarkably, it allows models to achieve baseline-level performance with only 40% of the fine-tuning data, highlighting exceptional data efficiency. Our work thus offers a practical pathway for developing more capable lightweight VLMs while introducing a fresh theoretical lens to better understand and address alignment bottlenecks in constrained multimodal systems.
zh

[CV-60] Unified Cross-modal Translation of Score Images Symbolic Music and Performance Audio

【速读】:该论文旨在解决音乐信息检索中多模态翻译任务的独立建模问题,即以往研究通常针对每个翻译任务训练专用模型,而缺乏一种统一的解决方案。其关键解决方案是提出一个统一的多任务学习框架,通过在大量翻译任务上同时训练通用模型来实现跨模态翻译的协同优化。该方法依赖于两个核心要素:一是构建了一个包含超过1,300小时配对音频-乐谱图像数据的大规模数据集;二是采用统一的分词框架,将乐谱图像、音频、MIDI和MusicXML等不同模态离散化为一系列标记,从而使得单一的编码器-解码器Transformer模型能够将多种跨模态翻译任务视为一个连贯的序列到序列任务进行处理。

链接: https://arxiv.org/abs/2505.12863
作者: Jongmin Jung,Dongmin Kim,Sihun Lee,Seola Cho,Hyungjoon Soh,Irmak Bukey,Chris Donahue,Dasaem Jeong
机构: Sogang University(首尔大学); Seoul National University(首尔国立大学); Carnegie Mellon University(卡内基梅隆大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: Submitted to IEEE Transactions on Audio, Speech and Language Processing (TASLPRO)

点击查看摘要

Abstract:Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between each modality are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. In this paper, we propose a unified approach, where we train a general-purpose model on many translation tasks simultaneously. Two key factors make this unified approach viable: a new large-scale dataset and the tokenization of each modality. Firstly, we propose a new dataset that consists of more than 1,300 hours of paired audio-score image data collected from YouTube videos, which is an order of magnitude larger than any existing music modal translation datasets. Secondly, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into a sequence of tokens, enabling a single encoder-decoder Transformer to tackle multiple cross-modal translation as one coherent sequence-to-sequence task. Experimental results confirm that our unified multitask model improves upon single-task baselines in several key areas, notably reducing the symbol error rate for optical music recognition from 24.58% to a state-of-the-art 13.67%, while similarly substantial improvements are observed across the other translation tasks. Notably, our approach achieves the first successful score-image-conditioned audio generation, marking a significant breakthrough in cross-modal music generation.
zh

[CV-61] Robust Multimodal Segmentation with Representation Regularization and Hybrid Prototype Distillation

【速读】:该论文旨在解决多模态语义分割(Multi-modal Semantic Segmentation, MMSS)在实际应用场景中因动态环境、传感器故障和噪声干扰导致的性能与理论模型之间的差距问题。其解决方案的关键在于提出一种两阶段框架RobustSeg,该框架通过两个核心组件提升多模态的鲁棒性:混合原型蒸馏模块(Hybrid Prototype Distillation Module, HPDM)和表征正则化模块(Representation Regularization Module, RRM)。HPDM通过将特征转换为紧凑原型,实现跨模态的混合知识蒸馏并缓解模态缺失带来的偏差,而RRM则通过优化功能熵来减少教师模型与学生模型之间的表征差异。

链接: https://arxiv.org/abs/2505.12861
作者: Jiaqi Tan,Xu Zheng,Yang Liu
机构: BUPT(北京邮电大学); HKUST(GZ)(香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal semantic segmentation (MMSS) faces significant challenges in real-world scenarios due to dynamic environments, sensor failures, and noise interference, creating a gap between theoretical models and practical performance. To address this, we propose a two-stage framework called RobustSeg, which enhances multi-modal robustness through two key components: the Hybrid Prototype Distillation Module (HPDM) and the Representation Regularization Module (RRM). In the first stage, RobustSeg pre-trains a multi-modal teacher model using complete modalities. In the second stage, a student model is trained with random modality dropout while learning from the teacher via HPDM and RRM. HPDM transforms features into compact prototypes, enabling cross-modal hybrid knowledge distillation and mitigating bias from missing modalities. RRM reduces representation discrepancies between the teacher and student by optimizing functional entropy through the log-Sobolev inequality. Extensive experiments on three public benchmarks demonstrate that RobustSeg outperforms previous state-of-the-art methods, achieving improvements of +2.76%, +4.56%, and +0.98%, respectively. Code is available at: this https URL.
zh

[CV-62] owards a Universal Image Degradation Model via Content-Degradation Disentanglement

【速读】:该论文旨在解决现有图像退化合成模型在泛化能力和适应性方面的不足,这些模型通常只能生成特定或有限类型的退化,并且需要用户提供的退化参数。其解决方案的关键在于提出一种通用的退化模型,能够自动提取并解耦均匀(全局)和非均匀(空间变化)的退化特征,从而实现复杂且现实的退化合成,而无需用户干预。该方法通过压缩解耦策略分离退化信息,并引入两个新模块以提取和整合非均匀退化,提升了模型在电影颗粒模拟和盲图像恢复任务中的准确性和适应性。

链接: https://arxiv.org/abs/2505.12860
作者: Wenbo Yang,Zhongling Wang,Zhou Wang
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Image degradation synthesis is highly desirable in a wide variety of applications ranging from image restoration to simulating artistic effects. Existing models are designed to generate one specific or a narrow set of degradations, which often require user-provided degradation parameters. As a result, they lack the generalizability to synthesize degradations beyond their initial design or adapt to other applications. Here we propose the first universal degradation model that can synthesize a broad spectrum of complex and realistic degradations containing both homogeneous (global) and inhomogeneous (spatially varying) components. Our model automatically extracts and disentangles homogeneous and inhomogeneous degradation features, which are later used for degradation synthesis without user intervention. A disentangle-by-compression method is proposed to separate degradation information from images. Two novel modules for extracting and incorporating inhomogeneous degradations are created to model inhomogeneous components in complex degradations. We demonstrate the model’s accuracy and adaptability in film-grain simulation and blind image restoration tasks. The demo video, code, and dataset of this project will be released upon publication at this http URL.
zh

[CV-63] he Way Up: A Dataset for Hold Usage Detection in Sport Climbing CVPR2025

【速读】:该论文试图解决在运动攀岩中检测运动员在路线上的位置以及识别握点使用的问题(hold usage),当前缺乏带有详细握点使用标注的攀岩数据集。解决方案的关键在于引入一个包含22个标注攀岩视频的数据集,提供握点位置、使用顺序及使用时间的地面真实标签,并探索基于关键点的2D姿态估计模型来检测握点使用,通过分析特定关节的关键点及其与攀岩握点的重叠来确定使用情况。

链接: https://arxiv.org/abs/2505.12854
作者: Anna Maschek,David C. Schedl
机构: University of Applied Sciences Upper Austria, Campus Hagenberg, Austria
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted at the International Workshop on Computer Vision in Sports (CVsports) at CVPR 2025

点击查看摘要

Abstract:Detecting an athlete’s position on a route and identifying hold usage are crucial in various climbing-related applications. However, no climbing dataset with detailed hold usage annotations exists to our knowledge. To address this issue, we introduce a dataset of 22 annotated climbing videos, providing ground-truth labels for hold locations, usage order, and time of use. Furthermore, we explore the application of keypoint-based 2D pose-estimation models for detecting hold usage in sport climbing. We determine usage by analyzing the key points of certain joints and the corresponding overlap with climbing holds. We evaluate multiple state-of-the-art models and analyze their accuracy on our dataset, identifying and highlighting climbing-specific challenges. Our dataset and results highlight key challenges in climbing-specific pose estimation and establish a foundation for future research toward AI-assisted systems for sports climbing.
zh

[CV-64] Accelerate TarFlow Sampling with GS-Jacobi Iteration

【速读】:该论文试图解决TarFlow模型在图像生成任务中采样过程速度缓慢的问题,这一问题源于其因果注意力机制需要顺序计算。解决方案的关键在于引入Gauss-Seidel-Jacobi(GS-Jacobi)迭代方法,并结合两种优化策略:Convergence Ranking Metric (CRM) 和 Initial Guessing Metric (IGM)。CRM用于识别TarFlow模块的收敛特性,而IGM用于评估迭代初始值的质量,从而有效提升采样效率并保持生成图像的质量。

链接: https://arxiv.org/abs/2505.12849
作者: Ben Liu,Zhen Qin
机构: TapTap( TapTap); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Image generation models have achieved widespread applications. As an instance, the TarFlow model combines the transformer architecture with Normalizing Flow models, achieving state-of-the-art results on multiple benchmarks. However, due to the causal form of attention requiring sequential computation, TarFlow’s sampling process is extremely slow. In this paper, we demonstrate that through a series of optimization strategies, TarFlow sampling can be greatly accelerated by using the Gauss-Seidel-Jacobi (abbreviated as GS-Jacobi) iteration method. Specifically, we find that blocks in the TarFlow model have varying importance: a small number of blocks play a major role in image generation tasks, while other blocks contribute relatively little; some blocks are sensitive to initial values and prone to numerical overflow, while others are relatively robust. Based on these two characteristics, we propose the Convergence Ranking Metric (CRM) and the Initial Guessing Metric (IGM): CRM is used to identify whether a TarFlow block is “simple” (converges in few iterations) or “tough” (requires more iterations); IGM is used to evaluate whether the initial value of the iteration is good. Experiments on four TarFlow models demonstrate that GS-Jacobi sampling can significantly enhance sampling efficiency while maintaining the quality of generated images (measured by FID), achieving speed-ups of 4.53x in Img128cond, 5.32x in AFHQ, 2.96x in Img64uncond, and 2.51x in Img64cond without degrading FID scores or sample quality. Code and checkpoints are accessible on this https URL
zh

[CV-65] A Study on the Refining Handwritten Font by Mixing Font Styles

【速读】:该论文试图解决手写字体因笔迹不清或不一致而导致可读性差的问题。解决方案的关键在于提出一种名为FontFusionGAN (FFGAN) 的新方法,该方法通过结合生成对抗网络 (Generative Adversarial Network, GAN) 来生成融合手写字体和印刷字体优点的字体。该方法在手写与印刷字体数据集上进行训练,能够生成清晰且视觉吸引力强的字体图像,从而显著提升原始字体的可读性,同时保留其独特的美学特征。

链接: https://arxiv.org/abs/2505.12834
作者: Avinash Kumar,Kyeolhee Kang,Ammar ul Hassan,Jaeyoung Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 3 figures, MITA 2023 (The 19th International Conference on Multimedia Information Technology and Applications July. 11 ~ July 14, 2023, Technical University of Ostrava, Ostrava, Czech)

点击查看摘要

Abstract:Handwritten fonts have a distinct expressive character, but they are often difficult to read due to unclear or inconsistent handwriting. FontFusionGAN (FFGAN) is a novel method for improving handwritten fonts by combining them with printed fonts. Our method implements generative adversarial network (GAN) to generate font that mix the desirable features of handwritten and printed fonts. By training the GAN on a dataset of handwritten and printed fonts, it can generate legible and visually appealing font images. We apply our method to a dataset of handwritten fonts and demonstrate that it significantly enhances the readability of the original fonts while preserving their unique aesthetic. Our method has the potential to improve the readability of handwritten fonts, which would be helpful for a variety of applications including document creation, letter writing, and assisting individuals with reading and writing difficulties. In addition to addressing the difficulties of font creation for languages with complex character sets, our method is applicable to other text-image-related tasks, such as font attribute control and multilingual font style transfer.
zh

[CV-66] Mitigating Hallucination in VideoLLM s via Temporal-Aware Activation Engineering

【速读】:该论文试图解决视频领域中多模态大语言模型(Multimodal Large Language Models, MLLMs)存在的幻觉(hallucination)问题,即模型生成看似合理但不准确的输出。现有解决方案中,激活工程(activation engineering)在文本和图像大语言模型中已被证明有效,但其在视频大语言模型(VideoLLMs)中的适用性尚未被系统研究。该论文的关键在于首次系统地探究了激活工程在视频大语言模型中缓解幻觉的有效性及其机制,发现模型对幻觉的敏感性主要取决于时间变化(temporal variation)而非任务类型,并提出了一种基于时间感知的激活工程框架,通过自适应识别和操作与幻觉相关的模块来显著减少幻觉,而无需额外的模型微调。

链接: https://arxiv.org/abs/2505.12826
作者: Jianfeng Cai,Wengang Zhou,Zongmeng Zhang,Jiale Hong,Nianji Zhan,Houqiang Li
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Jiaotong University (上海交通大学); Merchants Union Consumer Finance Company Limited (商户联盟消费金融公司 Limited)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress in video this http URL, hallucination, where the model generates plausible yet incorrect outputs, persists as a significant and under-addressed challenge in the video domain. Among existing solutions, activation engineering has proven successful in mitigating hallucinations in LLMs and ImageLLMs, yet its applicability to VideoLLMs remains largely unexplored. In this work, we are the first to systematically investigate the effectiveness and underlying mechanisms of activation engineering for mitigating hallucinations in VideoLLMs. We initially conduct an investigation of the key factors affecting the performance of activation engineering and find that a model’s sensitivity to hallucination depends on \textbftemporal variation rather than task type. Moreover, selecting appropriate internal modules and dataset for activation engineering is critical for reducing hallucination. Guided by these findings, we propose a temporal-aware activation engineering framework for VideoLLMs, which adaptively identifies and manipulates hallucination-sensitive modules based on the temporal variation characteristic, substantially mitigating hallucinations without additional LLM fine-tuning. Experiments across multiple models and benchmarks demonstrate that our method markedly reduces hallucination in VideoLLMs, thereby validating the robustness of our findings.
zh

[CV-67] Rethinking Features-Fused-Pyramid-Neck for Object Detection ECCV2024

【速读】:该论文旨在解决多头检测器中因特征金字塔不同层级表示强制点对点融合导致的特征错位问题。其解决方案的关键在于设计了一个独立层级金字塔(IHP)架构,以评估无特征融合的金字塔颈部的有效性,并引入软最近邻插值(SNI)方法结合权重缩放因子,以减轻不同层级特征融合的影响,同时保留关键纹理信息。此外,还提出了扩展空间窗口下的特征自适应选择方法(ESD),以保留空间特征并增强轻量级卷积技术(GSConvE)。这些改进最终形成了二次特征对齐方案(SA),实现了实时检测的最优性能。

链接: https://arxiv.org/abs/2505.12820
作者: Hulin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2024

点击查看摘要

Abstract:Multi-head detectors typically employ a features-fused-pyramid-neck for multi-scale detection and are widely adopted in the industry. However, this approach faces feature misalignment when representations from different hierarchical levels of the feature pyramid are forcibly fused point-to-point. To address this issue, we designed an independent hierarchy pyramid (IHP) architecture to evaluate the effectiveness of the features-unfused-pyramid-neck for multi-head detectors. Subsequently, we introduced soft nearest neighbor interpolation (SNI) with a weight downscaling factor to mitigate the impact of feature fusion at different hierarchies while preserving key textures. Furthermore, we present a features adaptive selection method for down sampling in extended spatial windows (ESD) to retain spatial features and enhance lightweight convolutional techniques (GSConvE). These advancements culminate in our secondary features alignment solution (SA) for real-time detection, achieving state-of-the-art results on Pascal VOC and MS COCO. Code will be released at this https URL. This paper has been accepted by ECCV2024 and published on Springer Nature.
zh

[CV-68] Informed Mixing – Improving Open Set Recognition via Attribution-based Augmentation

【速读】:该论文试图解决开放集识别(Open Set Recognition, OSR)中检测未知类别的问题,这一问题在当前视觉模型中仍是一个未解难题。解决方案的关键在于提出一种名为GradMix的数据增强方法,该方法通过在训练过程中动态利用模型的基于梯度的归因图(gradient-based attribution maps)来遮蔽已学习的概念,从而促使模型从同一数据源中学习更全面的代表性特征。

链接: https://arxiv.org/abs/2505.12803
作者: Jiawen Xu,Odej Kao,Margret Keuper
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Open set recognition (OSR) is devised to address the problem of detecting novel classes during model inference. Even in recent vision models, this remains an open issue which is receiving increasing attention. Thereby, a crucial challenge is to learn features that are relevant for unseen categories from given data, for which these features might not be discriminative. To facilitate this process and “optimize to learn” more diverse features, we propose GradMix, a data augmentation method that dynamically leverages gradient-based attribution maps of the model during training to mask out already learned concepts. Thus GradMix encourages the model to learn a more complete set of representative features from the same data source. Extensive experiments on open set recognition, close set classification, and out-of-distribution detection reveal that our method can often outperform the state-of-the-art. GradMix can further increase model robustness to corruptions as well as downstream classification performance for self-supervised learning, indicating its benefit for model generalization.
zh

[CV-69] Enhancing Transformers Through Conditioned Embedded Tokens

【速读】:该论文试图解决Transformer模型中注意力块(attention block)固有的病态性(ill-conditioning)问题,该问题阻碍了基于梯度的优化并导致训练效率低下。解决方案的关键在于建立注意力块条件与嵌入标记数据条件之间的直接关系,并引入条件化嵌入标记(conditioned embedded tokens),通过系统性地修改嵌入标记来改善注意力机制的条件,从而显著缓解病态性,提升训练的稳定性和效率。

链接: https://arxiv.org/abs/2505.12789
作者: Hemanth Saratchandran,Simon Lucey
机构: Australian Institute for Machine Learning, University of Adelaide
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.
zh

[CV-70] AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning

【速读】:该论文旨在解决当前3D多模态大模型(3D Multimodal Models, 3D LMMs)在进行多模态推理时存在的计算效率低下和冗余信息流问题。其关键解决方案是提出AdaToken-3D,一种基于空间贡献分析的自适应空间标记优化框架,通过动态剪枝冗余标记来提升模型效率。该方法通过注意力模式挖掘量化标记级信息流,自动适配不同3D LMM架构的剪枝策略,从而在保持任务准确性的前提下显著提升推理速度并降低计算量。

链接: https://arxiv.org/abs/2505.12782
作者: Kai Zhang,Xingyu Chen,Xiaofeng Zhang
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning, demonstrating remarkable capabilities in 3D scene understanding. However, current 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies: excessive computational overhead and redundant information flows. Unlike 2D VLMs processing single images, 3D LMMs exhibit inherent architectural redundancy due to the heterogeneous mechanisms between spatial tokens and visual tokens. To address this challenge, we propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens through spatial contribution analysis. Our method automatically tailors pruning strategies to different 3D LMM architectures by quantifying token-level information flows via attention pattern mining. Extensive experiments on LLaVA-3D (a 7B parameter 3D-LMM) demonstrate that AdaToken-3D achieves 21% faster inference speed and 63% FLOPs reduction while maintaining original task accuracy. Beyond efficiency gains, this work systematically investigates redundancy patterns in multimodal spatial information flows through quantitative token interaction analysis. Our findings reveal that over 60% of spatial tokens contribute minimally ( 5%) to the final predictions, establishing theoretical foundations for efficient 3D multimodal learning.
zh

[CV-71] UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes

【速读】:该论文旨在解决复杂场景中人体运动合成的问题,特别是针对传统文本到运动(Text-to-Motion)任务之外,需要整合静态环境、可移动物体、自然语言提示和空间航点等多模态信息的挑战。现有语言条件化运动模型由于运动分词(motion tokenization)的局限性,在生成具有场景意识的人体运动时存在信息丢失,无法捕捉3D人体运动的连续性和上下文依赖性。解决方案的关键在于提出UniHM,这是一个统一的运动语言模型,采用基于扩散的生成方法进行场景感知的人体运动合成。其核心创新包括:混合运动表示(融合连续6DoF运动与离散局部运动标记)、一种超越传统VQ-VAEs的无查找量化变分自编码器(LFQ-VAE),以及增强版的Lingo数据集,提升了场景特定运动学习的监督效果。

链接: https://arxiv.org/abs/2505.12774
作者: Zichen Geng,Zeeshan Hayder,Wei Liu,Ajmal Mian
机构: University of Western Australia (西澳大学); Commonwealth Scientific and Industrial Research Organisation (澳大利亚联邦科学与工业研究组织)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human motion synthesis in complex scenes presents a fundamental challenge, extending beyond conventional Text-to-Motion tasks by requiring the integration of diverse modalities such as static environments, movable objects, natural language prompts, and spatial waypoints. Existing language-conditioned motion models often struggle with scene-aware motion generation due to limitations in motion tokenization, which leads to information loss and fails to capture the continuous, context-dependent nature of 3D human movement. To address these issues, we propose UniHM, a unified motion language model that leverages diffusion-based generation for synthesizing scene-aware human motion. UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes. Our approach introduces three key contributions: (1) a mixed-motion representation that fuses continuous 6DoF motion with discrete local motion tokens to improve motion realism; (2) a novel Look-Up-Free Quantization VAE (LFQ-VAE) that surpasses traditional VQ-VAEs in both reconstruction accuracy and generative performance; and (3) an enriched version of the Lingo dataset augmented with HumanML3D annotations, providing stronger supervision for scene-specific motion learning. Experimental results demonstrate that UniHM achieves comparative performance on the OMOMO benchmark for text-to-HOI synthesis and yields competitive results on HumanML3D for general text-conditioned motion generation.
zh

[CV-72] Pyramid Sparse Transformer: Enhancing Multi-Scale Feature Fusion with Dynamic Token Selection

【速读】:该论文旨在解决特征融合在视觉模型中因计算复杂度高而限制其在资源受限环境中的效率问题。解决方案的关键在于提出一种轻量级、即插即用的模块——金字塔稀疏Transformer(Pyramid Sparse Transformer, PST),通过粗粒度到细粒度的token选择和共享注意力参数,在减少计算量的同时保持空间细节,从而实现高效的特征融合。

链接: https://arxiv.org/abs/2505.12772
作者: Junyi Hu,Tian Bai,Fengyi Wu,Zhengming Peng,Yi Zhang
机构: Tsinghua University (清华大学); UESTC (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Feature fusion is critical for high-performance vision models but often incurs prohibitive complexity. However, prevailing attention-based fusion methods often involve significant computational complexity and implementation challenges, limiting their efficiency in resource-constrained environments. To address these issues, we introduce the Pyramid Sparse Transformer (PST), a lightweight, plug-and-play module that integrates coarse-to-fine token selection and shared attention parameters to reduce computation while preserving spatial detail. PST can be trained using only coarse attention and seamlessly activated at inference for further accuracy gains without retraining. When added to state-of-the-art real-time detection models, such as YOLOv11-N/S/M, PST yields mAP improvements of 0.9%, 0.5%, and 0.4% on MS COCO with minimal latency impact. Likewise, embedding PST into ResNet-18/50/101 as backbones, boosts ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively. These results demonstrate PST’s effectiveness as a simple, hardware-friendly enhancement for both detection and classification tasks.
zh

[CV-73] Reasoning -OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?

【速读】:该论文试图解决大型多模态模型(Large Multimodal Models, LMMs)在基于光学字符识别(OCR)线索处理复杂逻辑推理问题方面的能力不足问题。现有基准测试主要关注LMMs在相对简单的视觉问答和视觉文本解析任务中的表现,而对LMMs在复杂逻辑推理任务中的表现研究较少。为解决这一问题,作者提出了Reasoning-OCR基准,该基准涵盖六个视觉场景,并设计了150个分类为六种推理挑战的精心构建的问题,同时尽量减少领域专业知识的影响。其关键在于通过丰富的视觉文本线索来评估和提升LMMs的复杂推理能力。

链接: https://arxiv.org/abs/2505.12766
作者: Haibin He,Maoyuan Ye,Jing Zhang,Xiantao Cai,Juhua Liu,Bo Du,Dacheng Tao
机构: Wuhan University(武汉大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs’ abilities of relatively simple visual question answering, visual-text parsing, etc. However, the extent to which LMMs can deal with complex logical reasoning problems based on OCR cues is relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges, underscoring the urgent to improve the reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at this https URL.
zh

[CV-74] Its not you its me – Global urban visual perception varies across demographics and personalities

【速读】:该论文试图解决当前城市规划决策中对人群偏好和需求理解不足的问题,尤其是现有方法常将多文化、多城市人口的偏好混合处理,从而掩盖了重要的群体差异并可能放大偏见。其解决方案的关键在于通过全球范围内的街道景观视觉感知调查,构建一个考虑社会经济因素的大型数据集(Street Perception Evaluation Considering Socioeconomics, SPECS),并分析包括性别、年龄、收入、教育、种族、民族以及首次纳入的个性特征在内的多种人口统计学因素如何影响人们对城市街道景观的感知。该研究揭示了地理位置情感在不同城市街道景观比较中的延续性,并指出现有机器学习模型在预测感知评分时可能存在偏差,强调了在制定针对性干预措施时应考虑本地居民的实际感知。

链接: https://arxiv.org/abs/2505.12758
作者: Matias Quintana,Youlong Gu,Xiucheng Liang,Yujun Hou,Koichi Ito,Yihan Zhu,Mahmoud Abdelrahman,Filip Biljecki
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Understanding people’s preferences and needs is crucial for urban planning decisions, yet current approaches often combine them from multi-cultural and multi-city populations, obscuring important demographic differences and risking amplifying biases. We conducted a large-scale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics – including gender, age, income, education, race and ethnicity, and, for the first time, personality traits – shape perceptions among 1,000 participants, with balanced demographics, from five countries and 45 nationalities. This dataset, introduced as Street Perception Evaluation Considering Socioeconomics (SPECS), exhibits statistically significant differences in perception scores in six traditionally used indicators (safe, lively, wealthy, beautiful, boring, and depressing) and four new ones we propose (live nearby, walk, cycle, green) among demographics and personalities. We revealed that location-based sentiments are carried over in people’s preferences when comparing urban streetscapes with other cities. Further, we compared the perception scores based on where participants and streetscapes are from. We found that an off-the-shelf machine learning model trained on an existing global perception dataset tends to overestimate positive indicators and underestimate negative ones compared to human responses, suggesting that targeted intervention should consider locals’ perception. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.
zh

[CV-75] LiDAR MOT-DETR: A LiDAR-based Two-Stage Transformer for 3D Multiple Object Tracking

【速读】:该论文旨在解决从LiDAR点云中进行多目标跟踪(Multi-object tracking)所面临的独特挑战,包括数据的稀疏性和不规则性,以及跨帧的时间一致性需求。传统跟踪系统依赖于手工设计的特征和运动模型,在密集或高速移动场景中难以保持对象身份的一致性。该研究提出了一种基于LiDAR的两阶段DETR启发式Transformer架构,其关键在于通过平滑阶段对任意现成检测器生成的LiDAR目标检测结果进行时间窗口内的优化,再通过跟踪阶段利用DETR注意力块,结合点云作为上下文信息,实现目标在时间上的关联与跟踪。

链接: https://arxiv.org/abs/2505.12753
作者: Martha Teiko Teye,Ori Maoz,Matthias Rottmann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-object tracking from LiDAR point clouds presents unique challenges due to the sparse and irregular nature of the data, compounded by the need for temporal coherence across frames. Traditional tracking systems often rely on hand-crafted features and motion models, which can struggle to maintain consistent object identities in crowded or fast-moving scenes. We present a lidar-based two-staged DETR inspired transformer; a smoother and tracker. The smoother stage refines lidar object detections, from any off-the-shelf detector, across a moving temporal window. The tracker stage uses a DETR-based attention block to maintain tracks across time by associating tracked objects with the refined detections using the point cloud as context. The model is trained on the datasets nuScenes and KITTI in both online and offline (forward peeking) modes demonstrating strong performance across metrics such as ID-switch and multiple object tracking accuracy (MOTA). The numerical results indicate that the online mode outperforms the lidar-only baseline and SOTA models on the nuScenes dataset, with an aMOTA of 0.722 and an aMOTP of 0.475, while the offline mode provides an additional 3 pp aMOTP
zh

[CV-76] Structure-based Anomaly Detection and Clustering

【速读】:该论文旨在解决在结构化和流数据场景下的异常检测问题,以及在网络安全领域中对未知恶意软件家族的开放集识别问题。其解决方案的关键在于提出了一系列基于结构的异常检测方法,如Preference Isolation Forest (PIF) 及其变体Voronoi-iForest和RuzHash-iForest,通过将数据嵌入高维偏好空间并利用几何距离或局部敏感哈希进行离群点隔离,同时引入Sliding-PIF以处理流数据中的局部流形信息;此外,还提出了MultiLink方法用于结构化聚类,通过模型感知的链接策略实现多类别结构的鲁棒恢复;在在线异常检测方面,提出了Online-iForest,利用自适应多分辨率直方图和动态更新树结构来实时跟踪数据变化;最后,通过增强梯度提升分类器的MaxLogit方法实现了对未知恶意软件的检测。

链接: https://arxiv.org/abs/2505.12751
作者: Filippo Leveni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Doctoral dissertation at Politecnico di Milano

点击查看摘要

Abstract:Anomaly detection is a fundamental problem in domains such as healthcare, manufacturing, and cybersecurity. This thesis proposes new unsupervised methods for anomaly detection in both structured and streaming data settings. In the first part, we focus on structure-based anomaly detection, where normal data follows low-dimensional manifolds while anomalies deviate from them. We introduce Preference Isolation Forest (PIF), which embeds data into a high-dimensional preference space via manifold fitting, and isolates outliers using two variants: Voronoi-iForest, based on geometric distances, and RuzHash-iForest, leveraging Locality Sensitive Hashing for scalability. We also propose Sliding-PIF, which captures local manifold information for streaming scenarios. Our methods outperform existing techniques on synthetic and real datasets. We extend this to structure-based clustering with MultiLink, a novel method for recovering multiple geometric model families in noisy data. MultiLink merges clusters via a model-aware linkage strategy, enabling robust multi-class structure recovery. It offers key advantages over existing approaches, such as speed, reduced sensitivity to thresholds, and improved robustness to poor initial sampling. The second part of the thesis addresses online anomaly detection in evolving data streams. We propose Online Isolation Forest (Online-iForest), which uses adaptive, multi-resolution histograms and dynamically updates tree structures to track changes over time. It avoids retraining while achieving accuracy comparable to offline models, with superior efficiency for real-time applications. Finally, we tackle anomaly detection in cybersecurity via open-set recognition for malware classification. We enhance a Gradient Boosting classifier with MaxLogit to detect unseen malware families, a method now integrated into Cleafy’s production system.
zh

[CV-77] OpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation

【速读】:该论文旨在解决当前多手灵巧遥操作系统缺乏统一基准以进行公平、可复现比较的问题。其关键解决方案是提出TeleOpBench,一个以模拟器为中心的基准,包含30个高保真任务环境,并实现了四种典型的遥操作模态(运动捕捉、VR设备、臂手外骨骼和单目视觉跟踪),通过统一的协议和度量体系进行评估,同时验证了仿真性能与真实世界行为之间的强相关性,从而确立了遥操作研究的共同基准。

链接: https://arxiv.org/abs/2505.12748
作者: Hangyu Li,Qin Zhao,Haoran Xu,Xinyu Jiang,Qingwei Ben,Feiyu Jia,Haoyu Zhao,Liang Xu,Jia Zeng,Hanqing Wang,Bo Dai,Junting Dong,Jiangmiao Pang
机构: Shanghai AI Laboratory; Zhejiang University; The Chinese University of Hong Kong; The Hong Kong University of Science and Technology (Guangzhou); The University of Hong Kong; Feeling AI
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:Teleoperation is a cornerstone of embodied-robot learning, and bimanual dexterous teleoperation in particular provides rich demonstrations that are difficult to obtain with fully autonomous systems. While recent studies have proposed diverse hardware pipelines-ranging from inertial motion-capture gloves to exoskeletons and vision-based interfaces-there is still no unified benchmark that enables fair, reproducible comparison of these systems. In this paper, we introduce TeleOpBench, a simulator-centric benchmark tailored to bimanual dexterous teleoperation. TeleOpBench contains 30 high-fidelity task environments that span pick-and-place, tool use, and collaborative manipulation, covering a broad spectrum of kinematic and force-interaction difficulty. Within this benchmark we implement four representative teleoperation modalities-(i) MoCap, (ii) VR device, (iii) arm-hand exoskeletons, and (iv) monocular vision tracking-and evaluate them with a common protocol and metric suite. To validate that performance in simulation is predictive of real-world behavior, we conduct mirrored experiments on a physical dual-arm platform equipped with two 6-DoF dexterous hands. Across 10 held-out tasks we observe a strong correlation between simulator and hardware performance, confirming the external validity of TeleOpBench. TeleOpBench establishes a common yardstick for teleoperation research and provides an extensible platform for future algorithmic and hardware innovation.
zh

[CV-78] MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

【速读】:该论文旨在解决视觉生成中传统方法在建模视觉数据先验时存在的计算复杂度高、冗余问题以及生成延迟较大的问题。其关键解决方案是提出一种新的自回归框架——马尔可夫视觉自回归建模(Markovian Visual AutoRegressive modeling, MVAR),通过引入尺度马尔可夫轨迹和空间马尔可夫注意力机制,分别减少尺度间和空间上的冗余依赖,从而将注意力计算的复杂度从O(N²)降低至O(Nk),显著降低了GPU内存消耗并提升了训练效率。

链接: https://arxiv.org/abs/2505.12742
作者: Jinhua Zhang,Wei Long,Minghao Han,Weiyi You,Shuhang Gu
机构: University of Electronic Science and Technology of China (中国电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that only takes as input the features of adjacent preceding scale for next-scale prediction, enabling the adoption of a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, for the pursuit of reduced modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small model trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.
zh

[CV-79] FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)和多模态模型(Multimodal Models, LMMs)在推理过程中因解码速度缓慢而导致的效率问题,尤其针对LMM中视觉输入token数量多且信息密度低的问题。其解决方案的关键在于提出一种专为LMM设计的推测解码框架FLASH(Fast Latent-Aware Semi-Autoregressive Heuristics),该框架利用多模态数据的两个关键特性:一是通过轻量级潜在感知的token压缩机制减少视觉token的冗余,二是采用半自回归解码策略在每次前向传播中生成多个token,从而在保持高接受率的同时加速解码过程。

链接: https://arxiv.org/abs/2505.12728
作者: Zihua Wang,Ruibo Li,Haozhe Du,Joey Tianyi Zhou,Yu Zhang,Xu Yang
机构: Southeast University (东南大学); Nanyang Technological University (南洋理工大学); Hunan University (湖南大学); A*STAR Centre for Frontier AI Research (CFAR) (新加坡科技研究局前沿人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds. This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text – an issue exacerbated by recent trends toward finer-grained visual tokenizations to boost performance. Speculative decoding has been effective in accelerating LLM inference by using a smaller draft model to generate candidate tokens, which are then selectively verified by the target model, improving speed without sacrificing output quality. While this strategy has been extended to LMMs, existing methods largely overlook the unique properties of visual inputs and depend solely on text-based draft models. In this work, we propose \textbfFLASH (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs, which leverages two key properties of multimodal data to design the draft model. First, to address redundancy in visual tokens, we propose a lightweight latent-aware token compression mechanism. Second, recognizing that visual objects often co-occur within a scene, we employ a semi-autoregressive decoding strategy to generate multiple tokens per forward pass. These innovations accelerate draft decoding while maintaining high acceptance rates, resulting in faster overall inference. Experiments show that FLASH significantly outperforms prior speculative decoding approaches in both unimodal and multimodal settings, achieving up to \textbf2.68 \times speed-up on video captioning and \textbf2.55 \times on visual instruction tuning tasks compared to the original LMM.
zh

[CV-80] VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection

【速读】:该论文试图解决多传感器模态融合中忽视环境条件和传感器输入细微变化的问题,导致现有融合方法在不同环境下难以自适应地调整各模态权重。解决方案的关键在于引入视觉-语言条件融合(Vision-Language Conditioned Fusion, VLC Fusion),该框架利用视觉-语言模型(Vision-Language Model, VLM)捕捉高阶环境上下文信息,并据此动态调整各模态的权重,从而提升对象检测的适应性和准确性。

链接: https://arxiv.org/abs/2505.12715
作者: Aditya Taparia,Noel Ngu,Mario Leiva,Joshua Shay Kricheli,John Corcoran,Nathaniel D. Bastian,Gerardo Simari,Paulo Shakarian,Ransalu Senanayake
机构: Arizona State University (亚利桑那州立大学); Universidad Nacional del Sur (国立南大学); Institute for Computer Science and Engineering (计算机科学与工程研究所); U.S. Department of Defense (美国国防部); United States Military Academy (美国军事学院); Syracuse University (锡拉丘兹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 19 figures

点击查看摘要

Abstract:Although fusing multiple sensor modalities can enhance object detection performance, existing fusion approaches often overlook subtle variations in environmental conditions and sensor inputs. As a result, they struggle to adaptively weight each modality under such variations. To address this challenge, we introduce Vision-Language Conditioned Fusion (VLC Fusion), a novel fusion framework that leverages a Vision-Language Model (VLM) to condition the fusion process on nuanced environmental cues. By capturing high-level environmental context such as as darkness, rain, and camera blurring, the VLM guides the model to dynamically adjust modality weights based on the current scene. We evaluate VLC Fusion on real-world autonomous driving and military target detection datasets that include image, LIDAR, and mid-wave infrared modalities. Our experiments show that VLC Fusion consistently outperforms conventional fusion baselines, achieving improved detection accuracy in both seen and unseen scenarios.
zh

[CV-81] IA-MVS: Instance-Focused Adaptive Depth Sampling for Multi-View Stereo

【速读】:该论文试图解决多视图立体(Multi-view stereo, MVS)模型中深度估计精度受限的问题,主要原因是单个实例的深度覆盖范围小于整个场景,且初始阶段的偏差会随着处理过程累积。解决方案的关键在于提出一种实例自适应的MVS方法(Instance-Adaptive MVS, IA-MVS),通过缩小深度假设范围并针对每个实例进行细化来提升深度估计精度,同时引入基于实例内深度连续性先验的过滤机制以增强鲁棒性。此外,论文还构建了一个基于条件概率的置信度估计数学模型,以避免现有置信度估计对IA-MVS性能的负面影响。

链接: https://arxiv.org/abs/2505.12714
作者: Yinzhe Wang,Yiwen Xiao,Hu Wang,Yiping Xu,Yan Tian
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-view stereo (MVS) models based on progressive depth hypothesis narrowing have made remarkable advancements. However, existing methods haven’t fully utilized the potential that the depth coverage of individual instances is smaller than that of the entire scene, which restricts further improvements in depth estimation precision. Moreover, inevitable deviations in the initial stage accumulate as the process advances. In this paper, we propose Instance-Adaptive MVS (IA-MVS). It enhances the precision of depth estimation by narrowing the depth hypothesis range and conducting refinement on each instance. Additionally, a filtering mechanism based on intra-instance depth continuity priors is incorporated to boost robustness. Furthermore, recognizing that existing confidence estimation can degrade IA-MVS performance on point clouds. We have developed a detailed mathematical model for confidence estimation based on conditional probability. The proposed method can be widely applied in models based on MVSNet without imposing extra training burdens. Our method achieves state-of-the-art performance on the DTU benchmark. The source code is available at this https URL.
zh

[CV-82] Any-to-Any Learning in Computational Pathology via Triplet Multimodal Pretraining

【速读】:该论文旨在解决计算病理学中多模态数据融合的三个关键挑战:异构数据类型的融合需要超越简单拼接的复杂策略以应对高计算成本;缺失模态的常见场景要求模型具备在缺乏某些模态时仍能稳健学习的灵活性;以及CPath下游任务的多样性,需一个能够处理所有模态的统一模型。其解决方案的关键在于提出ALTER,一个任意到任意的三模态预训练框架,整合了组织切片图像(WSI)、基因组学和病理报告,通过模态自适应设计实现灵活的预训练和跨模态表示学习,从而有效应对上述挑战。

链接: https://arxiv.org/abs/2505.12711
作者: Qichen Sun,Zhengrui Guo,Rui Peng,Hao Chen,Jinzhuo Wang
机构: Peking University (北京大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in computational pathology and artificial intelligence have significantly enhanced the utilization of gigapixel whole-slide images and and additional modalities (e.g., genomics) for pathological diagnosis. Although deep learning has demonstrated strong potential in pathology, several key challenges persist: (1) fusing heterogeneous data types requires sophisticated strategies beyond simple concatenation due to high computational costs; (2) common scenarios of missing modalities necessitate flexible strategies that allow the model to learn robustly in the absence of certain modalities; (3) the downstream tasks in CPath are diverse, ranging from unimodal to multimodal, cnecessitating a unified model capable of handling all modalities. To address these challenges, we propose ALTER, an any-to-any tri-modal pretraining framework that integrates WSIs, genomics, and pathology reports. The term “any” emphasizes ALTER’s modality-adaptive design, enabling flexible pretraining with any subset of modalities, and its capacity to learn robust, cross-modal representations beyond WSI-centric approaches. We evaluate ALTER across extensive clinical tasks including survival prediction, cancer subtyping, gene mutation prediction, and report generation, achieving superior or comparable performance to state-of-the-art baselines.
zh

[CV-83] SpatialLLM : From Multi-modality Data to Urban Spatial Intelligence

【速读】:该论文试图解决复杂城市场景中的空间智能任务,传统方法通常依赖地理分析工具或领域专业知识,而SpatialLLM作为一种统一的语言模型,无需任何训练、微调或专家干预即可直接处理多种空间智能任务。解决方案的关键在于从原始空间数据中构建详细且结构化的场景描述,以提示预训练的大语言模型(LLM)进行基于场景的分析,从而实现对空间分布信息的准确感知和零样本执行高级空间智能任务。

链接: https://arxiv.org/abs/2505.12703
作者: Jiabin Chen,Haiping Wang,Jinpeng Li,Yuan Liu,Zhen Dong,Bisheng Yang
机构: Wuhan University (武汉大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose SpatialLLM, a novel approach advancing spatial intelligence tasks in complex urban scenes. Unlike previous methods requiring geographic analysis tools or domain expertise, SpatialLLM is a unified language model directly addressing various spatial intelligence tasks without any training, fine-tuning, or expert intervention. The core of SpatialLLM lies in constructing detailed and structured scene descriptions from raw spatial data to prompt pre-trained LLMs for scene-based analysis. Extensive experiments show that, with our designs, pretrained LLMs can accurately perceive spatial distribution information and enable zero-shot execution of advanced spatial intelligence tasks, including urban planning, ecological analysis, traffic management, etc. We argue that multi-field knowledge, context length, and reasoning ability are key factors influencing LLM performances in urban analysis. We hope that SpatialLLM will provide a novel viable perspective for urban intelligent analysis and management. The code and dataset are available at this https URL.
zh

[CV-84] Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation

【速读】:该论文旨在解决长视频指代视频目标分割(Long-RVOS)中存在的挑战,即现有数据集主要关注几秒内的短视频片段,且多数帧中目标显著可见,无法满足实际应用场景的需求。其解决方案的关键在于提出一种新的基准数据集Long-RVOS,并引入两种新的评估指标以衡量时间一致性与时空一致性。此外,论文还提出了ReferMo方法,该方法通过整合运动信息扩展时间感受野,并采用从局部到全局的架构来捕捉短期动态与长期依赖关系,从而在长视频场景下取得显著性能提升。

链接: https://arxiv.org/abs/2505.12702
作者: Tianming Liang,Haichao Jiang,Yuting Yang,Chaolei Tan,Shuai Li,Wei-Shi Zheng,Jian-Fang Hu
机构: Sun Yat-sen University (中山大学); Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: \url{ this https URL }

点击查看摘要

Abstract:Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focus on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce \textbfLong-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changing. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on the per-frame spatial evaluation, we introduce two new metrics to assess the temporal and spatiotemporal consistency. We benchmark 6 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. Despite simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.
zh

[CV-85] ACOcc:Target-Adaptive Cross-Modal Fusion with Volume Rendering for 3D Semantic Occupancy

【速读】:该论文旨在解决多模态3D语义占用预测中的性能瓶颈问题,主要挑战包括固定融合策略导致的几何-语义不匹配以及稀疏、噪声标注引起的表面细节丢失。其解决方案的关键在于提出了一种目标尺度自适应的双向对称检索机制,该机制通过扩展大目标的邻域以增强上下文感知,缩小小目标的邻域以提升效率并抑制噪声,从而实现跨模态特征的准确对齐。此外,为缓解表面细节丢失问题,引入了基于3D高斯点云渲染的改进体积渲染流水线,结合光度一致性监督优化2D-3D一致性,提升了表面细节重建能力。

链接: https://arxiv.org/abs/2505.12693
作者: Luyao Lei,Shuo Xu,Yifan Bai,Xing Wei
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The performance of multi-modal 3D occupancy prediction is limited by ineffective fusion, mainly due to geometry-semantics mismatch from fixed fusion strategies and surface detail loss caused by sparse, noisy annotations. The mismatch stems from the heterogeneous scale and distribution of point cloud and image features, leading to biased matching under fixed neighborhood fusion. To address this, we propose a target-scale adaptive, bidirectional symmetric retrieval mechanism. It expands the neighborhood for large targets to enhance context awareness and shrinks it for small ones to improve efficiency and suppress noise, enabling accurate cross-modal feature alignment. This mechanism explicitly establishes spatial correspondences and improves fusion accuracy. For surface detail loss, sparse labels provide limited supervision, resulting in poor predictions for small objects. We introduce an improved volume rendering pipeline based on 3D Gaussian Splatting, which takes fused features as input to render images, applies photometric consistency supervision, and jointly optimizes 2D-3D consistency. This enhances surface detail reconstruction while suppressing noise propagation. In summary, we propose TACOcc, an adaptive multi-modal fusion framework for 3D semantic occupancy prediction, enhanced by volume rendering supervision. Experiments on the nuScenes and SemanticKITTI benchmarks validate its effectiveness.
zh

[CV-86] Mamba-Adaptor: State Space Model Adaptor for Visual Recognition CVPR

【速读】:该论文旨在解决Mamba模型在视觉任务中表现不佳的问题,主要受限于顺序模型的三个缺陷:1)因果计算无法获取全局上下文;2)计算当前隐藏状态时存在长程遗忘问题;3)由于序列输入转换导致的空间结构建模能力弱。其解决方案的关键在于提出一种简单而强大的视觉任务适配器(Mamba-Adaptor),包含两个功能模块:Adaptor-T通过轻量级预测模块选择可学习位置作为记忆增强,缓解长程遗忘问题;Adaptor-S利用多尺度扩张卷积核增强空间建模并引入图像归纳偏差。这两个模块共同提升了因果计算中的上下文建模能力。

链接: https://arxiv.org/abs/2505.12685
作者: Fei Xie,Jiahao Nie,Yujin Tang,Wenkang Zhang,Hongshen Zhao
机构: Shanghai Jiao Tong University (上海交通大学); Hangzhou Dianzi University (杭州电子科技大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR paper

点击查看摘要

Abstract:Recent State Space Models (SSM), especially Mamba, have demonstrated impressive performance in visual modeling and possess superior model efficiency. However, the application of Mamba to visual tasks suffers inferior performance due to three main constraints existing in the sequential model: 1) Casual computing is incapable of accessing global context; 2) Long-range forgetting when computing the current hidden states; 3) Weak spatial structural modeling due to the transformed sequential input. To address these issues, we investigate a simple yet powerful vision task Adaptor for Mamba models, which consists of two functional modules: Adaptor-T and Adaptor-S. When solving the hidden states for SSM, we apply a lightweight prediction module Adaptor-T to select a set of learnable locations as memory augmentations to ease long-range forgetting issues. Moreover, we leverage Adapator-S, composed of multi-scale dilated convolutional kernels, to enhance the spatial modeling and introduce the image inductive bias into the feature output. Both modules can enlarge the context modeling in casual computing, as the output is enhanced by the inaccessible features. We explore three usages of Mamba-Adaptor: A general visual backbone for various vision tasks; A booster module to raise the performance of pretrained backbones; A highly efficient fine-tuning module that adapts the base model for transfer learning tasks. Extensive experiments verify the effectiveness of Mamba-Adaptor in three settings. Notably, our Mamba-Adaptor achieves state-of the-art performance on the ImageNet and COCO benchmarks.
zh

[CV-87] On the Mechanisms of Adversarial Data Augmentation for Robust and Adaptive Transfer Learning

【速读】:该论文试图解决跨域迁移学习中因分布偏移导致的模型鲁棒性和适应性不足的问题。其解决方案的关键在于利用对抗性数据增强(Adversarial Data Augmentation, ADA),通过在训练过程中战略性地引入对抗样本,以丰富决策边界并减少对源域特定特征的过拟合,从而提升模型的域泛化能力。此外,论文还提出了一种统一框架,将ADA与一致性正则化和域不变表示学习相结合,进一步增强了模型在无监督和少样本域适应设置下的目标域性能。

链接: https://arxiv.org/abs/2505.12681
作者: Hana Satou,Alan Mitkiy
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transfer learning across domains with distribution shift remains a fundamental challenge in building robust and adaptable machine learning systems. While adversarial perturbations are traditionally viewed as threats that expose model vulnerabilities, recent studies suggest that they can also serve as constructive tools for data augmentation. In this work, we systematically investigate the role of adversarial data augmentation (ADA) in enhancing both robustness and adaptivity in transfer learning settings. We analyze how adversarial examples, when used strategically during training, improve domain generalization by enriching decision boundaries and reducing overfitting to source-domain-specific features. We further propose a unified framework that integrates ADA with consistency regularization and domain-invariant representation learning. Extensive experiments across multiple benchmark datasets – including VisDA, DomainNet, and Office-Home – demonstrate that our method consistently improves target-domain performance under both unsupervised and few-shot domain adaptation settings. Our results highlight a constructive perspective of adversarial learning, transforming perturbation from a destructive attack into a regularizing force for cross-domain transferability.
zh

[CV-88] CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models

【速读】:该论文试图解决文本到图像生成模型在生成过程中可能产生的不安全、受版权保护或侵犯隐私的内容问题。现有安全干预措施普遍存在概念移除不彻底、易被绕过、计算效率低或对无关能力造成副作用等缺陷。解决方案的关键在于提出CURE框架,该框架无需重新训练即可在预训练扩散模型的权重空间中直接实现目标概念的高效、可解释且高度特异性的消除。其核心是Spectral Eraser模块,通过在相关概念的token嵌入上进行奇异值分解,识别并隔离与目标概念相关的特征子空间,从而在单步更新中有效移除 undesired concepts,同时保留安全属性。

链接: https://arxiv.org/abs/2505.12677
作者: Shristi Das Biswas,Arani Roy,Kaushik Roy
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As Text-to-Image models continue to evolve, so does the risk of generating unsafe, copyrighted, or privacy-violating content. Existing safety interventions - ranging from training data curation and model fine-tuning to inference-time filtering and guidance - often suffer from incomplete concept removal, susceptibility to jail-breaking, computational inefficiency, or collateral damage to unrelated capabilities. In this paper, we introduce CURE, a training-free concept unlearning framework that operates directly in the weight space of pre-trained diffusion models, enabling fast, interpretable, and highly specific suppression of undesired concepts. At the core of our method is the Spectral Eraser, a closed-form, orthogonal projection module that identifies discriminative subspaces using Singular Value Decomposition over token embeddings associated with the concepts to forget and retain. Intuitively, the Spectral Eraser identifies and isolates features unique to the undesired concept while preserving safe attributes. This operator is then applied in a single step update to yield an edited model in which the target concept is effectively unlearned - without retraining, supervision, or iterative optimization. To balance the trade-off between filtering toxicity and preserving unrelated concepts, we further introduce an Expansion Mechanism for spectral regularization which selectively modulates singular vectors based on their relative significance to control the strength of forgetting. All the processes above are in closed-form, guaranteeing extremely efficient erasure in only 2 seconds. Benchmarking against prior approaches, CURE achieves a more efficient and thorough removal for targeted artistic styles, objects, identities, or explicit content, with minor damage to original generation ability and demonstrates enhanced robustness against red-teaming.
zh

[CV-89] Few-Step Diffusion via Score identity Distillation

【速读】:该论文旨在解决高分辨率文本到图像扩散模型(如Stable Diffusion XL, SDXL)在知识蒸馏过程中依赖真实或教师合成图像的问题,以及分类器无关引导(classifier-free guidance, CFG)导致的文本-图像对齐与生成多样性之间的权衡问题。其解决方案的关键在于优化Score identity Distillation (SiD)——一种无需数据、一步蒸馏框架,并针对少步生成进行改进,通过理论分析证明匹配所有生成步骤输出的均匀混合与数据分布的一致性,从而避免了步骤特定网络的使用,并可无缝集成到现有流程中。此外,为缓解真实文本-图像对存在的对齐-多样性权衡,引入了基于扩散生成对抗网络(Diffusion GAN)的对抗损失,并提出了两种新的引导策略:Zero-CFG和Anti-CFG,以提升生成多样性而不牺牲对齐效果。

链接: https://arxiv.org/abs/2505.12674
作者: Mingyuan Zhou,Yi Gu,Zhendong Wang
机构: UT Austin (德州大学奥斯汀分校); Google (谷歌); Google DeepMind (谷歌深度思维); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Diffusion distillation has emerged as a promising strategy for accelerating text-to-image (T2I) diffusion models by distilling a pretrained score network into a one- or few-step generator. While existing methods have made notable progress, they often rely on real or teacher-synthesized images to perform well when distilling high-resolution T2I diffusion models such as Stable Diffusion XL (SDXL), and their use of classifier-free guidance (CFG) introduces a persistent trade-off between text-image alignment and generation diversity. We address these challenges by optimizing Score identity Distillation (SiD) – a data-free, one-step distillation framework – for few-step generation. Backed by theoretical analysis that justifies matching a uniform mixture of outputs from all generation steps to the data distribution, our few-step distillation algorithm avoids step-specific networks and integrates seamlessly into existing pipelines, achieving state-of-the-art performance on SDXL at 1024x1024 resolution. To mitigate the alignment-diversity trade-off when real text-image pairs are available, we introduce a Diffusion GAN-based adversarial loss applied to the uniform mixture and propose two new guidance strategies: Zero-CFG, which disables CFG in the teacher and removes text conditioning in the fake score network, and Anti-CFG, which applies negative CFG in the fake score network. This flexible setup improves diversity without sacrificing alignment. Comprehensive experiments on SD1.5 and SDXL demonstrate state-of-the-art performance in both one-step and few-step generation settings, along with robustness to the absence of real images. Our efficient PyTorch implementation, along with the resulting one- and few-step distilled generators, will be released publicly as a separate branch at this https URL.
zh

[CV-90] S-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning

【速读】:该论文旨在解决现有视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶应用中因计算开销大和多视角传感器数据整合效率低而难以实现实时部署的问题。其解决方案的关键在于设计了一种轻量级的VLM——TS-VLM,该模型引入了新颖的文本引导软排序池化(Text-Guided SoftSort Pooling, TGSSP)模块,通过输入查询的语义对多视角视觉特征进行排序与融合,实现了无需依赖昂贵注意力机制的动态、查询感知的多视角聚合,从而提升了多视角推理的上下文准确性并显著降低了计算成本。

链接: https://arxiv.org/abs/2505.12670
作者: Lihong Chen,Hossein Hassani,Soodeh Nikan
机构: Western University (西安大略大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have shown remarkable potential in advancing autonomous driving by leveraging multi-modal fusion in order to enhance scene perception, reasoning, and decision-making. Despite their potential, existing models suffer from computational overhead and inefficient integration of multi-view sensor data that make them impractical for real-time deployment in safety-critical autonomous driving applications. To address these shortcomings, this paper is devoted to designing a lightweight VLM called TS-VLM, which incorporates a novel Text-Guided SoftSort Pooling (TGSSP) module. By resorting to semantics of the input queries, TGSSP ranks and fuses visual features from multiple views, enabling dynamic and query-aware multi-view aggregation without reliance on costly attention mechanisms. This design ensures the query-adaptive prioritization of semantically related views, which leads to improved contextual accuracy in multi-view reasoning for autonomous driving. Extensive evaluations on the DriveLM benchmark demonstrate that, on the one hand, TS-VLM outperforms state-of-the-art models with a BLEU-4 score of 56.82, METEOR of 41.91, ROUGE-L of 74.64, and CIDEr of 3.39. On the other hand, TS-VLM reduces computational cost by up to 90%, where the smallest version contains only 20.1 million parameters, making it more practical for real-time deployment in autonomous vehicles.
zh

[CV-91] Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

【速读】:该论文旨在解决生成式视频模型中AI生成内容的版权保护问题,特别是针对视频生成领域中隐式水印技术研究不足的现状。其解决方案的关键在于提出Safe-Sora框架,通过将图形水印直接嵌入视频生成过程,并引入分层粗到细的自适应匹配机制,实现水印与视频内容在视觉上的高度相似性,从而提升水印的隐蔽性和鲁棒性。此外,该工作还设计了基于3D小波变换增强的Mamba架构,结合时空局部扫描策略,有效建模水印嵌入与提取过程中的长程依赖关系,为水印保护提供了新的方法路径。

链接: https://arxiv.org/abs/2505.12667
作者: Zihan Su,Xuerui Qiu,Hongbin Xu,Tangyu Jiang,Junhao Zhuang,Chun Yuan,Ming Li,Shengfeng He,Fei Richard Yu
机构: Tsinghua University (清华大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); South China University of Technology (华南理工大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济发展实验室(深圳)); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributed to our proposals. We will release our code upon publication.
zh

[CV-92] Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps

【速读】:该论文试图解决场景理解时间的图像可计算预测问题,即如何准确预测人类在不同场景下进行理解所需的时间。其解决方案的关键在于将人眼的中央凹视觉特性与视觉语言模型(VLMs)相结合,提出了一种新的图像可计算模型——福内特场景理解图(F-SUM),该模型能够根据注视位置生成场景理解的空间分布图,并计算出一个综合的F-SUM分数。这一方法通过模拟人类视觉系统与图像中任务相关视觉信息的空间分布之间的交互,有效捕捉了场景理解的复杂性。

链接: https://arxiv.org/abs/2505.12660
作者: Ziqi Wen,Jonathan Skaza,Shravan Murlidaran,William Y. Wang,Miguel P. Eckstein
机构: University of California, Santa Barbara(加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although models exist that predict human response times (RTs) in tasks such as target search and visual discrimination, the development of image-computable predictors for scene understanding time remains an open challenge. Recent advances in vision-language models (VLMs), which can generate scene descriptions for arbitrary images, combined with the availability of quantitative metrics for comparing linguistic descriptions, offer a new opportunity to model human scene understanding. We hypothesize that the primary bottleneck in human scene understanding and the driving source of variability in response times across scenes is the interaction between the foveated nature of the human visual system and the spatial distribution of task-relevant visual information within an image. Based on this assumption, we propose a novel image-computable model that integrates foveated vision with VLMs to produce a spatially resolved map of scene understanding as a function of fixation location (Foveated Scene Understanding Map, or F-SUM), along with an aggregate F-SUM score. This metric correlates with average (N=17) human RTs (r=0.47) and number of saccades (r=0.51) required to comprehend a scene (across 277 scenes). The F-SUM score also correlates with average (N=16) human description accuracy (r=-0.56) in time-limited presentations. These correlations significantly exceed those of standard image-based metrics such as clutter, visual complexity, and scene ambiguity based on language entropy. Together, our work introduces a new image-computable metric for predicting human response times in scene understanding and demonstrates the importance of foveated visual processing in shaping comprehension difficulty.
zh

[CV-93] SPKLIP: Aligning Spike Video Streams with Natural Language

【速读】:该论文旨在解决Spike视频与语言对齐(Spike-VLA)中的语义理解难题,特别是在面对Spike相机稀疏、异步输出时,传统模型如CLIP因模态不匹配而表现不佳。解决方案的关键在于提出SPKLIP,这是首个专为Spike-VLA设计的架构,其核心包括层次化脉冲特征提取器,用于自适应建模事件流中的多尺度时间动态,并通过脉冲-文本对比学习直接对齐脉冲视频与语言,从而实现有效的少样本学习。此外,SPKLIP还引入了一个全脉冲视觉编码器变体,提升了能效,展示了其在神经形态部署中的潜力。

链接: https://arxiv.org/abs/2505.12656
作者: Yongchang Gao,Meiling Jin,Zhaofei Yu,Tiejun Huang,Guozhang Chen
机构: Peking University (北京大学); University of Chinese Academy of Sciences (中国科学院大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spike cameras offer unique sensing capabilities but their sparse, asynchronous output challenges semantic understanding, especially for Spike Video-Language Alignment (Spike-VLA) where models like CLIP underperform due to modality mismatch. We introduce SPKLIP, the first architecture specifically for Spike-VLA. SPKLIP employs a hierarchical spike feature extractor that adaptively models multi-scale temporal dynamics in event streams, and uses spike-text contrastive learning to directly align spike video with language, enabling effective few-shot learning. A full-spiking visual encoder variant, integrating SNN components into our pipeline, demonstrates enhanced energy efficiency. Experiments show state-of-the-art performance on benchmark spike datasets and strong few-shot generalization on a newly contributed real-world dataset. SPKLIP’s energy efficiency highlights its potential for neuromorphic deployment, advancing event-based multimodal research. The source code and dataset are available at [link removed for anonymity].
zh

[CV-94] AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agent ic Tool Use

【速读】:该论文旨在解决基于机器学习的原子间势和力场模型因缺乏准确原子结构数据而面临的数据稀缺问题,尤其是实验解析晶体的可用性有限。其解决方案的关键在于提出AutoMat,一个端到端的、代理辅助的流程,能够自动将扫描透射电子显微镜(STEM)图像转化为原子晶体结构并预测其物理性质,核心技术包括模式自适应去噪、物理引导的模板检索、对称感知的原子重构、通过MatterSim进行的快速弛豫与性质预测,以及各阶段的协调调度。

链接: https://arxiv.org/abs/2505.12650
作者: Yaotian Yang,Yiwen Tang,Yizhe Chen,Xiao Chen,Jiangjie Qiu,Hao Xiong,Haoyu Yin,Zhiyao Luo,Yifei Zhang,Sijia Tao,Wentao Li,Qinghua Zhang,Yuqiang Li,Wanli Ouyang,Bin Zhao,Xiaonan Wang,Fei Wei
机构: Tsinghua University (清华大学); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The code and dataset are publicly available at this https URL and this https URL

点击查看摘要

Abstract:Machine learning-based interatomic potentials and force fields depend critically on accurate atomic structures, yet such data are scarce due to the limited availability of experimentally resolved crystals. Although atomic-resolution electron microscopy offers a potential source of structural data, converting these images into simulation-ready formats remains labor-intensive and error-prone, creating a bottleneck for model training and validation. We introduce AutoMat, an end-to-end, agent-assisted pipeline that automatically transforms scanning transmission electron microscopy (STEM) images into atomic crystal structures and predicts their physical properties. AutoMat combines pattern-adaptive denoising, physics-guided template retrieval, symmetry-aware atomic reconstruction, fast relaxation and property prediction via MatterSim, and coordinated orchestration across all stages. We propose the first dedicated STEM2Mat-Bench for this task and evaluate performance using lattice RMSD, formation energy MAE, and structure-matching success rate. By orchestrating external tool calls, AutoMat enables a text-only LLM to outperform vision-language models in this domain, achieving closed-loop reasoning throughout the pipeline. In large-scale experiments over 450 structure samples, AutoMat substantially outperforms existing multimodal large language models and tools. These results validate both AutoMat and STEM2Mat-Bench, marking a key step toward bridging microscopy and atomistic simulation in materials this http URL code and dataset are publicly available at this https URL and this https URL.
zh

[CV-95] Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency

【速读】:该论文试图解决代理模型集成攻击中可迁移性与资源效率之间的权衡问题(transferability and efficiency trade-off)。现有方法在使用更多代理模型以提升可迁移性的同时,会降低资源效率,而这一问题限制了攻击效果。论文提出的关键解决方案是Selective Ensemble Attack (SEA),其核心在于打破所有模型在迭代间保持相同的假设,通过动态选择多样化的模型(从易于获取的预训练模型中)来解耦迭代内和迭代间的模型关系,从而在固定迭代内模型数量以保证效率的前提下,增加迭代间模型多样性以提升可迁移性。

链接: https://arxiv.org/abs/2505.12644
作者: Bo Yang,Hengwei Zhang,Jindong Wang,Yuchen Ren,Chenhao Lin,Chao Shen,Zhengyu Zhao
机构: State Key Laboratory of Math Eng & Adv Computing, Information Engineering University, China (国家数学工程与先进计算重点实验室,信息工程大学,中国); School of Cyber Science and Engineering, Xi’an Jiaotong University, China (网络科学与工程学院,西安交通大学,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In surrogate ensemble attacks, using more surrogate models yields higher transferability but lower resource efficiency. This practical trade-off between transferability and efficiency has largely limited existing attacks despite many pre-trained models are easily accessible online. In this paper, we argue that such a trade-off is caused by an unnecessary common assumption, i.e., all models should be identical across iterations. By lifting this assumption, we can use as many surrogates as we want to unleash transferability without sacrificing efficiency. Concretely, we propose Selective Ensemble Attack (SEA), which dynamically selects diverse models (from easily accessible pre-trained models) across iterations based on our new interpretation of decoupling within-iteration and cross-iteration model this http URL this way, the number of within-iteration models is fixed for maintaining efficiency, while only cross-iteration model diversity is increased for higher transferability. Experiments on ImageNet demonstrate the superiority of SEA in various scenarios. For example, when dynamically selecting 4 from 20 accessible models, SEA yields 8.5% higher transferability than existing attacks under the same efficiency. The superiority of SEA also generalizes to real-world systems, such as commercial vision APIs and large vision-language models. Overall, SEA opens up the possibility of adaptively balancing transferability and efficiency according to specific resource requirements.
zh

[CV-96] wo out of Three (ToT): using self-consistency to make robust predictions

【速读】:该论文试图解决深度学习(Deep Learning, DL)模型在高风险领域部署时因决策不可解释而带来的潜在风险问题。其解决方案的关键在于提出一种名为“Two out of Three (ToT)”的算法,该算法通过允许模型在不确定时拒绝回答,从而提高决策的鲁棒性。ToT的核心思想是借鉴人类大脑对冲突信息的敏感性,生成两个额外的预测,并基于这些预测来决定是否提供答案。

链接: https://arxiv.org/abs/2505.12642
作者: Jung Hoon Lee,Sujith Vijayan
机构: Pacific Northwest National Laboratory (太平洋西北国家实验室); School of Neuroscience (神经科学学院); Virginia Tech (弗吉尼亚理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 main figures, 1 supplementary table and 2 supplementary figures

点击查看摘要

Abstract:Deep learning (DL) can automatically construct intelligent agents, deep neural networks (alternatively, DL models), that can outperform humans in certain tasks. However, the operating principles of DL remain poorly understood, making its decisions incomprehensible. As a result, it poses a great risk to deploy DL in high-stakes domains in which mistakes or errors may lead to critical consequences. Here, we aim to develop an algorithm that can help DL models make more robust decisions by allowing them to abstain from answering when they are uncertain. Our algorithm, named `Two out of Three (ToT)', is inspired by the sensitivity of the human brain to conflicting information. ToT creates two alternative predictions in addition to the original model prediction and uses the alternative predictions to decide whether it should provide an answer or not.
zh

[CV-97] Single Image Reflection Removal via inter-layer Complementarity

【速读】:该论文试图解决双流架构在单图像反射去除任务中未能充分挖掘层间互补性的物理建模与网络设计问题,从而限制了图像分离的质量。解决方案的关键在于提出两种针对性改进:首先,引入一种新的层间互补性模型,通过双流架构使残差层提取的低频成分与透射层交互以增强层间互补性,同时残差层的高频成分对两个流进行逆调制,提升透射层的细节质量;其次,提出一种高效的层间互补性注意力机制,通过通道级的跨重组获得具有层间互补结构的重排流,并在重排流上进行注意力计算以实现更好的层间分离,最后恢复原始流结构进行输出。

链接: https://arxiv.org/abs/2505.12641
作者: Yue Huang,Zi’ang Li,Tianle Hu,Jie Wen,Guanbin Li,Jinglin Zhang,Guoxu Zhou,Xiaozhao Fang
机构: Guangdong University of Technology (广东工业大学); Harbin Institute of Technology (哈尔滨工业大学); Sun Yat-sen University (中山大学); Shandong University (山东大学
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although dual-stream architectures have achieved remarkable success in single image reflection removal, they fail to fully exploit inter-layer complementarity in their physical modeling and network design, which limits the quality of image separation. To address this fundamental limitation, we propose two targeted improvements to enhance dual-stream architectures: First, we introduce a novel inter-layer complementarity model where low-frequency components extracted from the residual layer interact with the transmission layer through dual-stream architecture to enhance inter-layer complementarity. Meanwhile, high-frequency components from the residual layer provide inverse modulation to both streams, improving the detail quality of the transmission layer. Second, we propose an efficient inter-layer complementarity attention mechanism which first cross-reorganizes dual streams at the channel level to obtain reorganized streams with inter-layer complementary structures, then performs attention computation on the reorganized streams to achieve better inter-layer separation, and finally restores the original stream structure for output. Experimental results demonstrate that our method achieves state-of-the-art separation quality on multiple public datasets while significantly reducing both computational cost and model complexity.
zh

[CV-98] MVPainter: Accurate and Detailed 3D Texture Generation via Multi-View Diffusion with Geometric Control

【速读】:该论文旨在解决3D纹理生成(3D texture generation)在参考纹理对齐、几何-纹理一致性以及局部纹理质量三个核心维度上的不足。其解决方案的关键在于提出MVPainter,该方法通过数据过滤与增强策略提升纹理保真度和细节表现,并引入基于ControlNet的几何条件控制以改善纹理与几何结构的对齐;同时,从生成视图中提取物理基础渲染(PBR)属性,生成适用于实际渲染应用的PBR网格。

链接: https://arxiv.org/abs/2505.12635
作者: Mingqi Shao,Feng Xiong,Zhaoxu Sun,Mu Xu
机构: AMAP(AMAP)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Recently, significant advances have been made in 3D object generation. Building upon the generated geometry, current pipelines typically employ image diffusion models to generate multi-view RGB images, followed by UV texture reconstruction through texture baking. While 3D geometry generation has improved significantly, supported by multiple open-source frameworks, 3D texture generation remains underexplored. In this work, we systematically investigate 3D texture generation through the lens of three core dimensions: reference-texture alignment, geometry-texture consistency, and local texture quality. To tackle these issues, we propose MVPainter, which employs data filtering and augmentation strategies to enhance texture fidelity and detail, and introduces ControlNet-based geometric conditioning to improve texture-geometry alignment. Furthermore, we extract physically-based rendering (PBR) attributes from the generated views to produce PBR meshes suitable for real-world rendering applications. MVPainter achieves state-of-the-art results across all three dimensions, as demonstrated by human-aligned evaluations. To facilitate further research and reproducibility, we also release our full pipeline as an open-source system, including data construction, model architecture, and evaluation tools.
zh

[CV-99] Multi-Resolution Haar Network: Enhancing human motion prediction via Haar transform

【速读】:该论文旨在解决3D人体姿态预测中由于人类运动序列的任意性在时间与空间轴上的转换关系未被充分建模而导致的预测精度不足问题,特别是在复杂动作如随意姿势或问候时表现不佳。其解决方案的关键在于提出了一种名为HaarMoDic的网络结构,该结构通过2D Haar变换将关节坐标投影到更高分辨率的空间中,使网络能够同时获取时间和空间信息。其中,多分辨率Haar(MR-Haar)块是该网络的核心贡献模块,它通过将整个运动序列映射到高分辨率的混合坐标,使得网络能够在不同分辨率下充分利用两个轴的信息,从而提升预测的准确性。

链接: https://arxiv.org/abs/2505.12631
作者: Li Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The 3D human pose is vital for modern computer vision and computer graphics, and its prediction has drawn attention in recent years. 3D human pose prediction aims at forecasting a human’s future motion from the previous sequence. Ignoring that the arbitrariness of human motion sequences has a firm origin in transition in both temporal and spatial axes limits the performance of state-of-the-art methods, leading them to struggle with making precise predictions on complex cases, e.g., arbitrarily posing or greeting. To alleviate this problem, a network called HaarMoDic is proposed in this paper, which utilizes the 2D Haar transform to project joints to higher resolution coordinates where the network can access spatial and temporal information simultaneously. An ablation study proves that the significant contributing module within the HaarModic Network is the Multi-Resolution Haar (MR-Haar) block. Instead of mining in one of two axes or extracting separately, the MR-Haar block projects whole motion sequences to a mixed-up coordinate in higher resolution with 2D Haar Transform, allowing the network to give scope to information from both axes in different resolutions. With the MR-Haar block, the HaarMoDic network can make predictions referring to a broader range of information. Experimental results demonstrate that HaarMoDic surpasses state-of-the-art methods in every testing interval on the Human3.6M dataset in the Mean Per Joint Position Error (MPJPE) metric.
zh

[CV-100] Degradation-Aware Feature Perturbation for All-in-One Image Restoration CVPR2025

【速读】:该论文旨在解决全功能图像修复(all-in-one image restoration)中因退化类型和程度差异大而导致的模型泛化能力不足问题,特别是任务干扰(task interference)问题,即不同任务在共享参数下的梯度更新方向可能相互冲突。解决方案的关键在于引入了退化感知特征扰动(Degradation-aware Feature Perturbations, DFP),通过通道级扰动和注意力级扰动调整特征空间,使其与统一参数空间对齐,从而提升模型的鲁棒性和跨任务性能。为此,提出了退化引导扰动块(Degradation-Guided Perturbation Block, DGPB)来实现这两种扰动机制。

链接: https://arxiv.org/abs/2505.12630
作者: Xiangpeng Tian,Xiangyu Liao,Xiao Liu,Meng Li,Chao Ren
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2025. 8 pages, 7 figures

点击查看摘要

Abstract:All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. Nonetheless, the significant variations among degradation types present challenges for training a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, motivated by the routing strategy, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations(DFP) to adjust the feature space to align with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To achieve these goals, we propose a Degradation-Guided Perturbation Block (DGPB) to implement these two functions, positioned between the encoding and decoding stages of the encoder-decoder architecture. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. Our codes are available at this https URL.
zh

[CV-101] BusterX: MLLM -Powered AI-Generated Video Forgery Detection and Explanation

【速读】:该论文旨在解决AI生成视频的伪造检测问题,特别是针对当前缺乏大规模、高质量的AI生成视频数据集以及现有检测方法在模型决策可解释性和提供可操作性指导方面的不足。其解决方案的关键在于提出GenBuster-200K数据集,该数据集包含20万条高分辨率视频片段,涵盖多种最新的生成技术及真实场景;同时引入BusterX框架,该框架结合多模态大语言模型(MLLM)与强化学习,实现视频真实性判定及可解释性推理。

链接: https://arxiv.org/abs/2505.12620
作者: Haiquan Wen,Yiwei He,Zhenglin Huang,Tianxiao Li,Zihan YU,Xingru Huang,Lu Qi,Baoyuan Wu,Xiangtai Li,Guangliang Cheng
机构: University of Liverpool, UK; Nanyang Technological University, SG; The Chinese University of Hong Kong, Shenzhen, Guangdong, China; Wuhan University; Hangzhou Dianzi University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advances in AI generative models facilitate super-realistic video synthesis, amplifying misinformation risks via social media and eroding trust in digital content. Several research works have explored new deepfake detection methods on AI-generated images to alleviate these risks. However, with the fast development of video generation models, such as Sora and WanX, there is currently a lack of large-scale, high-quality AI-generated video datasets for forgery detection. In addition, existing detection approaches predominantly treat the task as binary classification, lacking explainability in model decision-making and failing to provide actionable insights or guidance for the public. To address these challenges, we propose \textbfGenBuster-200K, a large-scale AI-generated video dataset featuring 200K high-resolution video clips, diverse latest generative techniques, and real-world scenes. We further introduce \textbfBusterX, a novel AI-generated video detection and explanation framework leveraging multimodal large language model (MLLM) and reinforcement learning for authenticity determination and explainable rationale. To our knowledge, GenBuster-200K is the \it \textbffirst large-scale, high-quality AI-generated video dataset that incorporates the latest generative techniques for real-world scenarios. BusterX is the \it \textbffirst framework to integrate MLLM with reinforcement learning for explainable AI-generated video detection. Extensive comparisons with state-of-the-art methods and ablation studies validate the effectiveness and generalizability of BusterX. The code, models, and datasets will be released.
zh

[CV-102] Diff-MM: Exploring Pre-trained Text-to-Image Generation Model for Unified Multi-modal Object Tracking

【速读】:该论文旨在解决多模态目标跟踪中因多模态训练数据有限而导致的性能不足问题。现有方法通常从基于RGB的跟踪器出发,仅通过训练数据学习辅助模态的理解,受限于数据量,效果不理想。解决方案的关键在于利用预训练文本到图像生成模型的多模态理解能力,通过提出的并行特征提取管道,将Stable Diffusion的UNet作为跟踪特征提取器,并引入多模态子模块调优方法,以获取不同模态间的互补信息,从而实现统一参数的RGB-N/D/T/E跟踪。

链接: https://arxiv.org/abs/2505.12606
作者: Shiyu Xuan,Zechao Li,Jinhui Tang
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal object tracking integrates auxiliary modalities such as depth, thermal infrared, event flow, and language to provide additional information beyond RGB images, showing great potential in improving tracking stabilization in complex scenarios. Existing methods typically start from an RGB-based tracker and learn to understand auxiliary modalities only from training data. Constrained by the limited multi-modal training data, the performance of these methods is unsatisfactory. To alleviate this limitation, this work proposes a unified multi-modal tracker Diff-MM by exploiting the multi-modal understanding capability of the pre-trained text-to-image generation model. Diff-MM leverages the UNet of pre-trained Stable Diffusion as a tracking feature extractor through the proposed parallel feature extraction pipeline, which enables pairwise image inputs for object tracking. We further introduce a multi-modal sub-module tuning method that learns to gain complementary information between different modalities. By harnessing the extensive prior knowledge in the generation model, we achieve a unified tracker with uniform parameters for RGB-N/D/T/E tracking. Experimental results demonstrate the promising performance of our method compared with recently proposed trackers, e.g., its AUC outperforms OneTracker by 8.3% on TNL2K.
zh

[CV-103] mporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding

【速读】:该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在视频理解任务中依赖隐式时间理解能力而未能明确关键影响因素的问题,这可能限制了其在视频理解中的潜力。解决方案的关键在于通过实证研究揭示了视觉编码器与大语言模型之间中间接口对时间理解的重要影响,并提出了一个面向时间的方案,包括面向时间的训练策略和扩展的接口,从而显著提升了LVLMs在标准视频理解任务中的性能。

链接: https://arxiv.org/abs/2505.12605
作者: Thong Nguyen,Zhiyuan Hu,Xu Lin,Cong-Duy Nguyen,See-Kiong Ng,Luu Anh Tuan
机构: National University of Singapore(新加坡国立大学); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: In Progress

点击查看摘要

Abstract:Recent years have witnessed outstanding advances of large vision-language models (LVLMs). In order to tackle video understanding, most of them depend upon their implicit temporal understanding capacity. As such, they have not deciphered important components that contribute to temporal understanding ability, which might limit the potential of these LVLMs for video understanding. In this work, we conduct a thorough empirical study to demystify crucial components that influence the temporal understanding of LVLMs. Our empirical study reveals that significant impacts are centered around the intermediate interface between the visual encoder and the large language model. Building on these insights, we propose a temporal-oriented recipe that encompasses temporal-oriented training schemes and an upscaled interface. Our final model developed using our recipe significantly enhances previous LVLMs on standard video understanding tasks.
zh

[CV-104] Learning Cross-Spectral Point Features with Task-Oriented Training ICRA’25

【速读】:该论文试图解决在低可见度条件下,基于可见光谱的无人机(UAV)导航系统性能下降的问题,其核心是将热成像(thermal imagery)有效集成到现有的视觉导航系统中。解决方案的关键在于利用学习的跨光谱(thermal-visible)点特征,通过在匹配和配准任务上训练特征网络,而非仅关注热成像与可见光图像在外观上的相似区域,从而更充分地利用可用数据。

链接: https://arxiv.org/abs/2505.12593
作者: Mia Thomas,Trevor Ablett,Jonathan Kelly
机构: STARS Laboratory at the University of Toronto Institute for Aerospace Studies (STARS 实验室 at 多伦多大学航空航天研究院); Vector Institute for Artificial Intelligence (Vector 人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Proceedings of the {IEEE} International Conference on Robotics and Automation {(ICRA’25)} Thermal Infrared in Robotics (TIRO) Workshop, Atlanta, Georgia, USA, May 19, 2025

点击查看摘要

Abstract:Unmanned aerial vehicles (UAVs) enable operations in remote and hazardous environments, yet the visible-spectrum, camera-based navigation systems often relied upon by UAVs struggle in low-visibility conditions. Thermal cameras, which capture long-wave infrared radiation, are able to function effectively in darkness and smoke, where visible-light cameras fail. This work explores learned cross-spectral (thermal-visible) point features as a means to integrate thermal imagery into established camera-based navigation systems. Existing methods typically train a feature network’s detection and description outputs directly, which often focuses training on image regions where thermal and visible-spectrum images exhibit similar appearance. Aiming to more fully utilize the available data, we propose a method to train the feature network on the tasks of matching and registration. We run our feature network on thermal-visible image pairs, then feed the network response into a differentiable registration pipeline. Losses are applied to the matching and registration estimates of this pipeline. Our selected model, trained on the task of matching, achieves a registration error (corner error) below 10 pixels for more than 75% of estimates on the MultiPoint dataset. We further demonstrate that our model can also be used with a classical pipeline for matching and registration.
zh

[CV-105] SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models

【速读】:该论文旨在解决监控视频内容理解这一在视觉-语言研究中关键但研究不足的问题,尤其针对其实时性、事件动态的不规则性以及安全相关的影响。其解决方案的关键在于构建了SurveillanceVQA-589K,这是针对监控领域最大的开放性视频问答基准数据集,包含589,380个涵盖12种认知多样问题类型的问答对。为大规模构建该基准,作者设计了一种混合标注流程,结合时间对齐的人工撰写描述与基于提示技术的大型视觉-语言模型辅助问答生成,并提出了多维评估协议以评估上下文、时间和因果理解能力。

链接: https://arxiv.org/abs/2505.12589
作者: Bo Liu,Pengfei Qiao,Minhan Ma,Xuange Zhang,Yinan Tang,Peng Xu,Kun Liu,Tongtong Yuan
机构: Beijing University of Technology (北京工业大学); Inspur Electronic Information Industry Co., Ltd (浪潮电子信息产业股份有限公司); Tsinghua University (清华大学); JD Explore Academy (京东探索研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The dataset and code are publicly available at: this https URL

点击查看摘要

Abstract:Understanding surveillance video content remains a critical yet underexplored challenge in vision-language research, particularly due to its real-world complexity, irregular event dynamics, and safety-critical implications. In this work, we introduce SurveillanceVQA-589K, the largest open-ended video question answering benchmark tailored to the surveillance domain. The dataset comprises 589,380 QA pairs spanning 12 cognitively diverse question types, including temporal reasoning, causal inference, spatial understanding, and anomaly interpretation, across both normal and abnormal video scenarios. To construct the benchmark at scale, we design a hybrid annotation pipeline that combines temporally aligned human-written captions with Large Vision-Language Model-assisted QA generation using prompt-based techniques. We also propose a multi-dimensional evaluation protocol to assess contextual, temporal, and causal comprehension. We evaluate eight LVLMs under this framework, revealing significant performance gaps, especially in causal and anomaly-related tasks, underscoring the limitations of current models in real-world surveillance contexts. Our benchmark provides a practical and comprehensive resource for advancing video-language understanding in safety-critical applications such as intelligent monitoring, incident analysis, and autonomous decision-making.
zh

[CV-106] Event-based Star Tracking under Spacecraft Jitter: the e-STURT Dataset

【速读】:该论文旨在解决空间飞行器在光学通信、地球观测和空间域感知中因抖动(jitter)导致的精细指向能力下降问题。解决方案的关键在于提出了一种基于事件相机(event camera)的抖动数据集——e-STURT,该数据集通过专用硬件模拟真实的机载抖动条件,利用压电致动器引入系统性且可重复的抖动,并通过事件相机获取高时间分辨率的星体观测数据。该数据集为开发面向任务关键型事件驱动空间感知应用的抖动感知算法提供了基础。

链接: https://arxiv.org/abs/2505.12588
作者: Samya Bagchi,Peter Anastasiou,Matthew Tetlow,Tat-Jun Chin,Yasir Latif
机构: Australian Institute for Machine Learning (澳大利亚机器学习研究所); Inovor Technologies (Inovor Technologies)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Jitter degrades a spacecraft’s fine-pointing ability required for optical communication, earth observation, and space domain awareness. Development of jitter estimation and compensation algorithms requires high-fidelity sensor observations representative of on-board jitter. In this work, we present the Event-based Star Tracking Under Jitter (e-STURT) dataset – the first event camera based dataset of star observations under controlled jitter conditions. Specialized hardware employed for the dataset emulates an event-camera undergoing on-board jitter. While the event camera provides asynchronous, high temporal resolution star observations, systematic and repeatable jitter is introduced using a micrometer accurate piezoelectric actuator. Various jitter sources are simulated using distinct frequency bands and utilizing both axes of motion. Ground-truth jitter is captured in hardware from the piezoelectric actuator. The resulting dataset consists of 200 sequences and is made publicly available. This work highlights the dataset generation process, technical challenges and the resulting limitations. To serve as a baseline, we propose a high-frequency jitter estimation algorithm that operates directly on the event stream. The e-STURT dataset will enable the development of jitter aware algorithms for mission critical event-based space sensing applications.
zh

[CV-107] An approach based on class activation maps for investigating the effects of data augmentation on neural networks for image classification

【速读】:该论文试图解决数据增强方法在卷积神经网络(Convolutional Neural Networks, CNNs)图像分类任务中对模型所学习特征模式的影响缺乏系统性定量分析的问题。解决方案的关键在于提出一种基于类别激活图(Class Activation Maps, CAMs)的分析方法,通过提取不同数据增强策略下模型生成的激活图之间的相似性和差异性度量,实现对数据增强效果的量化评估。

链接: https://arxiv.org/abs/2505.12581
作者: Lucas M. Dorneles,Luan Fonseca Garcia,Joel Luís Carbonera
机构: Federal University of Rio Grande do Sul (南里奥格兰德州联邦大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural networks have become increasingly popular in the last few years as an effective tool for the task of image classification due to the impressive performance they have achieved on this task. In image classification tasks, it is common to use data augmentation strategies to increase the robustness of trained networks to changes in the input images and to avoid overfitting. Although data augmentation is a widely adopted technique, the literature lacks a body of research analyzing the effects data augmentation methods have on the patterns learned by neural network models working on complex datasets. The primary objective of this work is to propose a methodology and set of metrics that may allow a quantitative approach to analyzing the effects of data augmentation in convolutional networks applied to image classification. An important tool used in the proposed approach lies in the concept of class activation maps for said models, which allow us to identify and measure the importance these models assign to each individual pixel in an image when executing the classification task. From these maps, we may then extract metrics over the similarities and differences between maps generated by these models trained on a given dataset with different data augmentation strategies. Experiments made using this methodology suggest that the effects of these data augmentation techniques not only can be analyzed in this way but also allow us to identify different impact profiles over the trained models.
zh

[CV-108] Coarse Attribute Prediction with Task Agnostic Distillation for Real World Clothes Changing ReID

【速读】:该论文旨在解决真实世界中衣物变化重识别(Clothes Changing Re-IDentification, CC-ReID)任务中由于低质量(Low-Quality, LQ)图像带来的挑战。现有方法在高质量(High-Quality, HQ)图像上表现良好,但在LQ图像中因像素化、失焦模糊和运动模糊等伪影导致外部生物特征属性(如姿态、体型等)和模型内部特征表示均受到干扰,进而影响识别性能。论文提出的解决方案关键在于一种名为RLQ的框架,其核心包括粗粒度属性预测(Coarse Attributes Prediction, CAP)和任务无关蒸馏(Task Agnostic Distillation, TAD),通过交替训练机制提升模型对LQ图像的鲁棒性。CAP通过粗略预测增强模型对外部细粒度属性的感知能力,而TAD则通过跨数据集的任务无关自监督与蒸馏优化内部特征表示,从而提升模型在真实场景下的性能。

链接: https://arxiv.org/abs/2505.12580
作者: Priyank Pathak,Yogesh S Rawat
机构: University of Central Florida (佛罗里达中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work focuses on Clothes Changing Re-IDentification (CC-ReID) for the real world. Existing works perform well with high-quality (HQ) images, but struggle with low-quality (LQ) where we can have artifacts like pixelation, out-of-focus blur, and motion blur. These artifacts introduce noise to not only external biometric attributes (e.g. pose, body shape, etc.) but also corrupt the model’s internal feature representation. Models usually cluster LQ image features together, making it difficult to distinguish between them, leading to incorrect matches. We propose a novel framework Robustness against Low-Quality (RLQ) to improve CC-ReID model on real-world data. RLQ relies on Coarse Attributes Prediction (CAP) and Task Agnostic Distillation (TAD) operating in alternate steps in a novel training mechanism. CAP enriches the model with external fine-grained attributes via coarse predictions, thereby reducing the effect of noisy inputs. On the other hand, TAD enhances the model’s internal feature representation by bridging the gap between HQ and LQ features, via an external dataset through task-agnostic self-supervision and distillation. RLQ outperforms the existing approaches by 1.6%-2.9% Top-1 on real-world datasets like LaST, and DeepChange, while showing consistent improvement of 5.3%-6% Top-1 on PRCC with competitive performance on LTCC. The code will be made public soon.
zh

[CV-109] VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

【速读】:该论文试图解决在仅使用未校准单目相机的情况下,构建高精度密集RGB SLAM系统的问题,特别是在传统方法依赖相似性变换(即平移、旋转和尺度)进行子图对齐时存在的不足。解决方案的关键在于重新审视重建模糊性问题,即在无相机运动或场景结构假设的前提下,场景只能以15自由度的射影变换进行重建。为此,作者提出在SL(4)流形上进行优化,从而估计序列子图之间的15自由度单应性变换,并考虑潜在的闭环约束,实现了跨子图的一致场景重建。

链接: https://arxiv.org/abs/2505.12549
作者: Dominic Maggio,Hyungtae Lim,Luca Carlone
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.
zh

[CV-110] ProMi: An Efficient Prototype-Mixture Baseline for Few-Shot Segmentation with Bounding-Box Annotations

【速读】:该论文旨在解决机器人应用中少样本分割(few-shot segmentation)的问题,即在仅有少量标注数据的情况下,使机器人能够完成复杂任务并适应多样化的现实环境。传统方法依赖于像素级标注,但这种标注方式耗时且成本高昂。本文提出的解决方案关键在于使用边界框(bounding-box)标注替代像素级标签,通过ProMi方法——一种基于原型混合的高效分割方法,将背景类视为分布的混合体,从而实现无需训练、简单有效的少样本分割。该方法能够灵活处理粗略标注,并在多个数据集上取得了显著优于现有基线的结果。

链接: https://arxiv.org/abs/2505.12547
作者: Florent Chiaroni,Ali Ayub,Ola Ahmad
机构: Thales(泰雷兹); Concordia University(康考迪亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:In robotics applications, few-shot segmentation is crucial because it allows robots to perform complex tasks with minimal training data, facilitating their adaptation to diverse, real-world environments. However, pixel-level annotations of even small amount of images is highly time-consuming and costly. In this paper, we present a novel few-shot binary segmentation method based on bounding-box annotations instead of pixel-level labels. We introduce, ProMi, an efficient prototype-mixture-based method that treats the background class as a mixture of distributions. Our approach is simple, training-free, and effective, accommodating coarse annotations with ease. Compared to existing baselines, ProMi achieves the best results across different datasets with significant gains, demonstrating its effectiveness. Furthermore, we present qualitative experiments tailored to real-world mobile robot tasks, demonstrating the applicability of our approach in such scenarios. Our code: this https URL.
zh

[CV-111] Exploring Sparsity for Parameter Efficient Fine Tuning Using Wavelets

【速读】:该论文试图解决在计算和内存预算受限的情况下,如何高效适应大型基础模型的问题,特别是针对参数效率有限的参数高效微调(PEFT)方法如LoRA在少量参数情形下的粒度和效果不足。解决方案的关键在于提出一种新的PEFT方法——小波微调(Wavelet Fine-Tuning, WaveFT),该方法在残差矩阵的小波域中学习高度稀疏的更新,从而实现对可训练参数的精确控制,具备细粒度的容量调整能力,并在参数数量显著低于LoRA的情况下表现出色。

链接: https://arxiv.org/abs/2505.12532
作者: Ahmet Bilican,M. Akın Yılmaz,A. Murat Tekalp,R. Gökberk Cinbiş
机构: Koç University (科克大学); Codeway AI Research (科德韦人工智能研究); Middle East Technical University (中东技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Efficiently adapting large foundation models is critical, especially with tight compute and memory budgets. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA offer limited granularity and effectiveness in few-parameter regimes. We propose Wavelet Fine-Tuning (WaveFT), a novel PEFT method that learns highly sparse updates in the wavelet domain of residual matrices. WaveFT allows precise control of trainable parameters, offering fine-grained capacity adjustment and excelling with remarkably low parameter count, potentially far fewer than LoRA’s minimum – ideal for extreme parameter-efficient scenarios. In order to demonstrate the effect of the wavelet transform, we compare WaveFT with a special case, called SHiRA, that entails applying sparse updates directly in the weight domain. Evaluated on personalized text-to-image generation using Stable Diffusion XL as baseline, WaveFT significantly outperforms LoRA and other PEFT methods, especially at low parameter counts; achieving superior subject fidelity, prompt alignment, and image diversity.
zh

[CV-112] GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification

【速读】:该论文旨在解决全球树种遥感分类中因缺乏大规模标注数据集而导致的进展受限问题。其关键解决方案是构建了GlobalGeoTree数据集,该数据集包含630万条地理定位的树种出现记录,并与Sentinel-2影像时间序列及27个辅助环境变量相匹配,为模型预训练和评估提供了丰富的多模态数据支持。此外,研究还提出了GeoTreeCLIP模型,通过在视觉-语言框架中利用遥感数据与分类学文本标签的配对信息,提升了零样本和少样本分类性能。

链接: https://arxiv.org/abs/2505.12513
作者: Yang Mu,Zhitong Xiong,Yi Wang,Muhammad Shahzad,Franz Essl,Mark van Kleunen,Xiao Xiang Zhu
机构: Technical University of Munich (慕尼黑工业大学); University of Vienna (维也纳大学); University of Konstanz (康斯坦茨大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Global tree species mapping using remote sensing data is vital for biodiversity monitoring, forest management, and ecological research. However, progress in this field has been constrained by the scarcity of large-scale, labeled datasets. To address this, we introduce GlobalGeoTree, a comprehensive global dataset for tree species classification. GlobalGeoTree comprises 6.3 million geolocated tree occurrences, spanning 275 families, 2,734 genera, and 21,001 species across the hierarchical taxonomic levels. Each sample is paired with Sentinel-2 image time series and 27 auxiliary environmental variables, encompassing bioclimatic, geographic, and soil data. The dataset is partitioned into GlobalGeoTree-6M for model pretraining and curated evaluation subsets, primarily GlobalGeoTree-10kEval for zero-shot and few-shot benchmarking. To demonstrate the utility of the dataset, we introduce a baseline model, GeoTreeCLIP, which leverages paired remote sensing data and taxonomic text labels within a vision-language framework pretrained on GlobalGeoTree-6M. Experimental results show that GeoTreeCLIP achieves substantial improvements in zero- and few-shot classification on GlobalGeoTree-10kEval over existing advanced models. By making the dataset, models, and code publicly available, we aim to establish a benchmark to advance tree species classification and foster innovation in biodiversity research and ecological applications.
zh

[CV-113] Scalable Strategies for Continual Learning with Replay

【速读】:该论文试图解决持续学习(continual learning)中的可扩展性问题,特别是在面对多个顺序任务时,如何高效地整合新知识与旧知识。其解决方案的关键在于引入一种称为“巩固”(consolidation)的阶段性重放方法,以及一种适用于持续学习场景的序列合并(sequential merging)技术。这些方法共同构成了一个高度可扩展的工具集,能够有效减少重放样本需求并提升模型性能。

链接: https://arxiv.org/abs/2505.12512
作者: Truman Hickok
机构: Southwest Research Institute (西南研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Future deep learning models will be distinguished by systems that perpetually learn through interaction, imagination, and cooperation, blurring the line between training and inference. This makes continual learning a critical challenge, as methods that efficiently maximize bidirectional transfer across learning trajectories will be essential. Replay is on track to play a foundational role in continual learning, allowing models to directly reconcile new information with past knowledge. In practice, however, replay is quite unscalable, doubling the cost of continual learning when applied naively. Moreover, the continual learning literature has not fully synchronized with the multi-task fine-tuning literature, having not fully integrated highly scalable techniques like model merging and low rank adaptation into a replay-enabled toolset that can produce a unified model in the face of many sequential tasks. In this paper, we begin by applying and analyzing low rank adaptation in a continual learning setting. Next, we introduce consolidation, a phasic approach to replay which leads to up to 55% less replay samples being needed for a given performance target. Then, we propose sequential merging, an offshoot of task arithmetic which is tailored to the continual learning setting and is shown to work well in combination with replay. Finally, we demonstrate that the developed strategies can operate synergistically, resulting in a highly scalable toolset that outperforms standalone variants.
zh

[CV-114] Rebalancing Contrastive Alignment with Learnable Semantic Gaps in Text-Video Retrieval

【速读】:该论文旨在解决文本-视频检索中由于模态间隙(modality gap)和批次采样中的虚假负例导致的对比学习框架中的优化矛盾问题,这些问题引发了InfoNCE损失下的冲突梯度,阻碍了稳定对齐。其解决方案的关键在于提出GARE(Gap-Aware Retrieval)框架,通过引入一个可学习的、成对的增量项Δ_ij,将张力从全局锚点表示中卸载,从而缓解梯度冲突。该增量项通过在信任域约束下对InfoNCE损失进行耦合多元一阶泰勒近似推导得出,并通过轻量级神经模块实现,该模块基于视频-文本对的语义差距进行结构感知校正,同时结合信任域约束、方向多样性项和信息瓶颈进行正则化以提升学习稳定性与可解释性。

链接: https://arxiv.org/abs/2505.12499
作者: Jian Xiao,Zijie Song,Jialong Hu,Hao Cheng,Zhenzhen Hu,Jia Li,Richang Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent advances in text-video retrieval have been largely driven by contrastive learning frameworks. However, existing methods overlook a key source of optimization tension: the separation between text and video distributions in the representation space (referred to as the modality gap), and the prevalence of false negatives in batch sampling. These factors lead to conflicting gradients under the InfoNCE loss, impeding stable alignment. To mitigate this, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment Delta_ij between text t_i and video v_j to offload the tension from the global anchor representation. We first derive the ideal form of Delta_ij via a coupled multivariate first-order Taylor approximation of the InfoNCE loss under a trust-region constraint, revealing it as a mechanism for resolving gradient conflicts by guiding updates along a locally optimal descent direction. Due to the high cost of directly computing Delta_ij, we introduce a lightweight neural module conditioned on the semantic gap between each video-text pair, enabling structure-aware correction guided by gradient supervision. To further stabilize learning and promote interpretability, we regularize Delta using three components: a trust-region constraint to prevent oscillation, a directional diversity term to promote semantic coverage, and an information bottleneck to limit redundancy. Experiments across four retrieval benchmarks show that GARE consistently improves alignment accuracy and robustness to noisy supervision, confirming the effectiveness of gap-aware tension mitigation.
zh

[CV-115] Video-GPT via Next Clip Diffusion

【速读】:该论文旨在解决传统自然语言处理模型在描述视觉世界中时空细节方面的不足,通过将视频视为新的语言来实现视觉世界建模。其解决方案的关键在于提出了一种名为Video-GPT的简洁模型,并引入了一种新颖的下一片段扩散(next clip diffusion)预训练范式,该范式通过自回归地去噪噪声片段并依据历史中的干净片段进行预测,使模型能够同时处理短期生成和长期预测任务。

链接: https://arxiv.org/abs/2505.12489
作者: Shaobin Zhuang,Zhipeng Huang,Ying Zhang,Fangyikang Wang,Canmiao Fu,Binxin Yang,Chong Sun,Chen Li,Yali Wang
机构: Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学); WeChat Vision, Tencent Inc. (微信视觉,腾讯公司); Zhejiang University (浙江大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Shanghai AI Laboratory (上海人工智能实验室
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 12 figures, 18 tables

点击查看摘要

Abstract:GPT has shown its remarkable success in natural language processing. However, the language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, the video sequence is good at capturing such details. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Different from the previous works, this distinct paradigm allows Video-GPT to tackle both short-term generation and long-term prediction, by autoregressively denoising the noisy clip according to the clean clips in the history. Extensive experiments show our Video-GPT achieves the state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it can be well adapted on 6 mainstream video tasks in both video generation and understanding, showing its great generalization capacity in downstream. The project page is at this https URL.
zh

[CV-116] Guiding Diffusion with Deep Geometric Moments: Balancing Fidelity and Variation CVPR

【速读】:该论文试图解决文本到图像生成模型在输出中缺乏细粒度控制的问题,现有引导方法如分割图和深度图引入了空间刚性,限制了扩散模型的固有多样性。论文提出的解决方案是引入深度几何矩(Deep Geometric Moments, DGM),其关键在于通过学习的几何先验捕捉主体的视觉特征和细节,相较于DINO或CLIP特征更关注主体本身而非全局图像特征,且相比ResNets对像素级扰动更鲁棒,从而在扩散图像生成中实现控制与多样性的有效平衡。

链接: https://arxiv.org/abs/2505.12486
作者: Sangmin Jung,Utkarsh Nath,Yezhou Yang,Giulia Pedrielli,Joydeep Biswas,Amy Zhang,Hassan Ghasemzadeh,Pavan Turaga
机构: Arizona State University (亚利桑那州立大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR Workshop GMCV 2025

点击查看摘要

Abstract:Text-to-image generation models have achieved remarkable capabilities in synthesizing images, but often struggle to provide fine-grained control over the output. Existing guidance approaches, such as segmentation maps and depth maps, introduce spatial rigidity that restricts the inherent diversity of diffusion models. In this work, we introduce Deep Geometric Moments (DGM) as a novel form of guidance that encapsulates the subject’s visual features and nuances through a learned geometric prior. DGMs focus specifically on the subject itself compared to DINO or CLIP features, which suffer from overemphasis on global image features or semantics. Unlike ResNets, which are sensitive to pixel-wise perturbations, DGMs rely on robust geometric moments. Our experiments demonstrate that DGM effectively balance control and diversity in diffusion-based image generation, allowing a flexible control mechanism for steering the diffusion process.
zh

[CV-117] Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)少样本分类中因标注样本稀缺而导致的性能受限问题。其解决方案的关键在于提出一种名为Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification (S4L-FSC)的方法,通过结合异构与同构数据源进行光谱-空间预训练,分别利用旋转-镜像自监督学习(RM-SSL)和掩码重构自监督学习(MR-SSL)来增强模型对HSI空间几何多样性和光谱先验知识的学习能力,从而提升少样本分类的性能。

链接: https://arxiv.org/abs/2505.12482
作者: Wenchen Chen,Yanmei Zhang,Zhongwei Xiao,Jianping Chu,Xingbo Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Few-shot classification of hyperspectral images (HSI) faces the challenge of scarce labeled samples. Self-Supervised learning (SSL) and Few-Shot Learning (FSL) offer promising avenues to address this issue. However, existing methods often struggle to adapt to the spatial geometric diversity of HSIs and lack sufficient spectral prior knowledge. To tackle these challenges, we propose a method, Spectral-Spatial Self-Supervised Learning for Few-Shot Hyperspectral Image Classification (S4L-FSC), aimed at improving the performance of few-shot HSI classification. Specifically, we first leverage heterogeneous datasets to pretrain a spatial feature extractor using a designed Rotation-Mirror Self-Supervised Learning (RM-SSL) method, combined with FSL. This approach enables the model to learn the spatial geometric diversity of HSIs using rotation and mirroring labels as supervisory signals, while acquiring transferable spatial meta-knowledge through few-shot learning. Subsequently, homogeneous datasets are utilized to pretrain a spectral feature extractor via a combination of FSL and Masked Reconstruction Self-Supervised Learning (MR-SSL). The model learns to reconstruct original spectral information from randomly masked spectral vectors, inferring spectral dependencies. In parallel, FSL guides the model to extract pixel-level discriminative features, thereby embedding rich spectral priors into the model. This spectral-spatial pretraining method, along with the integration of knowledge from heterogeneous and homogeneous sources, significantly enhances model performance. Extensive experiments on four HSI datasets demonstrate the effectiveness and superiority of the proposed S4L-FSC approach for few-shot HSI classification.
zh

[CV-118] Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning

【速读】:该论文试图解决自监督学习(Self Supervised Learning, SSL)中重建(Reconstruction)与联合嵌入(Joint Embedding)两种范式之间的选择问题,即缺乏明确的指导原则来判断在何种场景下应采用哪种方法。解决方案的关键在于通过闭式解分析两种方法的核心机制,揭示视图生成过程(如数据增强)对学习表征的影响,并证明在样本量增加时,两种SSL范式均需在增强操作与无关特征之间保持最小对齐以达到渐近最优。研究进一步表明,在无关特征幅度较大的情况下,联合嵌入方法因其对齐条件更弱而更具优势。

链接: https://arxiv.org/abs/2505.12477
作者: Hugues Van Assel,Mark Ibrahim,Tommaso Biancalani,Aviv Regev,Randall Balestriero
机构: Genentech(基因技术公司); Meta AI(元人工智能); FAIR(脸书人工智能实验室); Brown University(布朗大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 9 figures

点击查看摘要

Abstract:Reconstruction and joint embedding have emerged as two leading paradigms in Self Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction based methods. These results not only clarify the trade offs between the two paradigms but also substantiate the empirical success of joint embedding approaches on real world challenging datasets.
zh

[CV-119] SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

【速读】:该论文旨在解决视觉语言模型(Visual-Language Models, VLMs)在多模态任务中对RGB输入的依赖所导致的空间理解不精确的问题。现有方法在整合空间线索(如点云或深度信息)时,要么需要专用传感器,要么无法有效利用深度信息进行高级推理。论文提出的解决方案关键在于提出一种名为SSR(Spatial Sense and Reasoning)的新框架,该框架将原始深度数据转换为结构化、可解释的文本推理过程,作为有意义的中间表示以显著提升空间推理能力,并通过知识蒸馏将生成的推理过程压缩为紧凑的潜在嵌入,从而实现资源高效且无需重新训练即可集成到现有VLMs中的目标。

链接: https://arxiv.org/abs/2505.12448
作者: Yang Liu,Ming Ma,Xiaomin Yu,Pengxiang Ding,Han Zhao,Mingyang Sun,Siteng Huang,Donglin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose a novel Spatial Sense and Reasoning method, dubbed SSR, a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations to significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which facilitate resource-efficient and plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce a new dataset named SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding. Our project page is at this https URL.
zh

[CV-120] VideoRFT: Incentivizing Video Reasoning Capability in MLLM s via Reinforced Fine-Tuning

【速读】:该论文试图解决在视频推理任务中,大型多模态语言模型(Multimodal Large Language Models, MLLMs)缺乏人类级推理能力的问题,尤其是在处理视频数据中复杂的逻辑、时间及因果结构方面。解决方案的关键在于提出一种名为VIDEORFT的新方法,该方法通过扩展强化学习微调(Reinforcement Fine-Tuning, RFT)范式,结合监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL),并构建一个全自动的链式思维(Chain-of-Thought, CoT)数据生成流程,以提升模型对视频内容的理解和推理能力。

链接: https://arxiv.org/abs/2505.12434
作者: Qi Wang,Yanrui Yu,Ye Yuan,Rui Mao,Tianfei Zhou
机构: Beijing Institute of Technology (北京理工大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Reinforcement fine-tuning (RFT) has shown great promise in achieving humanlevel reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VIDEORFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VIDEORFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a fully automatic CoT curation pipeline. First, we devise a cognitioninspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a visual-language model conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets - VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strength the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning with visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VIDEORFT achieves state-of-the-art performance on six video reasoning benchmarks.
zh

[CV-121] SRLoRA: Subspace Recomposition in Low-Rank Adaptation via Importance-Based Fusion and Reinitialization

【速读】:该论文旨在解决低秩适应(Low-Rank Adaptation, LoRA)方法在参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)中因固定低秩子空间更新而导致的表达能力受限和下游任务性能下降的问题。其解决方案的关键在于引入基于重要性融合与重初始化的子空间重构机制(Subspace Recomposition in Low-Rank Adaptation, SRLoRA),通过动态重组低秩子空间,将不重要的LoRA对融合到冻结主干网络中,并在未使用的主成分方向上重新初始化新的LoRA对,从而实现持续的子空间刷新与更丰富的适应能力,同时保持参数量不变。

链接: https://arxiv.org/abs/2505.12433
作者: Haodong Yang,Lei Wang,Md Zakir Hossain
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Research report

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method that injects two trainable low-rank matrices (A and B) into frozen pretrained models. While efficient, LoRA constrains updates to a fixed low-rank subspace (Delta W = BA), which can limit representational capacity and hinder downstream performance. We introduce Subspace Recomposition in Low-Rank Adaptation (SRLoRA) via importance-based fusion and reinitialization, a novel approach that enhances LoRA’s expressiveness without compromising its lightweight structure. SRLoRA assigns importance scores to each LoRA pair (a column of B and the corresponding row of A), and dynamically recomposes the subspace during training. Less important pairs are fused into the frozen backbone, freeing capacity to reinitialize new pairs along unused principal directions derived from the pretrained weight’s singular value decomposition. This mechanism enables continual subspace refreshment and richer adaptation over time, without increasing the number of trainable parameters. We evaluate SRLoRA on both language and vision tasks, including the GLUE benchmark and various image classification datasets. SRLoRA consistently achieves faster convergence and improved accuracy over standard LoRA, demonstrating its generality, efficiency, and potential for broader PEFT applications.
zh

[CV-122] Observe-R1: Unlocking Reasoning Abilities of MLLM s with Dynamic Progressive Reinforcement Learning

【速读】:该论文旨在解决如何提升多模态大语言模型(Multimodal Large Language Models, MLLMs)的推理能力,特别是在适应强化学习(Reinforcement Learning, RL)过程中面临的多模态数据与格式的特定挑战。其解决方案的关键在于提出一种渐进式学习范式,通过构建NeuraLadder数据集,按照数据样本的难度和复杂度进行组织与采样,以支持RL训练;同时引入多模态格式约束、奖励机制和动态权重分配策略,从而增强模型对视觉信息的感知能力,并优化推理过程的清晰度与简洁性。

链接: https://arxiv.org/abs/2505.12432
作者: Zirun Guo,Minjie Hong,Tao Jin
机构: Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has shown promise in improving the reasoning abilities of Large Language Models (LLMs). However, the specific challenges of adapting RL to multimodal data and formats remain relatively unexplored. In this work, we present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We draw inspirations from human learning progression–from simple to complex and easy to difficult, and propose a gradual learning paradigm for MLLMs. To this end, we construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. To tackle multimodal tasks, we introduce a multimodal format constraint that encourages careful observation of images, resulting in enhanced visual abilities and clearer and more structured responses. Additionally, we implement a bonus reward system that favors concise, correct answers within a length constraint, alongside a dynamic weighting mechanism that prioritizes uncertain and medium-difficulty problems, ensuring that more informative samples have a greater impact on training. Our experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks, achieving superior clarity and conciseness in reasoning chains. Ablation studies validate the effectiveness of our strategies, highlighting the robustness and generalization of our approach. The dataset and code will be released at this https URL.
zh

[CV-123] DPCD: A Quality Assessment Database for Dynamic Point Clouds

【速读】:该论文试图解决动态点云质量评估(Dynamic Point Cloud Quality Assessment, DPCQA)缺乏系统性研究的问题,尤其是针对动态点云在实际应用中如帧间压缩和传输等质量导向场景的评估需求。解决方案的关键在于构建一个大规模的DPCQA数据库——DPCD,该数据库包含15个参考动态点云和从七种有损压缩及噪声失真类型生成的525个失真动态点云,并通过渲染为处理视频序列(Processed Video Sequences, PVS)进行主观实验,获取平均意见得分(Mean Opinion Scores, MOS)以验证数据库的异质性和可靠性。此外,该研究还评估了多种客观指标在DPCD上的性能,揭示了DPCQA相较于静态点云质量评估的挑战性。

链接: https://arxiv.org/abs/2505.12431
作者: Yating Liu,Yujie Zhang,Qi Yang,Yiling Xu,Zhu Li,Ye-Kui Wang
机构: Shanghai Jiao Tong University (上海交通大学); University of Missouri-Kansas City (密苏里大学堪萨斯城分校); Bytedance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Recently, the advancements in Virtual/Augmented Reality (VR/AR) have driven the demand for Dynamic Point Clouds (DPC). Unlike static point clouds, DPCs are capable of capturing temporal changes within objects or scenes, offering a more accurate simulation of the real world. While significant progress has been made in the quality assessment research of static point cloud, little study has been done on Dynamic Point Cloud Quality Assessment (DPCQA), which hinders the development of quality-oriented applications, such as interframe compression and transmission in practical scenarios. In this paper, we introduce a large-scale DPCQA database, named DPCD, which includes 15 reference DPCs and 525 distorted DPCs from seven types of lossy compression and noise distortion. By rendering these samples to Processed Video Sequences (PVS), a comprehensive subjective experiment is conducted to obtain Mean Opinion Scores (MOS) from 21 viewers for analysis. The characteristic of contents, impact of various distortions, and accuracy of MOSs are presented to validate the heterogeneity and reliability of the proposed database. Furthermore, we evaluate the performance of several objective metrics on DPCD. The experiment results show that DPCQA is more challenge than that of static point cloud. The DPCD, which serves as a catalyst for new research endeavors on DPCQA, is publicly available at this https URL.
zh

[CV-124] Drag LoRA: Online Optimization of LoRA Adapters for Drag -based Image Editing in Diffusion Model ICML2025

【速读】:该论文旨在解决基于拖拽的图像编辑在预训练扩散模型中精度不足和计算效率低的问题。传统方法通过直接优化DDIM反演得到的输入特征并迭代调整以引导控制点到达目标位置,但由于运动监督中特征表示能力有限以及点追踪所需的大搜索空间,导致精度和效率受限。解决方案的关键在于提出DragLoRA框架,该框架将LoRA(Low-Rank Adaptation)适配器集成到拖拽编辑流程中,并引入额外的去噪分数蒸馏损失以增强LoRA适配器的训练,同时通过更新后的LoRA适配器优化输入特征,提升运动监督的一致性,最终设计了一种自适应优化方案,在保证精度的前提下提高计算效率。

链接: https://arxiv.org/abs/2505.12427
作者: Siwei Xia,Li Sun,Tiantian Sun,Qingli Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2025

点击查看摘要

Abstract:Drag-based editing within pretrained diffusion model provides a precise and flexible way to manipulate foreground objects. Traditional methods optimize the input feature obtained from DDIM inversion directly, adjusting them iteratively to guide handle points towards target locations. However, these approaches often suffer from limited accuracy due to the low representation ability of the feature in motion supervision, as well as inefficiencies caused by the large search space required for point tracking. To address these limitations, we present DragLoRA, a novel framework that integrates LoRA (Low-Rank Adaptation) adapters into the drag-based editing pipeline. To enhance the training of LoRA adapters, we introduce an additional denoising score distillation loss which regularizes the online model by aligning its output with that of the original model. Additionally, we improve the consistency of motion supervision by adapting the input features using the updated LoRA, giving a more stable and accurate input feature for subsequent operations. Building on this, we design an adaptive optimization scheme that dynamically toggles between two modes, prioritizing efficiency without compromising precision. Extensive experiments demonstrate that DragLoRA significantly enhances the control precision and computational efficiency for drag-based image editing. The Codes of DragLoRA are available at: this https URL.
zh

[CV-125] Kornia-rs: A Low-Level 3D Computer Vision Library In Rust

【速读】:该论文旨在解决在安全关键和实时应用中缺乏高效、可靠且支持3D计算机视觉操作的Rust语言库的问题。现有解决方案如基于C++的OpenCV或封装式的OpenCV-Rust存在性能或安全性方面的局限性。该论文提出的解决方案是\textit{kornia-rs},其关键在于从零开始设计,充分利用Rust的拥有权模型和类型系统以确保内存和线程安全,并采用静态类型张量系统和模块化crate结构,从而实现高效的图像输入输出、图像处理及3D操作。此外,通过提供Python绑定,增强了跨平台兼容性和与Rust代码的集成效率。

链接: https://arxiv.org/abs/2505.12425
作者: Edgar Riba,Jian Shi,Aditya Kumar,Andrew Shen,Gary Bradski
机构: Kornia AI Research Organization (Kornia人工智能研究组织); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); OpenCV.org (OpenCV.ORG)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present \textitkornia-rs, a high-performance 3D computer vision library written entirely in native Rust, designed for safety-critical and real-time applications. Unlike C+±based libraries like OpenCV or wrapper-based solutions like OpenCV-Rust, \textitkornia-rs is built from the ground up to leverage Rust’s ownership model and type system for memory and thread safety. \textitkornia-rs adopts a statically-typed tensor system and a modular set of crates, providing efficient image I/O, image processing and 3D operations. To aid cross-platform compatibility, \textitkornia-rs offers Python bindings, enabling seamless and efficient integration with Rust code. Empirical results show that \textitkornia-rs achieves a 3~ 5 times speedup in image transformation tasks over native Rust alternatives, while offering comparable performance to C++ wrapper-based libraries. In addition to 2D vision capabilities, \textitkornia-rs addresses a significant gap in the Rust ecosystem by providing a set of 3D computer vision operators. This paper presents the architecture and performance characteristics of \textitkornia-rs, demonstrating its effectiveness in real-world computer vision applications.
zh

[CV-126] ViEEG: Hierarchical Neural Coding with Cross-Modal Progressive Enhancement for EEG-Based Visual Decoding

【速读】:该论文旨在解决将脑活动解码为视觉表征的问题,特别是在利用脑电图(EEG)数据进行视觉信息重建时,现有方法因依赖扁平化神经表征而未能充分捕捉大脑固有的视觉层次结构。其解决方案的关键在于提出ViEEG框架,该框架受Hubel-Wiesel视觉处理理论启发,通过将视觉刺激分解为轮廓、前景物体和上下文场景三个生物对齐组件,并构建三流EEG编码器,随后通过交叉注意力路由逐步整合特征,模拟从V1到IT再到联合皮层的皮层信息流,同时结合分层对比学习使EEG表征与CLIP嵌入对齐,从而实现零样本目标识别。

链接: https://arxiv.org/abs/2505.12408
作者: Minxu Liu,Donghai Guan,Chuhang Zheng,Chunwei Tian,Jie Wen,Qi Zhu
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Northwestern Polytechnical University (西北工业大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 24 pages, 18 figures

点击查看摘要

Abstract:Understanding and decoding brain activity into visual representations is a fundamental challenge at the intersection of neuroscience and artificial intelligence. While EEG-based visual decoding has shown promise due to its non-invasive, low-cost nature and millisecond-level temporal resolution, existing methods are limited by their reliance on flat neural representations that overlook the brain’s inherent visual hierarchy. In this paper, we introduce ViEEG, a biologically inspired hierarchical EEG decoding framework that aligns with the Hubel-Wiesel theory of visual processing. ViEEG decomposes each visual stimulus into three biologically aligned components-contour, foreground object, and contextual scene-serving as anchors for a three-stream EEG encoder. These EEG features are progressively integrated via cross-attention routing, simulating cortical information flow from V1 to IT to the association cortex. We further adopt hierarchical contrastive learning to align EEG representations with CLIP embeddings, enabling zero-shot object recognition. Extensive experiments on the THINGS-EEG dataset demonstrate that ViEEG achieves state-of-the-art performance, with 40.9% Top-1 accuracy in subject-dependent and 22.9% Top-1 accuracy in cross-subject settings, surpassing existing methods by over 45%. Our framework not only advances the performance frontier but also sets a new paradigm for biologically grounded brain decoding in AI.
zh

[CV-127] CLIP-aware Domain-Adaptive Super-Resolution

【速读】:该论文试图解决单图像超分辨率(Single Image Super-Resolution, SISR)中的领域泛化(domain generalization)问题,即如何使模型在不同领域或数据分布下保持优异的性能。解决方案的关键在于提出了一种结合CLIP(Contrastive Language-Image Pre-training)引导的特征对齐机制与元学习启发的少样本适应策略的框架,通过多阶段的特征处理和融合,将语义信息有效整合到超分辨率流程中,并采用多组件损失函数以提升重建质量与语义一致性。

链接: https://arxiv.org/abs/2505.12391
作者: Zhengyang Lu,Qian Xia,Weifan Wang,Feng Wang
机构: Jiangnan University (江南大学); Changshu Institute of Technology (常熟理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work introduces CLIP-aware Domain-Adaptive Super-Resolution (CDASR), a novel framework that addresses the critical challenge of domain generalization in single image super-resolution. By leveraging the semantic capabilities of CLIP (Contrastive Language-Image Pre-training), CDASR achieves unprecedented performance across diverse domains and extreme scaling factors. The proposed method integrates CLIP-guided feature alignment mechanism with a meta-learning inspired few-shot adaptation strategy, enabling efficient knowledge transfer and rapid adaptation to target domains. A custom domain-adaptive module processes CLIP features alongside super-resolution features through a multi-stage transformation process, including CLIP feature processing, spatial feature generation, and feature fusion. This intricate process ensures effective incorporation of semantic information into the super-resolution pipeline. Additionally, CDASR employs a multi-component loss function that combines pixel-wise reconstruction, perceptual similarity, and semantic consistency. Extensive experiments on benchmark datasets demonstrate CDASR’s superiority, particularly in challenging scenarios. On the Urban100 dataset at \times 8 scaling, CDASR achieves a significant PSNR gain of 0.15dB over existing methods, with even larger improvements of up to 0.30dB observed at \times 16 scaling.
zh

[CV-128] Modeling Aesthetic Preferences in 3D Shapes: A Large-Scale Paired Comparison Study Across Object Categories

【速读】:该论文旨在解决3D形状美学计算模型缺乏大规模人类判断实证基础的问题,从而限制了其实际应用价值。其解决方案的关键在于通过大规模的人类偏好数据(22,301对比较)构建一个基于几何特征的非线性模型,并结合交叉类别分析以揭示影响美学偏好的几何驱动因素。研究采用Bradley-Terry模型推断潜在的美学评分,并利用随机森林与SHAP分析识别和解释最具影响力的几何特征(如对称性、曲率、紧凑性),从而实现模型的可解释性与设计指导性。

链接: https://arxiv.org/abs/2505.12373
作者: Kapil Dev(RMIT University, Melbourne, Australia)
机构: RMIT University (RMIT大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 8 figures, submitted to IEEE Transactions on Visualization and Computer Graphics (TVCG)

点击查看摘要

Abstract:Human aesthetic preferences for 3D shapes are central to industrial design, virtual reality, and consumer product development. However, most computational models of 3D aesthetics lack empirical grounding in large-scale human judgments, limiting their practical relevance. We present a large-scale study of human preferences. We collected 22,301 pairwise comparisons across five object categories (chairs, tables, mugs, lamps, and dining chairs) via Amazon Mechanical Turk. Building on a previously published dataset~\citedev2020learning, we introduce new non-linear modeling and cross-category analysis to uncover the geometric drivers of aesthetic preference. We apply the Bradley-Terry model to infer latent aesthetic scores and use Random Forests with SHAP analysis to identify and interpret the most influential geometric features (e.g., symmetry, curvature, compactness). Our cross-category analysis reveals both universal principles and domain-specific trends in aesthetic preferences. We focus on human interpretable geometric features to ensure model transparency and actionable design insights, rather than relying on black-box deep learning approaches. Our findings bridge computational aesthetics and cognitive science, providing practical guidance for designers and a publicly available dataset to support reproducibility. This work advances the understanding of 3D shape aesthetics through a human-centric, data-driven framework.
zh

[CV-129] STAR: Stage-Wise Attention-Guided Token Reduction for Efficient Large Vision-Language Models Inference

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在推理过程中因视觉令牌(visual tokens)带来的显著计算开销问题。现有无需训练的令牌剪枝方法通常采用单阶段策略,仅关注视觉自注意力或视觉-文本交叉注意力,忽略了模型中更广泛的的信息流动,导致在高剪枝比例下性能显著下降。该论文提出的解决方案是STAR(Stage-wise Attention-guided token Reduction),其关键在于从全局视角出发,通过两个互补阶段进行注意力引导的令牌剪枝:早期阶段基于视觉自注意力去除冗余的低级特征,后期阶段则通过跨模态注意力丢弃与任务无关的令牌,从而在显著降低计算成本的同时更好地保留任务关键信息。

链接: https://arxiv.org/abs/2505.12359
作者: Yichen Guo,Hanze Li,Zonghao Zhang,Jinhao You,Kai Tang,Xiande Huang
机构: De Artificial Intelligence Lab (德人工智能实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing training-free token pruning methods typically adopt a single-stage strategy, focusing either on visual self-attention or visual-textual cross-attention. However, such localized perspectives often overlook the broader information flow across the model, leading to substantial performance degradation, especially under high pruning ratios. In this work, we propose STAR (Stage-wise Attention-guided token Reduction), a training-free, plug-and-play framework that approaches token pruning from a global perspective. Instead of pruning at a single point, STAR performs attention-guided reduction in two complementary stages: an early-stage pruning based on visual self-attention to remove redundant low-level features, and a later-stage pruning guided by cross-modal attention to discard task-irrelevant tokens. This holistic approach allows STAR to significantly reduce computational cost while better preserving task-critical information. Extensive experiments across multiple LVLM architectures and benchmarks show that STAR achieves strong acceleration while maintaining comparable, and in some cases even improved performance.
zh

[CV-130] Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models

【速读】:该论文试图解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)中存在的幻觉问题,即生成与输入图像不一致的内容。解决方案的关键在于提出一种无需微调或外部知识库的解码机制——通过层聚合实现层间一致性(Decoding with Inter-layer Consistency via Layer Aggregation, DCLA),该方法通过动态语义参考来纠正语义偏差层,从而有效缓解幻觉现象并提升LVLMs的可靠性与性能。

链接: https://arxiv.org/abs/2505.12343
作者: Kai Tang,Jinhao You,Xiuqi Ge,Hanze Li,Yichen Guo,Xiande Huang
机构: De Artificial Intelligence Lab
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite the impressive capabilities of Large Vision-Language Models (LVLMs), they remain susceptible to hallucinations-generating content that is inconsistent with the input image. Existing training-free hallucination mitigation methods often suffer from unstable performance and high sensitivity to hyperparameter settings, limiting their practicality and broader adoption. In this paper, we propose a novel decoding mechanism, Decoding with Inter-layer Consistency via Layer Aggregation (DCLA), which requires no retraining, fine-tuning, or access to external knowledge bases. Specifically, our approach constructs a dynamic semantic reference by aggregating representations from previous layers, and corrects semantically deviated layers to enforce inter-layer consistency. The method allows DCLA to robustly mitigate hallucinations across multiple LVLMs. Experiments on hallucination benchmarks such as MME and POPE demonstrate that DCLA effectively reduces hallucinations while enhancing the reliability and performance of LVLMs.
zh

[CV-131] DIMM: Decoupled Multi-hierarchy Kalman Filter for 3D Object Tracking

【速读】:该论文旨在解决高机动性三维目标跟踪中的状态估计问题,其核心挑战在于目标的状态转移函数变化迅速、不规则且对估计器未知。传统基于交互多模型(IMM)的方法虽通过模型组合提高了估计精度,但存在两个局限:一是模型组合的解空间受限,未能充分考虑目标在不同方向上的多样化运动特性;二是基于观测似然计算的模型组合权重不够准确,受测量不确定性影响较大。论文提出的DIMM框架通过设计一个三维解耦的多层级滤波器组,将传统IMM的超平面解空间扩展为超立方体,从而更全面地描述目标的运动特性;同时,采用可微分自适应融合网络生成更可靠的组合权重矩阵,而非仅依赖观测似然,该网络结合了基于注意力机制的分层奖励双延迟深度确定性策略梯度(TD3)方法。

链接: https://arxiv.org/abs/2505.12340
作者: Jirong Zha,Yuxuan Fan,Kai Li,Han Li,Chen Gao,Xinlei Chen,Yong Li
机构: Shenzhen International Graudate School (深圳国际研究生院); Tsinghua University (清华大学); The Hong Kong University of Science and Technology (Guang Zhou) (香港科技大学(广州))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:State estimation is challenging for 3D object tracking with high maneuverability, as the target’s state transition function changes rapidly, irregularly, and is unknown to the estimator. Existing work based on interacting multiple model (IMM) achieves more accurate estimation than single-filter approaches through model combination, aligning appropriate models for different motion modes of the target object over time. However, two limitations of conventional IMM remain unsolved. First, the solution space of the model combination is constrained as the target’s diverse kinematic properties in different directions are ignored. Second, the model combination weights calculated by the observation likelihood are not accurate enough due to the measurement uncertainty. In this paper, we propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction, thus increasing the 3D object tracking accuracy. First, DIMM extends the model combination solution space of conventional IMM from a hyperplane to a hypercube by designing a 3D-decoupled multi-hierarchy filter bank, which describes the target’s motion with various-order linear models. Second, DIMM generates more reliable combination weight matrices through a differentiable adaptive fusion network for importance allocation rather than solely relying on the observation likelihood; it contains an attention-based twin delayed deep deterministic policy gradient (TD3) method with a hierarchical reward. Experiments demonstrate that DIMM significantly improves the tracking accuracy of existing state estimation methods by 31.61%~99.23%.
zh

[CV-132] owards Open-world Generalized Deepfake Detection: General Feature Extraction via Unsupervised Domain Adaptation

【速读】:该论文试图解决在开放世界场景下,由于缺乏足够标注数据而导致的未知深度伪造方法检测难题,特别是如何利用有限的标注数据对大规模未标注数据进行高效检测。解决方案的关键在于提出一种新的开放世界深度伪造检测泛化增强训练策略(OWG-DS),通过域距离优化(DDO)模块对齐不同域特征,并结合基于相似性的类别边界分离(SCBS)模块增强样本聚合与类别边界清晰度,同时采用对抗训练机制学习域不变特征,从而提升模型的泛化能力。

链接: https://arxiv.org/abs/2505.12339
作者: Midou Guo,Qilin Yin,Wei Lu,Xiangyang Luo
机构: Sun Yat-sen University (中山大学); State Key Laboratory of Mathematical Engineering and Advanced Computing (数学工程与先进计算国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the development of generative artificial intelligence, new forgery methods are rapidly emerging. Social platforms are flooded with vast amounts of unlabeled synthetic data and authentic data, making it increasingly challenging to distinguish real from fake. Due to the lack of labels, existing supervised detection methods struggle to effectively address the detection of unknown deepfake methods. Moreover, in open world scenarios, the amount of unlabeled data greatly exceeds that of labeled data. Therefore, we define a new deepfake detection generalization task which focuses on how to achieve efficient detection of large amounts of unlabeled data based on limited labeled data to simulate a open world scenario. To solve the above mentioned task, we propose a novel Open-World Deepfake Detection Generalization Enhancement Training Strategy (OWG-DS) to improve the generalization ability of existing methods. Our approach aims to transfer deepfake detection knowledge from a small amount of labeled source domain data to large-scale unlabeled target domain data. Specifically, we introduce the Domain Distance Optimization (DDO) module to align different domain features by optimizing both inter-domain and intra-domain distances. Additionally, the Similarity-based Class Boundary Separation (SCBS) module is used to enhance the aggregation of similar samples to ensure clearer class boundaries, while an adversarial training mechanism is adopted to learn the domain-invariant features. Extensive experiments show that the proposed deepfake detection generalization enhancement training strategy excels in cross-method and cross-dataset scenarios, improving the model’s generalization.
zh

[CV-133] Structureless VIO

【速读】:该论文试图解决视觉惯性里程计(Visual-Inertial Odometry, VIO)中传统结构依赖的问题,即其定位与建图模块紧密耦合所导致的计算效率低下和对地图依赖性强的缺陷。解决方案的关键在于提出一种无结构的VIO框架,通过移除视觉地图(visual map)来实现无需地图的高效定位,从而在保持精度的同时显著提升计算效率。

链接: https://arxiv.org/abs/2505.12337
作者: Junlin Song,Miguel Olivares-Mendez
机构: Space Robotics (SpaceR) Research Group, Int. Centre for Security, Reliability and Trust (SnT), University of Luxembourg (卢森堡大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual odometry (VO) is typically considered as a chicken-and-egg problem, as the localization and mapping modules are tightly-coupled. The estimation of visual map relies on accurate localization information. Meanwhile, localization requires precise map points to provide motion constraints. This classical design principle is naturally inherited by visual-inertial odometry (VIO). Efficient localization solution that does not require a map has not been fully investigated. To this end, we propose a novel structureless VIO, where the visual map is removed from the odometry framework. Experimental results demonstrated that, compared to the structure-based VIO baseline, our structureless VIO not only substantially improves computational efficiency but also has advantages in accuracy.
zh

[CV-134] Is Artificial Intelligence Generated Image Detection a Solved Problem?

【速读】:该论文旨在解决当前人工智能生成图像(Artificial Intelligence Generated Image, AIGI)检测器在真实场景中有效性不足的问题。尽管已有多种AIGI检测方法在受控环境中表现出较高的准确率,但其在面对现实世界复杂情况时性能显著下降。论文提出的解决方案关键在于构建AIGIBench基准测试平台,通过模拟多源泛化、图像退化鲁棒性、数据增强敏感性以及测试时预处理影响等四个核心任务,全面评估现有检测器的鲁棒性和泛化能力,从而揭示现有方法的局限性并为未来研究提供指导。

链接: https://arxiv.org/abs/2505.12335
作者: Ziqiang Li,Jiazhen Yan,Ziwen He,Kai Zeng,Weiwei Jiang,Lizhi Xiong,Zhangjie Fu
机构: Nanjing University of Information Science and Technology (南京信息工程大学); University of Siena (西耶纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Under Review

点击查看摘要

Abstract:The rapid advancement of generative models, such as GANs and Diffusion models, has enabled the creation of highly realistic synthetic images, raising serious concerns about misinformation, deepfakes, and copyright infringement. Although numerous Artificial Intelligence Generated Image (AIGI) detectors have been proposed, often reporting high accuracy, their effectiveness in real-world scenarios remains questionable. To bridge this gap, we introduce AIGIBench, a comprehensive benchmark designed to rigorously evaluate the robustness and generalization capabilities of state-of-the-art AIGI detectors. AIGIBench simulates real-world challenges through four core tasks: multi-source generalization, robustness to image degradation, sensitivity to data augmentation, and impact of test-time pre-processing. It includes 23 diverse fake image subsets that span both advanced and widely adopted image generation techniques, along with real-world samples collected from social media and AI art platforms. Extensive experiments on 11 advanced detectors demonstrate that, despite their high reported accuracy in controlled settings, these detectors suffer significant performance drops on real-world data, limited benefits from common augmentations, and nuanced effects of pre-processing, highlighting the need for more robust detection strategies. By providing a unified and realistic evaluation framework, AIGIBench offers valuable insights to guide future research toward dependable and generalizable AIGI detection.
zh

[CV-135] VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在真实语音克隆(Voice Cloning, VC)中可能被恶意滥用的问题,现有针对传统VC模型的主动防御方法因DMs复杂的生成机制而无法适用。解决方案的关键在于提出一种多维主动防御框架VoiceCloak,通过引入对抗性扰动破坏参考音频,从而混淆说话人身份并降低感知质量。具体而言,VoiceCloak通过扭曲表示学习嵌入以最大化身份差异,并干扰关键的条件引导过程,尤其是注意力上下文,同时利用得分幅度放大和噪声引导语义破坏来降低生成语音的质量,从而有效阻止未经授权的语音克隆。

链接: https://arxiv.org/abs/2505.12332
作者: Qianyue Hu,Junyan Wu,Wei Lu,Xiangyang Luo
机构: Sun Yat-sen University (中山大学); State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou (数学工程与先进计算国家重点实验室,郑州)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak’s outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at this https URL.
zh

[CV-136] Model alignment using inter-modal bridges

【速读】:该论文试图解决跨模态(如文本与视觉)模型重用受限的问题,主要原因是不同模态内部表示的对齐难度较大。现有方法要么需要大量成对训练数据,要么局限于特定领域。其解决方案的关键在于提出一种半监督的模型对齐方法,通过条件流匹配实现跨模态潜在空间的对齐,具体包括在具有跨空间桥接成本的平衡或非平衡最优传输问题中求解,以及利用标记样本进行高效对齐。该方法在标签数据稀缺的情况下仍能保持与端到端训练模型相当的下游任务性能。

链接: https://arxiv.org/abs/2505.12322
作者: Ali Gholamzadeh,Noor Sajid
机构: MPI for Biological Cybernetics (马普所生物认知学); University of Tübingen (图宾根大学); Kempner Institute (肯普纳研究所); Harvard University (哈佛大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models have demonstrated remarkable performance across modalities such as language and vision. However, model reuse across distinct modalities (e.g., text and vision) remains limited due to the difficulty of aligning internal representations. Existing methods require extensive paired training data or are constrained to specific domains. We introduce a semi-supervised approach for model alignment via conditional flow matching. The conditional flow between latent spaces of different modalities (e.g., text-to-image or biological-to-artificial neuronal activity) can be learned in two settings: ( 1 ) solving a (balanced or unbalanced) optimal transport problem with an inter-space bridge cost, and ( 2 ) performing memory-efficient alignment using labelled exemplars. Despite being constrained by the original models’ capacity, our method–under both settings–matches downstream task performance of end-to-end trained models on object recognition and image generation tasks across MNIST, ImageNet, and \citemajaj2015simple datasets, particularly when labelled training data is scarce ( 20% ). Our method provides a data-efficient solution for inter-modal model alignment with minimal supervision.
zh

[CV-137] Improving Out-of-Domain Robustness with Targeted Augmentation in Frequency and Pixel Spaces

【速读】:该论文旨在解决在领域自适应设置下,当源域标签数据与目标域无标签数据来自不同分布时,模型的跨领域(Out-of-Domain, OOD)鲁棒性不足的问题。传统数据增强方法在面对分布偏移时效果有限,而针对性的数据集特定增强方法虽有效但依赖专家知识和大量前期数据分析。该论文提出的解决方案关键在于Frequency-Pixel Connect框架,其通过在频率域和像素域中引入针对性增强,混合源图像和目标图像的幅度谱与像素内容,生成既保留源图像语义结构又引入领域多样性的增强样本,从而提升模型的跨领域泛化能力。该方法具有数据集无关性,适用于更广泛的应用场景。

链接: https://arxiv.org/abs/2505.12317
作者: Ruoqi Wang,Haitao Wang,Shaojie Guo,Qiong Luo
机构: HKUST(GZ)(香港科技大学(广州)); SYSU(中山大学); ECNU(华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Out-of-domain (OOD) robustness under domain adaptation settings, where labeled source data and unlabeled target data come from different distributions, is a key challenge in real-world applications. A common approach to improving OOD robustness is through data augmentations. However, in real-world scenarios, models trained with generic augmentations can only improve marginally when generalized under distribution shifts toward unlabeled target domains. While dataset-specific targeted augmentations can address this issue, they typically require expert knowledge and extensive prior data analysis to identify the nature of the datasets and domain shift. To address these challenges, we propose Frequency-Pixel Connect, a domain-adaptation framework that enhances OOD robustness by introducing a targeted augmentation in both the frequency space and pixel space. Specifically, we mix the amplitude spectrum and pixel content of a source image and a target image to generate augmented samples that introduce domain diversity while preserving the semantic structure of the source image. Unlike previous targeted augmentation methods that are both dataset-specific and limited to the pixel space, Frequency-Pixel Connect is dataset-agnostic, enabling broader and more flexible applicability beyond natural image datasets. We further analyze the effectiveness of Frequency-Pixel Connect by evaluating the performance of our method connecting same-class cross-domain samples while separating different-class examples. We demonstrate that Frequency-Pixel Connect significantly improves cross-domain connectivity and outperforms previous generic methods on four diverse real-world benchmarks across vision, medical, audio, and astronomical domains, and it also outperforms other dataset-specific targeted augmentation methods.
zh

[CV-138] DNOI-4DRO: Deep 4D Radar Odometry with Differentiable Neural-Optimization Iterations

【速读】:该论文旨在解决4D雷达里程计(4D radar odometry)中的精确位姿估计问题,特别是在处理稀疏的4D雷达点云数据时的挑战。其解决方案的关键在于提出了一种结合学习与优化的新型模型DNOI-4DRO,该模型将传统的几何优化与端到端神经网络训练无缝融合,通过创新的可微分神经-优化迭代算子实现运动流的点级估计,并利用基于点运动与三维空间位姿关系构建的成本函数进行位姿优化。此外,设计的双流4D雷达主干网络增强了对稀疏点云的表征能力。

链接: https://arxiv.org/abs/2505.12310
作者: Shouyi Lu,Huanyu Zhou,Guirong Zhuo
机构: Tongji University (同济大学); School of Automotive Studies (汽车学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 16 pages,10 figures

点击查看摘要

Abstract:A novel learning-optimization-combined 4D radar odometry model, named DNOI-4DRO, is proposed in this paper. The proposed model seamlessly integrates traditional geometric optimization with end-to-end neural network training, leveraging an innovative differentiable neural-optimization iteration operator. In this framework, point-wise motion flow is first estimated using a neural network, followed by the construction of a cost function based on the relationship between point motion and pose in 3D space. The radar pose is then refined using Gauss-Newton updates. Additionally, we design a dual-stream 4D radar backbone that integrates multi-scale geometric features and clustering-based class-aware features to enhance the representation of sparse 4D radar point clouds. Extensive experiments on the VoD and Snail-Radar datasets demonstrate the superior performance of our model, which outperforms recent classical and learning-based approaches. Notably, our method even achieves results comparable to A-LOAM with mapping optimization using LiDAR point clouds as input. Our models and code will be publicly released.
zh

[CV-139] mporal-Spectral-Spatial Unified Remote Sensing Dense Prediction

【速读】:该论文旨在解决遥感数据在时间、光谱和空间(TSS)维度上的异质性带来的统一处理难题,以及现有深度学习模型在面对不同数据维度或任务需求时性能下降或模型不兼容的问题。其解决方案的关键在于提出一种名为时空谱统一网络(TSSUN)的架构,该架构通过时空谱统一策略,利用元信息解耦并标准化来自不同TSS配置的输入表示,并统一不同密集预测任务和类别数的输出结构,同时引入局部-全局窗口注意力机制以高效捕捉局部细节与全局依赖关系,从而实现对异质输入的适应和多种密集预测任务的统一建模。

链接: https://arxiv.org/abs/2505.12280
作者: Sijie Zhao,Feng Liu,Xueliang Zhang,Hao Chen,Pengfeng Xiao,Lei Bai
机构: Nanjing University (南京大学); Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures, Code link: this https URL

点击查看摘要

Abstract:The proliferation of diverse remote sensing data has spurred advancements in dense prediction tasks, yet significant challenges remain in handling data heterogeneity. Remote sensing imagery exhibits substantial variability across temporal, spectral, and spatial (TSS) dimensions, complicating unified data processing. Current deep learning models for dense prediction tasks, such as semantic segmentation and change detection, are typically tailored to specific input-output configurations. Consequently, variations in data dimensionality or task requirements often lead to significant performance degradation or model incompatibility, necessitating costly retraining or fine-tuning efforts for different application scenarios. This paper introduces the Temporal-Spectral-Spatial Unified Network (TSSUN), a novel architecture designed for unified representation and modeling of remote sensing data across diverse TSS characteristics and task types. TSSUN employs a Temporal-Spectral-Spatial Unified Strategy that leverages meta-information to decouple and standardize input representations from varied temporal, spectral, and spatial configurations, and similarly unifies output structures for different dense prediction tasks and class numbers. Furthermore, a Local-Global Window Attention mechanism is proposed to efficiently capture both local contextual details and global dependencies, enhancing the model’s adaptability and feature extraction capabilities. Extensive experiments on multiple datasets demonstrate that a single TSSUN model effectively adapts to heterogeneous inputs and unifies various dense prediction tasks. The proposed approach consistently achieves or surpasses state-of-the-art performance, highlighting its robustness and generalizability for complex remote sensing applications without requiring task-specific modifications.
zh

[CV-140] Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning

【速读】:该论文试图解决如何通过视觉驱动的控制框架实现类人行为的复杂任务,如物体搜索、抓取和操作等,而无需依赖精确的状态信息。其解决方案的关键在于提出感知-接口(perception-as-interface)范式,即仅使用第一视角视觉输入来指定任务,从而学习一个通用策略以执行多种家庭任务,并在强化学习训练中产生类似人类的主动搜索等行为。

链接: https://arxiv.org/abs/2505.12278
作者: Zhengyi Luo,Chen Tessler,Toru Lin,Ye Yuan,Tairan He,Wenli Xiao,Yunrong Guo,Gal Chechik,Kris Kitani,Linxi Fan,Yuke Zhu
机构: Nvidia; Carnegie Mellon University; University of California, Berkeley
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Human behavior is fundamentally shaped by visual perception – our ability to interact with the world depends on actively gathering relevant information and adapting our movements accordingly. Behaviors like searching for objects, reaching, and hand-eye coordination naturally emerge from the structure of our sensory system. Inspired by these principles, we introduce Perceptive Dexterous Control (PDC), a framework for vision-driven dexterous whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues, without relying on privileged state information (e.g., 3D object positions and geometries). This perception-as-interface paradigm enables learning a single policy to perform multiple household tasks, including reaching, grasping, placing, and articulated object manipulation. We also show that training from scratch with reinforcement learning can produce emergent behaviors such as active search. These results demonstrate how vision-driven control and complex tasks induce human-like behaviors and can serve as the key ingredients in closing the perception-action loop for animation, robotics, and embodied AI.
zh

[CV-141] Context-Aware Autoregressive Models for Multi-Conditional Image Generation

【速读】:该论文旨在解决多条件图像生成任务中如何高效融合多种模态信息的问题,特别是在保持空间对齐和增强不同条件类型之间区分度的同时,降低计算复杂度。其解决方案的关键在于提出一种名为ContextAR的灵活且有效的框架,该框架通过将多种条件(如Canny边缘、深度图、姿态等)直接嵌入到统一的标记序列中,保留模态特定语义,并引入混合位置编码(融合旋转位置编码与可学习位置编码)以维持空间对齐和提升条件类型间的区分能力,同时设计条件上下文感知注意力机制以减少计算复杂度并保持条件内部的有效感知。

链接: https://arxiv.org/abs/2505.12274
作者: Yixiao Chen,Zhiyuan Ma,Guoli Jia,Che Jiang,Jianjun Li,Bowen Zhou
机构: Tsinghua University (清华大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence–offering a concise solution for multi-conditional image generation tasks. In this work, we propose \textbfContextAR , a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduces computational complexity while preserving effective intra-condition perception. Without any fine-tuning, ContextAR supports arbitrary combinations of conditions during inference time. Experimental results demonstrate the powerful controllability and versatility of our approach, and show that the competitive perpormance than diffusion-based multi-conditional control approaches the existing autoregressive baseline across diverse multi-condition driven scenarios. Project page: \hrefthis https URLthis https URL.
zh

[CV-142] PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement

【速读】:该论文旨在解决视频增强任务中基于Transformer的方法在边缘设备部署时面临的计算与内存需求过高的问题。现有量化方法在直接应用于视频增强任务时会导致性能下降和细节丢失,主要受限于无法在不同帧间分配差异化的表示能力以及过度依赖全精度教师模型。该论文提出的解决方案关键在于设计了一种新颖的量化方法——面向视频增强的渐进多帧量化(PMQ-VE),其核心包括基于回溯的多帧量化(BMFQ)和渐进多教师蒸馏(PMTD),通过分阶段优化量化边界和引入多教师协同训练机制,有效提升了低比特模型的性能与质量。

链接: https://arxiv.org/abs/2505.12266
作者: ZhanFeng Feng,Long Peng,Xin Di,Yong Guo,Wenbo Li,Yulun Zhang,Renjing Pei,Yang Wang,Yang Cao,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学); Huawei Technologies Co., Ltd. (华为技术有限公司); Shanghai Jiao Tong University (上海交通大学); Chang’an University (长安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences by leveraging temporal information from multiple frames, which are widely used in streaming video processing, surveillance, and generation. Although numerous Transformer-based enhancement methods have achieved impressive performance, their computational and memory demands hinder deployment on edge devices. Quantization offers a practical solution by reducing the bit-width of weights and activations to improve efficiency. However, directly applying existing quantization methods to video enhancement tasks often leads to significant performance degradation and loss of fine details. This stems from two limitations: (a) inability to allocate varying representational capacity across frames, which results in suboptimal dynamic range adaptation; (b) over-reliance on full-precision teachers, which limits the learning of low-bit student models. To tackle these challenges, we propose a novel quantization method for video enhancement: Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD). BMFQ utilizes a percentile-based initialization and iterative search with pruning and backtracking for robust clipping bounds. PMTD employs a progressive distillation strategy with both full-precision and multiple high-bit (INT) teachers to enhance low-bit models’ capacity and quality. Extensive experiments demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance across multiple tasks and this http URL code will be made publicly available at: this https URL.
zh

[CV-143] MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark

【速读】:该论文旨在解决现有视觉场景识别(Visual Place Recognition, VPR)数据集主要依赖车载图像、缺乏多模态多样性以及在非西方城市环境中对密集混合用途街道空间表征不足的问题。其解决方案的关键在于引入MMS-VPR,这是一个大规模的多模态数据集,用于复杂行人专用环境中的街道级场景识别。该数据集包含78,575张标注图像和2,512段视频片段,覆盖了中国成都一个约70,800平方米的开放商业区的207个地点,并通过精确的GPS坐标、时间戳和文本元数据进行标注。此外,MMS-VPR构建了一个具有125条边、81个节点和1个子图的内在空间图,支持结构感知的场景识别,并定义了两个应用特定子集以支持细粒度和基于图的评估任务。

链接: https://arxiv.org/abs/2505.12254
作者: Yiwei Ou,Xiaobin Ren,Ronggui Sun,Guansong Gao,Ziyi Jiang,Kaiqi Zhao,Manfredo Manfredini
机构: The University of Auckland (奥克兰大学); Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing visual place recognition (VPR) datasets predominantly rely on vehicle-mounted imagery, lack multimodal diversity and underrepresent dense, mixed-use street-level spaces, especially in non-Western urban contexts. To address these gaps, we introduce MMS-VPR, a large-scale multimodal dataset for street-level place recognition in complex, pedestrian-only environments. The dataset comprises 78,575 annotated images and 2,512 video clips captured across 207 locations in a ~70,800 \mathrmm^2 open-air commercial district in Chengdu, China. Each image is labeled with precise GPS coordinates, timestamp, and textual metadata, and covers varied lighting conditions, viewpoints, and timeframes. MMS-VPR follows a systematic and replicable data collection protocol with minimal device requirements, lowering the barrier for scalable dataset creation. Importantly, the dataset forms an inherent spatial graph with 125 edges, 81 nodes, and 1 subgraph, enabling structure-aware place recognition. We further define two application-specific subsets – Dataset_Edges and Dataset_Points – to support fine-grained and graph-based evaluation tasks. Extensive benchmarks using conventional VPR models, graph neural networks, and multimodal baselines show substantial improvements when leveraging multimodal and structural cues. MMS-VPR facilitates future research at the intersection of computer vision, geospatial understanding, and multimodal reasoning. The dataset is publicly available at this https URL.
zh

[CV-144] LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在物理世界中因缺乏空间表示而难以处理动态场景的问题。现有方法主要通过将3D位置作为固定空间提示嵌入视觉特征中来表征场景,但这些方法仅能理解静态背景,无法捕捉时间变化的动态物体。论文提出的解决方案关键在于引入一种新颖的时空提示(spatiotemporal prompt),通过将3D位置和1D时间编码为动态感知的4D坐标嵌入,以增强对动态场景的理解能力。此外,通过将时空提示嵌入到视觉特征中,并对齐视觉时空嵌入与语言嵌入,LMMs能够同时理解静态背景和动态物体的空间与时间特性。

链接: https://arxiv.org/abs/2505.12253
作者: Hanyu Zhou,Gim Hee Lee
机构: School of Computing, National University of Singapore (计算学院,新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite achieving significant progress in 2D image understanding, large multimodal models (LMMs) struggle in the physical world due to the lack of spatial representation. Typically, existing 3D LMMs mainly embed 3D positions as fixed spatial prompts within visual features to represent the scene. However, these methods are limited to understanding the static background and fail to capture temporally varying dynamic objects. In this paper, we propose LLaVA-4D, a general LMM framework with a novel spatiotemporal prompt for visual representation in 4D scene understanding. The spatiotemporal prompt is generated by encoding 3D position and 1D time into a dynamic-aware 4D coordinate embedding. Moreover, we demonstrate that spatial and temporal components disentangled from visual features are more effective in distinguishing the background from objects. This motivates embedding the 4D spatiotemporal prompt into these features to enhance the dynamic scene representation. By aligning visual spatiotemporal embeddings with language embeddings, LMMs gain the ability to understand both spatial and temporal characteristics of static background and dynamic objects in the physical world. Additionally, we construct a 4D vision-language dataset with spatiotemporal coordinate annotations for instruction fine-tuning LMMs. Extensive experiments have been conducted to demonstrate the effectiveness of our method across different tasks in 4D scene understanding.
zh

[CV-145] SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis

【速读】:该论文试图解决现有多模态医学图像融合方法在特征提取和融合策略制定中过度依赖计算机视觉标准,而忽视了医学图像中固有的丰富语义信息的问题。其解决方案的关键在于首次将医学先验知识引入融合过程,通过构建公开的多模态医学图像-文本数据集,利用BiomedGPT生成的文本描述与图像特征在高维空间中进行语义对齐,并通过基于交叉注意力的线性变换自动映射文本与视觉特征之间的关系,从而实现更全面的学习。此外,还设计了医学语义损失函数以增强对源图像中文本线索的保留。

链接: https://arxiv.org/abs/2505.12251
作者: Haozhe Xiang,Han Zhang,Yu Cheng,Xiongwen Quan,Wanwan Huang
机构: Hunan Agricultural University (湖南农业大学); Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal medical image fusion plays a crucial role in medical diagnosis by integrating complementary information from different modalities to enhance image readability and clinical applicability. However, existing methods mainly follow computer vision standards for feature extraction and fusion strategy formulation, overlooking the rich semantic information inherent in medical images. To address this limitation, we propose a novel semantic-guided medical image fusion approach that, for the first time, incorporates medical prior knowledge into the fusion process. Specifically, we construct a publicly available multimodal medical image-text dataset, upon which text descriptions generated by BiomedGPT are encoded and semantically aligned with image features in a high-dimensional space via a semantic interaction alignment module. During this process, a cross attention based linear transformation automatically maps the relationship between textual and visual features to facilitate comprehensive learning. The aligned features are then embedded into a text-injection module for further feature-level fusion. Unlike traditional methods, we further generate diagnostic reports from the fused images to assess the preservation of medical information. Additionally, we design a medical semantic loss function to enhance the retention of textual cues from the source images. Experimental results on test datasets demonstrate that the proposed method achieves superior performance in both qualitative and quantitative evaluations while preserving more critical medical information.
zh

[CV-146] SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving

【速读】:该论文旨在解决在线场景感知与拓扑推理在长距离或遮挡场景下的局限性,特别是在依赖车载传感器的自主车辆中,由于传感器固有约束导致的性能不足问题。其解决方案的关键在于提出一种标准清晰度(Standard-Definition, SD)地图增强的场景感知与拓扑推理(SEPT)框架,通过有效融合SD地图作为先验知识到现有的感知与推理流程中,具体包括一种结合栅格化与矢量表示的混合特征融合策略,以及利用SD地图特性设计的辅助交叉口感知关键点检测任务,从而提升整体场景理解性能。

链接: https://arxiv.org/abs/2505.12246
作者: Muleilan Pei,Jiayao Shan,Peiliang Li,Jieqi Shi,Jing Huo,Yang Gao,Shaojie Shen
机构: Hong Kong University of Science and Technology (香港科技大学); Zhuoyu Technology (卓越科技); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Robotics and Automation Letters

点击查看摘要

Abstract:Online scene perception and topology reasoning are critical for autonomous vehicles to understand their driving environments, particularly for mapless driving systems that endeavor to reduce reliance on costly High-Definition (HD) maps. However, recent advances in online scene understanding still face limitations, especially in long-range or occluded scenarios, due to the inherent constraints of onboard sensors. To address this challenge, we propose a Standard-Definition (SD) Map Enhanced scene Perception and Topology reasoning (SEPT) framework, which explores how to effectively incorporate the SD map as prior knowledge into existing perception and reasoning pipelines. Specifically, we introduce a novel hybrid feature fusion strategy that combines SD maps with Bird’s-Eye-View (BEV) features, considering both rasterized and vectorized representations, while mitigating potential misalignment between SD maps and BEV feature spaces. Additionally, we leverage the SD map characteristics to design an auxiliary intersection-aware keypoint detection task, which further enhances the overall scene understanding performance. Experimental results on the large-scale OpenLane-V2 dataset demonstrate that by effectively integrating SD map priors, our framework significantly improves both scene perception and topology reasoning, outperforming existing methods by a substantial margin.
zh

[CV-147] From Shots to Stories: LLM -Assisted Video Editing with Unified Language Representations

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在视频编辑中的应用问题,特别是如何将视觉信息与基于语言的推理有效结合。其解决方案的关键在于引入L-Storyboard,这是一种将离散视频片段转换为适合LLMs处理的结构化语言描述的中间表示,从而实现视觉信息与语言描述之间的稳健映射。此外,针对发散性任务输出的不稳定性,论文提出了StoryFlow策略,将多路径推理过程转化为收敛性选择机制,以提升任务准确性和逻辑一致性。

链接: https://arxiv.org/abs/2505.12237
作者: Yuzhi Li,Haojun Xu,Fang Tian
机构: Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable reasoning and generalization capabilities in video understanding; however, their application in video editing remains largely underexplored. This paper presents the first systematic study of LLMs in the context of video editing. To bridge the gap between visual information and language-based reasoning, we introduce L-Storyboard, an intermediate representation that transforms discrete video shots into structured language descriptions suitable for LLM processing. We categorize video editing tasks into Convergent Tasks and Divergent Tasks, focusing on three core tasks: Shot Attributes Classification, Next Shot Selection, and Shot Sequence Ordering. To address the inherent instability of divergent task outputs, we propose the StoryFlow strategy, which converts the divergent multi-path reasoning process into a convergent selection mechanism, effectively enhancing task accuracy and logical coherence. Experimental results demonstrate that L-Storyboard facilitates a more robust mapping between visual information and language descriptions, significantly improving the interpretability and privacy protection of video editing tasks. Furthermore, StoryFlow enhances the logical consistency and output stability in Shot Sequence Ordering, underscoring the substantial potential of LLMs in intelligent video editing.
zh

[CV-148] NOFT: Test-Time Noise Finetune via Information Bottleneck for Highly Correlated Asset Creation

【速读】:该论文旨在解决文本到图像(T2I)和图像到图像(I2I)生成中内容保真度与可控性变化之间的权衡问题。现有方法通过外部信号或扩散特征操作实现高保真可控编辑,并依赖不同噪声潜在变量来提升多样性,但未充分挖掘压缩上下文潜在空间中的信息。论文提出了一种可插拔的噪声微调(NOFT)模块,通过最优传输信息瓶颈(OT-IB)对种子噪声或逆向噪声进行微调,仅需约14K个可训练参数和10分钟训练时间,能够有效生成高度相关且多样的图像,从而在保持拓扑与纹理对齐的前提下实现高质量图像变体生成。

链接: https://arxiv.org/abs/2505.12235
作者: Jia Li,Nan Gao,Huaibo Huang,Ran He
机构: CASIA; NLPR, CASIA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The diffusion model has provided a strong tool for implementing text-to-image (T2I) and image-to-image (I2I) generation. Recently, topology and texture control are popular explorations, e.g., ControlNet, IP-Adapter, Ctrl-X, and DSG. These methods explicitly consider high-fidelity controllable editing based on external signals or diffusion feature manipulations. As for diversity, they directly choose different noise latents. However, the diffused noise is capable of implicitly representing the topological and textural manifold of the corresponding image. Moreover, it’s an effective workbench to conduct the trade-off between content preservation and controllable variations. Previous T2I and I2I diffusion works do not explore the information within the compressed contextual latent. In this paper, we first propose a plug-and-play noise finetune NOFT module employed by Stable Diffusion to generate highly correlated and diverse images. We fine-tune seed noise or inverse noise through an optimal-transported (OT) information bottleneck (IB) with around only 14K trainable parameters and 10 minutes of training. Our test-time NOFT is good at producing high-fidelity image variations considering topology and texture alignments. Comprehensive experiments demonstrate that NOFT is a powerful general reimagine approach to efficiently fine-tune the 2D/3D AIGC assets with text or image guidance.
zh

[CV-149] From Low Field to High Value: Robust Cortical Mapping from Low-Field MRI

【速读】:该论文旨在解决低场磁共振成像(LF-MRI)在皮层表面三维重建与形态学分析中的挑战,尤其是在信号噪声比和分辨率较低的情况下,传统工具难以有效处理。其解决方案的关键在于提出一种基于机器学习的方法,采用3D U-Net网络,在合成LF-MRI数据上进行训练,以预测皮层表面的符号距离函数,并通过几何处理确保拓扑准确性,从而实现无需重新训练即可直接应用于不同对比度和分辨率的便携式LF-MRI数据。

链接: https://arxiv.org/abs/2505.12228
作者: Karthik Gopinath,Annabel Sorby-Adams,Jonathan W. Ramirez,Dina Zemlyanker,Jennifer Guo,David Hunt,Christine L. Mac Donald,C. Dirk Keene,Timothy Coalson,Matthew F. Glasser,David Van Essen,Matthew S. Rosen,Oula Puonti,W. Taylor Kimberly,Juan Eugenio Iglesias
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 32 pages

点击查看摘要

Abstract:Three-dimensional reconstruction of cortical surfaces from MRI for morphometric analysis is fundamental for understanding brain structure. While high-field MRI (HF-MRI) is standard in research and clinical settings, its limited availability hinders widespread use. Low-field MRI (LF-MRI), particularly portable systems, offers a cost-effective and accessible alternative. However, existing cortical surface analysis tools are optimized for high-resolution HF-MRI and struggle with the lower signal-to-noise ratio and resolution of LF-MRI. In this work, we present a machine learning method for 3D reconstruction and analysis of portable LF-MRI across a range of contrasts and resolutions. Our method works “out of the box” without retraining. It uses a 3D U-Net trained on synthetic LF-MRI to predict signed distance functions of cortical surfaces, followed by geometric processing to ensure topological accuracy. We evaluate our method using paired HF/LF-MRI scans of the same subjects, showing that LF-MRI surface reconstruction accuracy depends on acquisition parameters, including contrast type (T1 vs T2), orientation (axial vs isotropic), and resolution. A 3mm isotropic T2-weighted scan acquired in under 4 minutes, yields strong agreement with HF-derived surfaces: surface area correlates at r=0.96, cortical parcellations reach Dice=0.98, and gray matter volume achieves r=0.93. Cortical thickness remains more challenging with correlations up to r=0.70, reflecting the difficulty of sub-mm precision with 3mm voxels. We further validate our method on challenging postmortem LF-MRI, demonstrating its robustness. Our method represents a step toward enabling cortical surface analysis on portable LF-MRI. Code is available at this https URL
zh

[CV-150] Hyperspectral Image Land Cover Captioning Dataset for Vision Language Models

【速读】:该论文试图解决传统高光谱成像(hyperspectral imaging, HSI)数据集在遥感应用中仅专注于分类任务,缺乏对高光谱图像深层次语义理解的问题。其解决方案的关键在于构建HyperCap,这是首个大规模的高光谱描述生成数据集,通过将光谱数据与像素级文本标注相结合,提升模型在分类和特征提取等任务中的性能,从而为高级遥感应用提供有价值的资源。

链接: https://arxiv.org/abs/2505.12217
作者: Aryan Das,Tanishq Rachamalla,Pravendra Singh,Koushik Biswas,Vinay Kumar Verma,Swalpa Kumar Roy
机构: VIT, Bhopal; SAHE, Andhra Pradesh; IIT Roorkee; IIIT Delhi; Amazon India; Alipurduar Govt. Engg. and Mngt. College
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) datasets that focus solely on classification tasks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding of hyperspectral imagery. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field.
zh

[CV-151] Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

【速读】:该论文旨在解决农业遥感(Agricultural Remote Sensing, RS)领域缺乏全面基准评测体系的问题,现有基准在数据集场景多样性及任务设计复杂性方面存在明显不足。其解决方案的关键在于提出AgroMind,这是一个涵盖四个任务维度(空间感知、目标理解、场景理解和场景推理)的综合性农业遥感基准,包含13种任务类型,并通过整合八个公开数据集和一个私有农田地块数据集构建了高质量的评估集,包含25,026个问答对和15,556张图像,从而为大型多模态模型(Large Multimodal Models, LMMs)提供了一个标准化的评估框架。

链接: https://arxiv.org/abs/2505.12207
作者: Qingmei Li,Yang Zhang,Zurong Mai,Yuhang Chen,Shuohong Lou,Henglian Huang,Jiarui Zhang,Zhiwei Zhang,Yibin Wen,Weijia Li,Haohuan Fu,Jianxi Huang,Juepeng Zheng
机构: Tsinghua University (清华大学); Sun Yat-Sen University (中山大学); China Agricultural University (中国农业大学); Southwest Jiaotong University (西南交通大学); National Supercomputing Center in Shenzhen (深圳国家超算中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) has demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily in terms of insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating eight public datasets and one private farmland plot dataset, containing 25,026 QA pairs and 15,556 images. The pipeline begins with multi-source data preprocessing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses, and performing detailed examinations. We evaluated 18 open-source LMMs and 3 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition, it is notable that human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work. Data and code can be accessed at this https URL.
zh

[CV-152] Road Segmentation for ADAS/AD Applications

【速读】:该论文旨在解决自动驾驶和高级驾驶辅助系统(ADAS)中准确的道路分割问题,以实现复杂环境下的有效导航。研究的核心在于探讨模型架构和数据集选择对分割性能的影响,其关键解决方案是通过在Comma10k数据集上训练改进的VGG-16模型以及在KITTI Road数据集上训练改进的U-Net模型,分析不同架构与数据集组合下的分割效果,并通过F1-score、平均交并比等指标评估模型性能。

链接: https://arxiv.org/abs/2505.12206
作者: Mathanesh Vellingiri Ramasamy,Dimas Rizky Kurniasalim
机构: Chalmers University of Technology (查尔姆斯理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate road segmentation is essential for autonomous driving and ADAS, enabling effective navigation in complex environments. This study examines how model architecture and dataset choice affect segmentation by training a modified VGG-16 on the Comma10k dataset and a modified U-Net on the KITTI Road dataset. Both models achieved high accuracy, with cross-dataset testing showing VGG-16 outperforming U-Net despite U-Net being trained for more epochs. We analyze model performance using metrics such as F1-score, mean intersection over union, and precision, discussing how architecture and dataset impact results.
zh

[CV-153] CompBench: Benchmarking Complex Instruction-guided Image Editing

【速读】:该论文旨在解决现有指令引导图像编辑基准在任务复杂性上过于简化、缺乏全面且细粒度指令的问题。其关键解决方案是提出一个基于大语言模型与人类协作的框架,并引入指令解耦策略,将编辑意图分解为位置、外观、动态和对象四个关键维度,从而提升指令与复杂编辑需求之间的对齐度。

链接: https://arxiv.org/abs/2505.12200
作者: Bohan Jia,Wenxuan Huang,Yuntian Tang,Junbo Qiao,Jincheng Liao,Shaosheng Cao,Fei Zhao,Zhaopeng Feng,Zhouhong Gu,Zhenfei Yin,Lei Bai,Wanli Ouyang,Lin Chen,Fei Zhao,Zihan Wang,Yuan Xie,Shaohui Lin
机构: East China Normal University (华东师范大学); The Chinese University of Hong Kong (香港中文大学); Xiaohongshu Inc. (小红书公司); Zhejiang University (浙江大学); Fudan University (复旦大学); University of Oxford (牛津大学); Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While real-world applications increasingly demand intricate scene manipulation, existing instruction-guided image editing benchmarks often oversimplify task complexity and lack comprehensive, fine-grained instructions. To bridge this gap, we introduce, a large-scale benchmark specifically designed for complex instruction-guided image editing. CompBench features challenging editing scenarios that incorporate fine-grained instruction following, spatial and contextual reasoning, thereby enabling comprehensive evaluation of image editing models’ precise manipulation capabilities. To construct CompBench, We propose an MLLM-human collaborative framework with tailored task pipelines. Furthermore, we propose an instruction decoupling strategy that disentangles editing intents into four key dimensions: location, appearance, dynamics, and objects, ensuring closer alignment between instructions and complex editing requirements. Extensive evaluations reveal that CompBench exposes fundamental limitations of current image editing models and provides critical insights for the development of next-generation instruction-guided image editing systems.
zh

[CV-154] Always Clear Depth: Robust Monocular Depth Estimation under Adverse Weather

【速读】:该论文旨在解决单目深度估计在恶劣天气条件下性能下降的问题,这主要是由于领域偏移和场景信息提取困难所致。其解决方案的关键在于从高质量训练数据生成和领域自适应的角度出发,提出了一种名为ACDepth的方法。该方法通过引入一步扩散模型生成模拟恶劣天气条件的样本,构建多元退化数据集,并利用LoRA适配器优化生成权重以确保样本质量;同时结合循环一致性损失和对抗训练以保证场景内容的真实性和自然性。此外,通过多粒度知识蒸馏策略(MKD)和序数指导蒸馏机制(OGD),使学生网络能够从教师模型和预训练的Depth Anything V2中学习到与退化无关的场景信息,并聚焦于不确定区域,从而提升深度估计的精度。

链接: https://arxiv.org/abs/2505.12199
作者: Kui Jiang,Jing Cao,Zhaocheng Yu,Junjun Jiang,Jingchun Zhou
机构: Harbin Institute of Technology (哈尔滨工业大学); Dalian Maritime University (大连海事大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monocular depth estimation is critical for applications such as autonomous driving and scene reconstruction. While existing methods perform well under normal scenarios, their performance declines in adverse weather, due to challenging domain shifts and difficulties in extracting scene information. To address this issue, we present a robust monocular depth estimation method called \textbfACDepth from the perspective of high-quality training data generation and domain adaptation. Specifically, we introduce a one-step diffusion model for generating samples that simulate adverse weather conditions, constructing a multi-tuple degradation dataset during training. To ensure the quality of the generated degradation samples, we employ LoRA adapters to fine-tune the generation weights of diffusion model. Additionally, we integrate circular consistency loss and adversarial training to guarantee the fidelity and naturalness of the scene contents. Furthermore, we elaborate on a multi-granularity knowledge distillation strategy (MKD) that encourages the student network to absorb knowledge from both the teacher model and pretrained Depth Anything V2. This strategy guides the student model in learning degradation-agnostic scene information from various degradation inputs. In particular, we introduce an ordinal guidance distillation mechanism (OGD) that encourages the network to focus on uncertain regions through differential ranking, leading to a more precise depth estimation. Experimental results demonstrate that our ACDepth surpasses md4all-DD by 2.50% for night scene and 2.61% for rainy scene on the nuScenes dataset in terms of the absRel metric.
zh

[CV-155] Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum

【速读】:该论文旨在解决在噪声数据上应用自监督学习(Self-Supervised Learning, SSL)的挑战,尤其是在缺乏干净标签或预处理步骤的情况下。其关键解决方案是提出一种完全自监督的框架,通过在噪声数据上训练SSL去噪器,随后利用该去噪器构建从去噪数据到噪声数据的课程学习(denoised-to-noisy data curriculum),用于预训练SSL主干网络(如DINOv2),同时结合教师引导的正则化策略,将噪声嵌入锚定到对应的去噪嵌入上,从而增强模型的噪声鲁棒性。该方法无需在推理或微调阶段使用去噪器,简化了部署流程。

链接: https://arxiv.org/abs/2505.12191
作者: Wenquan Lu,Jiaqi Zhang,Hugues Van Assel,Randall Balestriero
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining a SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ( \sigma=255 , SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at this https URL.
zh

[CV-156] SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds

【速读】:该论文试图解决传统分割评估指标在处理实例分割任务时存在的局限性,即依赖于二值决策逻辑,将预测结果简单地划分为正确或错误,忽视了不同错误类型的质的差异以及模型改进的渐进性。其解决方案的关键在于提出SoftPQ,一种灵活且可解释的实例分割评估指标,通过引入可调节的上下界IoU阈值定义部分匹配区域,并应用非线性惩罚函数处理模糊或碎片化的预测,从而将评估转化为一个连续的评分体系,而非二分类问题。

链接: https://arxiv.org/abs/2505.12155
作者: Ranit Karmakar,Simon F. Nørrelykke
机构: Harvard Medical School (哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Segmentation evaluation metrics traditionally rely on binary decision logic: predictions are either correct or incorrect, based on rigid IoU thresholds. Detection–based metrics such as F1 and mAP determine correctness at the object level using fixed overlap cutoffs, while overlap–based metrics like Intersection over Union (IoU) and Dice operate at the pixel level, often overlooking instance–level structure. Panoptic Quality (PQ) attempts to unify detection and segmentation assessment, but it remains dependent on hard-threshold matching–treating predictions below the threshold as entirely incorrect. This binary framing obscures important distinctions between qualitatively different errors and fails to reward gradual model improvements. We propose SoftPQ, a flexible and interpretable instance segmentation metric that redefines evaluation as a graded continuum rather than a binary classification. SoftPQ introduces tunable upper and lower IoU thresholds to define a partial matching region and applies a sublinear penalty function to ambiguous or fragmented predictions. These extensions allow SoftPQ to exhibit smoother score behavior, greater robustness to structural segmentation errors, and more informative feedback for model development and evaluation. Through controlled perturbation experiments, we show that SoftPQ captures meaningful differences in segmentation quality that existing metrics overlook, making it a practical and principled alternative for both benchmarking and iterative model refinement.
zh

[CV-157] Learning to Highlight Audio by Watching Movies CVPR2025

【速读】:该论文试图解决视频与音频内容在感知上的不协调问题,即视觉和听觉显著性之间的脱节。为了解决这一问题,作者提出了一项新的任务:视觉引导的声学强调(visually-guided acoustic highlighting),旨在通过视频内容指导音频的处理,以实现更和谐的音画体验。该解决方案的关键在于提出了一种基于Transformer的多模态框架,能够有效融合视觉和音频信息,并通过引入一个新的数据集——muddy mix dataset,利用电影中精细制作的音视频内容进行训练,从而实现对音频的高质量增强。

链接: https://arxiv.org/abs/2505.12154
作者: Chao Huang,Ruohan Gao,J. M. F. Tsang,Jan Kurcius,Cagdas Bilen,Chenliang Xu,Anurag Kumar,Sanjeel Parekh
机构: University of Rochester(罗切斯特大学); University of Maryland, College Park(马里兰大学学院公园分校); Meta Reality Labs Research(元宇宙现实实验室研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:Recent years have seen a significant increase in video content creation and consumption. Crafting engaging content requires the careful curation of both visual and audio elements. While visual cue curation, through techniques like optimal viewpoint selection or post-editing, has been central to media production, its natural counterpart, audio, has not undergone equivalent advancements. This often results in a disconnect between visual and acoustic saliency. To bridge this gap, we introduce a novel task: visually-guided acoustic highlighting, which aims to transform audio to deliver appropriate highlighting effects guided by the accompanying video, ultimately creating a more harmonious audio-visual experience. We propose a flexible, transformer-based multimodal framework to solve this task. To train our model, we also introduce a new dataset – the muddy mix dataset, leveraging the meticulous audio and video crafting found in movies, which provides a form of free supervision. We develop a pseudo-data generation process to simulate poorly mixed audio, mimicking real-world scenarios through a three-step process – separation, adjustment, and remixing. Our approach consistently outperforms several baselines in both quantitative and subjective evaluation. We also systematically study the impact of different types of contextual guidance and difficulty levels of the dataset. Our project page is here: this https URL.
zh

[CV-158] Keypoints as Dynamic Centroids for Unified Human Pose and Segmentation

【速读】:该论文旨在解决人体姿态估计与实例级分割中的动态运动挑战,特别是在重叠关节或快速变化姿态的场景下,现有方法在结合关键点热图与分割掩码时表现不佳。其解决方案的关键在于提出Keypoints as Dynamic Centroid (KDC),这是一种基于中心点的统一人体姿态估计与实例级分割表示方法。KDC采用自底向上的范式生成关键点热图,并通过引入KeyCentroids提升关键点检测与置信度评分;同时利用高置信度关键点作为嵌入空间中的动态中心点(MaskCentroids),实现快速像素聚类以应对实时环境中的快速身体运动。

链接: https://arxiv.org/abs/2505.12130
作者: Niaz Ahmad,Jawad Khan,Kang G. Shin,Youngmoon Lee,Guanghui Wang
机构: Toronto Metropolitan University (多伦多都会大学); Gachon University (伽耶大学); University of Michigan (密歇根大学); Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The dynamic movement of the human body presents a fundamental challenge for human pose estimation and body segmentation. State-of-the-art approaches primarily rely on combining keypoint heatmaps with segmentation masks but often struggle in scenarios involving overlapping joints or rapidly changing poses during instance-level segmentation. To address these limitations, we propose Keypoints as Dynamic Centroid (KDC), a new centroid-based representation for unified human pose estimation and instance-level segmentation. KDC adopts a bottom-up paradigm to generate keypoint heatmaps for both easily distinguishable and complex keypoints and improves keypoint detection and confidence scores by introducing KeyCentroids using a keypoint disk. It leverages high-confidence keypoints as dynamic centroids in the embedding space to generate MaskCentroids, allowing for swift clustering of pixels to specific human instances during rapid body movements in live environments. Our experimental evaluations on the CrowdPose, OCHuman, and COCO benchmarks demonstrate KDC’s effectiveness and generalizability in challenging scenarios in terms of both accuracy and runtime performance. The implementation is available at: this https URL.
zh

[CV-159] Behind the Screens: Uncovering Bias in AI-Driven Video Interview Assessments Using Counterfactuals

【速读】:该论文试图解决AI增强型人格评估在招聘决策中可能加剧的偏见问题,尤其是在训练数据中存在的偏见可能导致基于性别、种族和年龄等受保护属性的歧视性结果。解决方案的关键在于提出一种基于反事实的框架,利用生成对抗网络(GANs)生成改变受保护属性的申请人反事实表示,从而在不访问底层模型的情况下进行公平性分析,并支持多模态评估(包括视觉、音频和文本特征)。该方法为商业AI招聘平台的公平性审计提供了一种可扩展的工具,尤其适用于训练数据和模型内部不可访问的黑箱场景。

链接: https://arxiv.org/abs/2505.12114
作者: Dena F. Mujtaba,Nihar R. Mahapatra
机构: Michigan State University (密歇根州立大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:AI-enhanced personality assessments are increasingly shaping hiring decisions, using affective computing to predict traits from the Big Five (OCEAN) model. However, integrating AI into these assessments raises ethical concerns, especially around bias amplification rooted in training data. These biases can lead to discriminatory outcomes based on protected attributes like gender, ethnicity, and age. To address this, we introduce a counterfactual-based framework to systematically evaluate and quantify bias in AI-driven personality assessments. Our approach employs generative adversarial networks (GANs) to generate counterfactual representations of job applicants by altering protected attributes, enabling fairness analysis without access to the underlying model. Unlike traditional bias assessments that focus on unimodal or static data, our method supports multimodal evaluation-spanning visual, audio, and textual features. This comprehensive approach is particularly important in high-stakes applications like hiring, where third-party vendors often provide AI systems as black boxes. Applied to a state-of-the-art personality prediction model, our method reveals significant disparities across demographic groups. We also validate our framework using a protected attribute classifier to confirm the effectiveness of our counterfactual generation. This work provides a scalable tool for fairness auditing of commercial AI hiring platforms, especially in black-box settings where training data and model internals are inaccessible. Our results highlight the importance of counterfactual approaches in improving ethical transparency in affective computing.
zh

[CV-160] EarthSynth: Generating Informative Earth Observation with Diffusion Models

【速读】:该论文旨在解决遥感图像(Remote Sensing Image, RSI)解释中因标注数据稀缺而导致的性能受限问题。其解决方案的关键在于提出EarthSynth,一个基于扩散的生成式基础模型,能够合成多类别、跨卫星的地球观测标注数据,以支持下游RSI解释任务。EarthSynth通过反事实组合训练策略提升训练数据多样性并增强类别控制能力,并引入基于规则的R-Filter方法筛选更具信息量的合成数据,从而为开放世界场景下的场景分类、目标检测和语义分割提供有效的解决方案。

链接: https://arxiv.org/abs/2505.12108
作者: Jiancheng Pan,Shiye Lei,Yuqian Fu,Jiahao Li,Yanxing Liu,Yuze Sun,Xiao He,Long Peng,Xiaomeng Huang,Bo Zhao
机构: Tsinghua University (清华大学); University of Sydney (悉尼大学); INSAIT (INSAIT); Sofia University “St. Kliment Ohridski” (索非亚大学“圣基里尔·奥赫里德斯基”); University of Chinese Academy of Sciences (中国科学院大学); Wuhan University (武汉大学); University of Science and Technology of China (中国科学技术大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages

点击查看摘要

Abstract:Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks. To tackle this challenge, we propose EarthSynth, a diffusion-based generative foundation model that enables synthesizing multi-category, cross-satellite labeled Earth observation for downstream RSI interpretation tasks. To the best of our knowledge, EarthSynth is the first to explore multi-task generation for remote sensing. EarthSynth, trained on the EarthSynth-180K dataset, employs the Counterfactual Composition training strategy to improve training data diversity and enhance category control. Furthermore, a rule-based method of R-Filter is proposed to filter more informative synthetic data for downstream tasks. We evaluate our EarthSynth on scene classification, object detection, and semantic segmentation in open-world scenarios, offering a practical solution for advancing RSI interpretation.
zh

[CV-161] nyRS-R1: Compact Multimodal Language Model for Remote Sensing BMVC2025

【速读】:该论文旨在解决当前远程感知任务中无法在边缘硬件上部署大规模多模态语言模型(Multimodal Language Model, MLLM)的问题。其关键解决方案是提出TinyRS及其推理增强版本TinyRS-R1,这是一个2B参数的多模态小语言模型(Multimodal Small Language Model, MSLM),通过四阶段训练流程(包括预训练、指令调优、基于思维链(Chain-of-Thought, CoT)注释的微调以及通过Group Relative Policy Optimization (GRPO) 的对齐)进行优化,从而在保持较低内存和延迟的同时,实现与7B参数远程感知模型相当或更优的性能。

链接: https://arxiv.org/abs/2505.12099
作者: Aybora Koksal,A. Aydin Alatan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to BMVC 2025. Code, models, and the captions for datasets will be released

点击查看摘要

Abstract:Remote-sensing applications often run on edge hardware that cannot host today’s 7B-parameter multimodal language models. This paper introduces TinyRS, the first 2B-parameter multimodal small language model (MSLM) optimized for remote sensing tasks, and TinyRS-R1, its reasoning-augmented variant. Built upon Qwen2-VL-2B, TinyRS is trained through a four-stage pipeline: pre-training on million satellite images, instruction tuning on visual instruction examples, fine-tuning with Chain-of-Thought (CoT) annotations from the proposed reasoning dataset, and alignment via Group Relative Policy Optimization (GRPO). TinyRS-R1 achieves or surpasses the performance of recent 7B-parameter remote sensing models across classification, VQA, visual grounding, and open-ended question answering-while requiring just one-third of the memory and latency. Our analysis shows that CoT reasoning substantially benefits spatial grounding and scene understanding, while the non-reasoning TinyRS excels in concise, latency-sensitive VQA tasks. TinyRS-R1 represents the first domain-specialized MSLM with GRPO-aligned CoT reasoning for general-purpose remote sensing.
zh

[CV-162] LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation

【速读】:该论文旨在解决当前人工智能生成视频(AIGV)在感知质量和文本-视频对齐方面的局限性,从而提供一个可靠且可扩展的自动评估模型。其解决方案的关键在于构建了一个名为AIGVE-60K的全面数据集和基准,该数据集包含3,050个广泛提示、120K个平均意见分数(MOS)以及60K个问答对,并支持双向评估文本到视频(T2V)生成与视频到文本(V2T)解释能力。基于此数据集,作者提出了LOVE,一种基于大语言模型(LLM)的多维度评估指标,能够从感知偏好、文本-视频对应关系及任务特定准确性等方面进行评估。

链接: https://arxiv.org/abs/2505.12098
作者: Jiarui Wang,Huiyu Duan,Ziheng Jia,Yu Zhao,Woo Yi Yang,Zicheng Zhang,Zijian Chen,Juntong Wang,Yuke Xing,Guangtao Zhai,Xiongkuo Min
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in large multimodal models (LMMs) have driven substantial progress in both text-to-video (T2V) generation and video-to-text (V2T) interpretation tasks. However, current AI-generated videos (AIGVs) still exhibit limitations in terms of perceptual quality and text-video alignment. Therefore, a reliable and scalable automatic model for AIGV evaluation is desirable, which heavily relies on the scale and quality of human annotations. To this end, we present AIGVE-60K, a comprehensive dataset and benchmark for AI-Generated Video Evaluation, which features (i) comprehensive tasks, encompassing 3,050 extensive prompts across 20 fine-grained task dimensions, (ii) the largest human annotations, including 120K mean-opinion scores (MOSs) and 60K question-answering (QA) pairs annotated on 58,500 videos generated from 30 T2V models, and (iii) bidirectional benchmarking and evaluating for both T2V generation and V2T interpretation capabilities. Based on AIGVE-60K, we propose LOVE, a LMM-based metric for AIGV Evaluation from multiple dimensions including perceptual preference, text-video correspondence, and task-specific accuracy in terms of both instance level and model level. Comprehensive experiments demonstrate that LOVE not only achieves state-of-the-art performance on the AIGVE-60K dataset, but also generalizes effectively to a wide range of other AIGV evaluation benchmarks. These findings highlight the significance of the AIGVE-60K dataset. Database and codes are anonymously available at this https URL.
zh

[CV-163] VisionReason er: Unified Visual Perception and Reasoning via Reinforcement Learning

【速读】:该论文旨在解决多视觉感知任务在统一框架下的协同处理问题,即如何在一个模型中实现对检测、分割和计数等不同视觉任务的高效推理与求解。解决方案的关键在于设计新颖的多对象认知学习策略和系统化的任务重构方法,从而增强模型的推理能力,并在统一框架下处理多样化的视觉感知任务。

链接: https://arxiv.org/abs/2505.12081
作者: Yuqi Liu,Tianyuan Qu,Zhisheng Zhong,Bohao Peng,Shu Liu,Bei Yu,Jiaya Jia
机构: CUHK(香港中文大学); SmartMore(智元机器人); HKUST(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing novel multi-object cognitive learning strategies and systematic task reformulation, VisionReasoner enhances its reasoning capabilities to analyze visual inputs, and addresses diverse perception tasks in a unified framework. The model generates a structured reasoning process before delivering the desired outputs responding to user queries. To rigorously assess unified visual perception capabilities, we evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting. Experimental results show that VisionReasoner achieves superior performance as a unified model, outperforming Qwen2.5VL by relative margins of 29.1% on COCO (detection), 22.1% on ReasonSeg (segmentation), and 15.3% on CountBench (counting).
zh

[CV-164] Denoising Mutual Knowledge Distillation in Bi-Directional Multiple Instance Learning

【速读】:该论文旨在解决多实例学习(Multiple Instance Learning, MIL)在全切片图像分类中难以准确学习袋级和实例级分类器的问题。尽管MIL避免了监督学习中细粒度标注的繁琐过程,但其在实例级别上的分类能力仍存在局限。论文提出的解决方案关键在于通过从弱到强的泛化技术引入伪标签校正能力,增强袋级和实例级的学习过程,从而提升双层级MIL算法在袋级和实例级预测上的性能。

链接: https://arxiv.org/abs/2505.12074
作者: Chen Shu,Boyu Fu,Yiman Li,Ting Yin,Wenchuan Zhang,Jie Chen,Yuhao Yi,Hong Bu
机构: Sichuan University(四川大学); Sichuan University-Pittsburgh Institute(四川大学-匹兹堡学院); Department of Pathology, West China Hospital, Sichuan University(华西医院病理科,四川大学); Institute of Clinical Pathology, West China Hospital, Sichuan University(华西医院临床病理研究所,四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Multiple Instance Learning is the predominant method for Whole Slide Image classification in digital pathology, enabling the use of slide-level labels to supervise model training. Although MIL eliminates the tedious fine-grained annotation process for supervised learning, whether it can learn accurate bag- and instance-level classifiers remains a question. To address the issue, instance-level classifiers and instance masks were incorporated to ground the prediction on supporting patches. These methods, while practically improving the performance of MIL methods, may potentially introduce noisy labels. We propose to bridge the gap between commonly used MIL and fully supervised learning by augmenting both the bag- and instance-level learning processes with pseudo-label correction capabilities elicited from weak to strong generalization techniques. The proposed algorithm improves the performance of dual-level MIL algorithms on both bag- and instance-level predictions. Experiments on public pathology datasets showcase the advantage of the proposed methods.
zh

[CV-165] MT-CYP-Net: Multi-Task Network for Pixel-Level Crop Yield Prediction Under Very Few Samples

【速读】:该论文旨在解决基于卫星遥感数据的像素级作物产量估计准确性受限于地面真实数据稀缺的问题。其解决方案的关键在于提出一种名为多任务作物产量预测网络(MT-CYP-Net)的新方法,该方法通过引入有效的多任务特征共享策略,使从共享主干网络中提取的特征能够同时被作物产量预测解码器和作物分类解码器使用,并具备信息融合能力,从而在仅有极少量作物产量点标签和作物类型标签的情况下,仍能生成详细的像素级作物产量图。

链接: https://arxiv.org/abs/2505.12069
作者: Shenzhou Liu,Di Wang,Haonan Guo,Chengxi Han,Wenzhi Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate and fine-grained crop yield prediction plays a crucial role in advancing global agriculture. However, the accuracy of pixel-level yield estimation based on satellite remote sensing data has been constrained by the scarcity of ground truth data. To address this challenge, we propose a novel approach called the Multi-Task Crop Yield Prediction Network (MT-CYP-Net). This framework introduces an effective multi-task feature-sharing strategy, where features extracted from a shared backbone network are simultaneously utilized by both crop yield prediction decoders and crop classification decoders with the ability to fuse information between them. This design allows MT-CYP-Net to be trained with extremely sparse crop yield point labels and crop type labels, while still generating detailed pixel-level crop yield maps. Concretely, we collected 1,859 yield point labels along with corresponding crop type labels and satellite images from eight farms in Heilongjiang Province, China, in 2023, covering soybean, maize, and rice crops, and constructed a sparse crop yield label dataset. MT-CYP-Net is compared with three classical machine learning and deep learning benchmark methods in this dataset. Experimental results not only indicate the superiority of MT-CYP-Net compared to previous methods on multiple types of crops but also demonstrate the potential of deep networks on precise pixel-level crop yield prediction, especially with limited data labels.
zh

[CV-166] Beluga Whale Detection from Satellite Imagery with Point Labels WWW

【速读】:该论文试图解决现有基于深度学习的鲸类检测方法依赖人工创建的高质量边界框标注所带来的劳动密集型问题,以及排除“不确定鲸类”导致模型在实际应用场景中适用性受限的问题。解决方案的关键在于引入一种自动化流程,利用点标注和Segment Anything Model (SAM)生成精确的边界框标注,进而训练YOLOv8进行多类别检测,从而减少人工标注工作量并提升对不确定鲸类的检测能力。

链接: https://arxiv.org/abs/2505.12066
作者: Yijie Zheng,Jinxuan Yang,Yu Chen,Yaxuan Wang,Yihang Lu,Guoqing Li
机构: Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院遥感与数字地球研究所); University of Chinese Academy of Sciences(中国科学院大学); Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences(中国科学院地理科学与资源研究所); Chinese Research Academy of Environmental Sciences(中国环境科学研究院); School of Future Technology, University of Chinese Academy of Sciences(中国科学院大学未来技术学院); Technical Institute of Physics and Chemistry, Chinese Academy of Sciences(中国科学院物理化学技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for oral presentation at IGARSS 2025. Session at this https URL

点击查看摘要

Abstract:Very high-resolution (VHR) satellite imagery has emerged as a powerful tool for monitoring marine animals on a large scale. However, existing deep learning-based whale detection methods usually require manually created, high-quality bounding box annotations, which are labor-intensive to produce. Moreover, existing studies often exclude ``uncertain whales’', individuals that have ambiguous appearances in satellite imagery, limiting the applicability of these models in real-world scenarios. To address these limitations, this study introduces an automated pipeline for detecting beluga whales and harp seals in VHR satellite imagery. The pipeline leverages point annotations and the Segment Anything Model (SAM) to generate precise bounding box annotations, which are used to train YOLOv8 for multiclass detection of certain whales, uncertain whales, and harp seals. Experimental results demonstrated that SAM-generated annotations significantly improved detection performance, achieving higher \textF_\text1 -scores compared to traditional buffer-based annotations. YOLOv8 trained on SAM-labeled boxes achieved an overall \textF_\text1 -score of 72.2% for whales overall and 70.3% for harp seals, with superior performance in dense scenes. The proposed approach not only reduces the manual effort required for annotation but also enhances the detection of uncertain whales, offering a more comprehensive solution for marine animal monitoring. This method holds great potential for extending to other species, habitats, and remote sensing platforms, as well as for estimating whale biometrics, thereby advancing ecological monitoring and conservation efforts. The codes for our label and detection pipeline are publicly available at this http URL .
zh

[CV-167] VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption

【速读】:该论文旨在解决基于潜在扩散模型的现代视频生成框架中由于“帧比例信息假设”导致的效率问题,即现有分词器提供固定的时序压缩率,使得扩散模型的计算成本随帧率线性增长。其解决方案的关键在于提出“持续时间比例信息假设”,即视频的信息容量上限与持续时间成正比而非帧数,并据此引入基于Transformer的视频分词器VFRTok,通过编码器与解码器之间的非对称帧率训练实现可变帧率的编码与解码。此外,该方法还引入部分旋转位置嵌入(Partial RoPE)以解耦位置与内容建模,提升内容感知能力,从而在仅使用1/8令牌的情况下实现竞争性的重建质量和最先进的生成保真度。

链接: https://arxiv.org/abs/2505.12053
作者: Tianxiong Zhong,Xingye Tian,Boyuan Jiang,Xuebo Wang,Xin Tao,Pengfei Wan,Zhiwei Zhang
机构: Beijing Institute of Technology (北京理工大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 10 figures

点击查看摘要

Abstract:Modern video generation frameworks based on Latent Diffusion Models suffer from inefficiencies in tokenization due to the Frame-Proportional Information Assumption. Existing tokenizers provide fixed temporal compression rates, causing the computational cost of the diffusion model to scale linearly with the frame rate. The paper proposes the Duration-Proportional Information Assumption: the upper bound on the information capacity of a video is proportional to the duration rather than the number of frames. Based on this insight, the paper introduces VFRTok, a Transformer-based video tokenizer, that enables variable frame rate encoding and decoding through asymmetric frame rate training between the encoder and decoder. Furthermore, the paper proposes Partial Rotary Position Embeddings (RoPE) to decouple position and content modeling, which groups correlated patches into unified tokens. The Partial RoPE effectively improves content-awareness, enhancing the video generation capability. Benefiting from the compact and continuous spatio-temporal representation, VFRTok achieves competitive reconstruction quality and state-of-the-art generation fidelity while using only 1/8 tokens compared to existing tokenizers.
zh

[CV-168] Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion ICDM

【速读】:该论文旨在解决 hate video(仇恨视频)检测的难题,尤其是在多模态数据中有效捕捉视频、音频和文本之间的复杂交互与时间动态的问题。现有方法多采用单模态或未能充分整合模态间关系的多模态技术,导致对隐含性仇恨内容的识别效果有限。论文提出的解决方案关键在于 CMFusion 模型,其核心创新是通道级和模态级融合机制(Channel-wise and Modality-wise Fusion Mechanism),通过预训练模型提取多模态特征,并结合时间交叉注意力机制捕捉视频与音频流间的依赖关系,从而获得更具表征力的视频表示。

链接: https://arxiv.org/abs/2505.12051
作者: Yinghui Zhang,Tailin Chen,Yuchen Zhang,Zeyu Fu
机构: University of Exeter (埃克塞特大学); University of Essex (埃塞克斯大学)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICDMW 2024, Github: this https URL

点击查看摘要

Abstract:The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination, but it has also facilitated the spread of harmful content, particularly hate videos. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. Current detection methods primarily rely on unimodal approaches, which inadequately capture the complementary features across different modalities. While multimodal techniques offer a broader perspective, many fail to effectively integrate temporal dynamics and modality-wise interactions essential for identifying nuanced hate content. In this paper, we present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts features from text, audio, and video modalities using pre-trained models and then incorporates a temporal cross-attention mechanism to capture dependencies between video and audio streams. The learned features are then processed by channel-wise and modality-wise fusion modules to obtain informative representations of videos. Our extensive experiments on a real-world dataset demonstrate that CMFusion significantly outperforms five widely used baselines in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation studies and parameter analyses further validate our design choices, highlighting the model’s effectiveness in detecting hate videos. The source codes will be made publicly available at this https URL.
zh

[CV-169] Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

【速读】:该论文旨在解决基于扩散模型的超分辨率(SR)方法在计算成本上的问题,尤其是其高迭代次数导致的训练和推理效率低下。解决方案的关键在于提出一种无需额外训练成本的时间-空间感知采样策略(Time-Spatial-aware Sampling, TSS),该策略通过结合时间动态采样(Time Dynamic Sampling, TDS)和空间动态采样(Spatial Dynamic Sampling, SDS)来优化扩散过程,从而在减少迭代次数的同时提升图像质量。

链接: https://arxiv.org/abs/2505.12048
作者: Rui Qin,Qijie Wang,Ming Sun,Haowei Zhu,Chao Zhou,Bin Wang
机构: Tsinghua University (清华大学); BNRist (北京信息科学技术研究院); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in SR tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like super-resolution (SR). In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for the acceleration of Diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2 - 3.0 and outperforming the current acceleration methods with only half the number of steps.
zh

[CV-170] FIGhost: Fluorescent Ink-based Stealthy and Flexible Backdoor Attacks on Physical Traffic Sign Recognition

【速读】:该论文旨在解决交通标志识别(Traffic Sign Recognition, TSR)系统在自动驾驶中面临的后门攻击问题。现有物理后门攻击方法存在隐蔽性不足、攻击控制不灵活或忽视新兴的视觉-大语言模型(Vision-Large-Language-Models, VLMs)的问题。论文提出的解决方案关键在于引入FIGhost,这是一种利用荧光墨水作为触发器的物理世界后门攻击方法。荧光触发器在正常条件下不可见,通过紫外线激活,具备优异的隐蔽性、灵活性和不可追踪性。此外,通过基于插值的荧光模拟算法增强触发器的鲁棒性,并开发了自动化后门样本生成方法以支持多种攻击目标。

链接: https://arxiv.org/abs/2505.12045
作者: Shuai Yuan,Guowen Xu,Hongwei Li,Rui Zhang,Xinyuan Qian,Wenbo Jiang,Hangcheng Cao,Qingchuan Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traffic sign recognition (TSR) systems are crucial for autonomous driving but are vulnerable to backdoor attacks. Existing physical backdoor attacks either lack stealth, provide inflexible attack control, or ignore emerging Vision-Large-Language-Models (VLMs). In this paper, we introduce FIGhost, the first physical-world backdoor attack leveraging fluorescent ink as triggers. Fluorescent triggers are invisible under normal conditions and activated stealthily by ultraviolet light, providing superior stealthiness, flexibility, and untraceability. Inspired by real-world graffiti, we derive realistic trigger shapes and enhance their robustness via an interpolation-based fluorescence simulation algorithm. Furthermore, we develop an automated backdoor sample generation method to support three attack objectives. Extensive evaluations in the physical world demonstrate FIGhost’s effectiveness against state-of-the-art detectors and VLMs, maintaining robustness under environmental variations and effectively evading existing defenses.
zh

[CV-171] Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment

【速读】:该论文试图解决任务算术(task arithmetic)在跨模型迁移设置中的适用性问题,即当源模型和目标模型独立地在不同数据集上预训练时,传统任务算术方法因假设两者从相同的预训练参数初始化而无法有效应用。解决方案的关键在于提出一种基于少量样本正交对齐的方法,该方法将任务向量对齐到不同预训练的目标模型参数空间中,同时保留任务向量的范数和秩等关键特性,并仅使用少量标注样本进行学习。

链接: https://arxiv.org/abs/2505.12021
作者: Kazuhiko Kawamoto,Atsuhiro Endo,Hiroshi Kera
机构: Chiba University (千叶大学); Hitachi, Ltd. (日立有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:Task arithmetic enables efficient model editing by representing task-specific changes as vectors in parameter space. Task arithmetic typically assumes that the source and target models are initialized from the same pre-trained parameters. This assumption limits its applicability in cross-model transfer settings, where models are independently pre-trained on different datasets. To address this challenge, we propose a method based on few-shot orthogonal alignment, which aligns task vectors to the parameter space of a differently pre-trained target model. These transformations preserve key properties of task vectors, such as norm and rank, and are learned using only a small number of labeled examples. We evaluate the method using two Vision Transformers pre-trained on YFCC100M and LAION400M, and test on eight classification datasets. Experimental results show that our method improves transfer accuracy over direct task vector application and achieves performance comparable to few-shot fine-tuning, while maintaining the modularity and reusability of task vectors. Our code is available at this https URL.
zh

[CV-172] Black-box Adversaries from Latent Space: Unnoticeable Attacks on Human Pose and Shape Estimation

【速读】:该论文旨在解决现有表达性人体姿态与形状(EHPS)模型在安全漏洞方面的研究不足问题,特别是针对对抗攻击的防御能力较弱。现有针对EHPS模型的对抗攻击通常需要白盒访问或生成视觉明显的扰动,限制了其实际应用和对真实世界安全威胁的暴露能力。论文提出的解决方案是设计一种无感知的黑盒攻击(Unnoticeable Black-Box Attack, UBA),其关键在于利用自然图像的潜在空间表示生成最优对抗噪声,并在数字空间中沿优化方向迭代提升攻击效果,仅依赖模型输出进行查询,无需了解模型内部结构,从而实现更高的隐蔽性和有效性。

链接: https://arxiv.org/abs/2505.12009
作者: Zhiying Li,Guanggang Geng,Yeying Jin,Zhizhi Guo,Bruce Gu,Jidong Huo,Zhaoxin Fan,Wenjun Wu
机构: Jinan University (济南大学); Beihang University (北京航空航天大学); National University of Singapore (新加坡国立大学); Teleai; National Supercomputer Center (国家超算中心); Qilu University of Technology (齐鲁工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:Expressive human pose and shape (EHPS) estimation is vital for digital human generation, particularly in live-streaming applications. However, most existing EHPS models focus primarily on minimizing estimation errors, with limited attention on potential security vulnerabilities. Current adversarial attacks on EHPS models often require white-box access (e.g., model details or gradients) or generate visually conspicuous perturbations, limiting their practicality and ability to expose real-world security threats. To address these limitations, we propose a novel Unnoticeable Black-Box Attack (UBA) against EHPS models. UBA leverages the latent-space representations of natural images to generate an optimal adversarial noise pattern and iteratively refine its attack potency along an optimized direction in digital space. Crucially, this process relies solely on querying the model’s output, requiring no internal knowledge of the EHPS architecture, while guiding the noise optimization toward greater stealth and effectiveness. Extensive experiments and visual analyses demonstrate the superiority of UBA. Notably, UBA increases the pose estimation errors of EHPS models by 17.27%-58.21% on average, revealing critical vulnerabilities. These findings underscore the urgent need to address and mitigate security risks associated with digital human generation systems.
zh

[CV-173] Multi-modal Collaborative Optimization and Expansion Network for Event-assisted Single-eye Expression Recognition

【速读】:该论文旨在解决单眼表情识别任务中面临的低光照、高曝光和高动态范围等挑战。其解决方案的关键在于提出一种多模态协同优化与扩展网络(MCO-E Net),其中包含两个创新设计:多模态协同优化Mamba(MCO-Mamba)和异构协同与扩展混合专家(HCE-MoE)。MCO-Mamba通过双模态信息联合优化模型,促进模态语义的协同交互与融合;而HCE-MoE则通过动态路由机制分配结构各异的专家(深度、注意力和焦点专家),实现互补语义的协同学习,从而系统性地整合多种特征提取范式,全面捕捉表情语义。

链接: https://arxiv.org/abs/2505.12007
作者: Runduo Han,Xiuping Liu,Shangxuan Yi,Yi Zhang,Hongchen Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we proposed a Multi-modal Collaborative Optimization and Expansion Network (MCO-E Net), to use event modalities to resist challenges such as low light, high exposure, and high dynamic range in single-eye expression recognition tasks. The MCO-E Net introduces two innovative designs: Multi-modal Collaborative Optimization Mamba (MCO-Mamba) and Heterogeneous Collaborative and Expansion Mixture-of-Experts (HCE-MoE). MCO-Mamba, building upon Mamba, leverages dual-modal information to jointly optimize the model, facilitating collaborative interaction and fusion of modal semantics. This approach encourages the model to balance the learning of both modalities and harness their respective strengths. HCE-MoE, on the other hand, employs a dynamic routing mechanism to distribute structurally varied experts (deep, attention, and focal), fostering collaborative learning of complementary semantics. This heterogeneous architecture systematically integrates diverse feature extraction paradigms to comprehensively capture expression semantics. Extensive experiments demonstrate that our proposed network achieves competitive performance in the task of single-eye expression recognition, especially under poor lighting conditions.
zh

[CV-174] CHRIS: Clothed Human Reconstruction with Side View Consistency ICME2025

【速读】:该论文旨在解决从单视角RGB图像重建真实着装人体时存在的全局拓扑不真实和局部表面不一致问题,尤其是在侧视图中表现不佳。其解决方案的关键在于提出CHRIS框架,该框架包含两个核心组件:1)侧视图法线判别器,用于通过区分生成的侧视图法线与真实法线来增强全局视觉合理性;2)多对一梯度计算(M2O),通过整合邻近点的梯度来计算采样点的梯度,从而确保局部表面一致性。

链接: https://arxiv.org/abs/2505.12005
作者: Dong Liu,Yifan Yang,Zixiong Huang,Yuxin Gao,Mingkui Tan
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICME 2025

点击查看摘要

Abstract:Creating a realistic clothed human from a single-view RGB image is crucial for applications like mixed reality and filmmaking. Despite some progress in recent years, mainstream methods often fail to fully utilize side-view information, as the input single-view image contains front-view information only. This leads to globally unrealistic topology and local surface inconsistency in side views. To address these, we introduce Clothed Human Reconstruction with Side View Consistency, namely CHRIS, which consists of 1) A Side-View Normal Discriminator that enhances global visual reasonability by distinguishing the generated side-view normals from the ground truth ones; 2) A Multi-to-One Gradient Computation (M2O) that ensures local surface consistency. M2O calculates the gradient of a sampling point by integrating the gradients of the nearby points, effectively acting as a smooth operation. Experimental results demonstrate that CHRIS achieves state-of-the-art performance on public benchmarks and outperforms the prior work.
zh

[CV-175] IQBench: How "Smart Are Vision-Language Models? A Study with Human IQ Tests

【速读】:该论文试图解决当前大规模视觉-语言模型(VLMs)在人类智商测试中的真实推理能力尚未得到充分探索的问题,旨在推动对VLMs流体智能的研究。其解决方案的关键在于引入IQBench,一个专门用于评估VLMs在标准化视觉智商测试中的基准,该基准以视觉为中心,最大限度地减少对非必要文本内容的依赖,从而促使模型主要从图像信息中推导答案,而非依赖已学习的文本知识。此外,通过人工收集和标注500个视觉智商问题,防止训练过程中的无意数据泄露,并通过评估模型的解释、解题模式以及最终预测的准确性与人类评价来综合衡量模型的推理能力。

链接: https://arxiv.org/abs/2505.12000
作者: Tan-Hanh Pham,Phu-Vinh Nguyen,Dang The Hung,Bui Trong Duong,Vu Nguyen Thanh,Chris Ngo,Tri Quang Truong,Truong-Son Hy
机构: Harvard Medical School (哈佛医学院); Uppsala University (乌普萨拉大学); University of London (伦敦大学); Vietnam Military Medical University (越南军事医学院); University of Technical Education Ho Chi Minh City (胡志明市技术教育大学); Knovel Engineering Lab (诺沃尔工程实验室); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IQ Test for Multimodal Models

点击查看摘要

Abstract:Although large Vision-Language Models (VLMs) have demonstrated remarkable performance in a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce IQBench, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction. Our benchmark is visually centric, minimizing the dependence on unnecessary textual content, thus encouraging models to derive answers primarily from image-based information rather than learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to prevent unintentional data leakage during training. Unlike prior work that focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns used to solve each problem, along with the accuracy of the final prediction and human evaluation. Our experiments show that there are substantial performance disparities between tasks, with models such as o4-mini, gemini-2.5-flash, and claude-3.7-sonnet achieving the highest average accuracies of 0.615, 0.578, and 0.548, respectively. However, all models struggle with 3D spatial and anagram reasoning tasks, highlighting significant limitations in current VLMs’ general reasoning abilities. In terms of reasoning scores, o4-mini, gemini-2.5-flash, and claude-3.7-sonnet achieved top averages of 0.696, 0.586, and 0.516, respectively. These results highlight inconsistencies between the reasoning processes of the models and their final answers, emphasizing the importance of evaluating the accuracy of the reasoning in addition to the final predictions.
zh

[CV-176] Parameter Efficient Continual Learning with Dynamic Low-Rank Adaptation

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中深度神经网络面临的灾难性遗忘(Catastrophic Forgetting)问题,即在学习新任务时会破坏已巩固的知识。其解决方案的关键在于提出PEARL框架,该框架通过在CL训练过程中对低秩适配器(LoRA)组件进行动态秩分配,以提高模型对新任务的适应能力并减少资源浪费。PEARL利用参考任务权重,并根据当前任务与参考任务权重在参数空间中的接近程度,自适应地确定任务特定LoRA组件的秩。

链接: https://arxiv.org/abs/2505.11998
作者: Prashant Shivaram Bhat,Shakib Yazdani,Elahe Arani,Bahram Zonooz
机构: Eindhoven University of Technology (埃因霍温理工大学); Wayve; Saarland University (萨尔兰大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 5 figures

点击查看摘要

Abstract:Catastrophic forgetting has remained a critical challenge for deep neural networks in Continual Learning (CL) as it undermines consolidated knowledge when learning new tasks. Parameter efficient fine tuning CL techniques are gaining traction for their effectiveness in addressing catastrophic forgetting with a lightweight training schedule while avoiding degradation of consolidated knowledge in pre-trained models. However, low rank adapters (LoRA) in these approaches are highly sensitive to rank selection which can lead to sub-optimal resource allocation and performance. To this end, we introduce PEARL, a rehearsal-free CL framework that entails dynamic rank allocation for LoRA components during CL training. Specifically, PEARL leverages reference task weights and adaptively determines the rank of task-specific LoRA components based on the current tasks’ proximity to reference task weights in parameter space. To demonstrate the versatility of PEARL, we evaluate it across three vision architectures (ResNet, Separable Convolutional Network and Vision Transformer) and a multitude of CL scenarios, and show that PEARL outperforms all considered baselines by a large margin.
zh

[CV-177] Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance

【速读】:该论文旨在解决癌症生存预测中多模态病理-基因组数据融合的挑战,特别是现有研究在利用多实例学习聚合图像切片特征时忽略了病理图像中的上下文和层次细节信息丢失问题,以及病理学与基因组学数据粒度和维度差异导致的模态不平衡问题。论文提出的解决方案关键在于引入超图学习以有效捕捉病理图像的上下文和层次细节,并采用模态再平衡机制和交互对齐融合策略,动态调整两种模态的贡献权重,从而缓解病理学与基因组学之间的不平衡问题。

链接: https://arxiv.org/abs/2505.11997
作者: Mingcheng Qu,Guang Yang,Donglin,Tonghua Su,Yue Gao,Yang Song,Lei Fan
机构: Harbin Institute of Technology (哈尔滨工业大学); Tsinghua University (清华大学); UNSW Sydney (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Multimodal pathology-genomic analysis has become increasingly prominent in cancer survival prediction. However, existing studies mainly utilize multi-instance learning to aggregate patch-level features, neglecting the information loss of contextual and hierarchical details within pathology images. Furthermore, the disparity in data granularity and dimensionality between pathology and genomics leads to a significant modality imbalance. The high spatial resolution inherent in pathology data renders it a dominant role while overshadowing genomics in multimodal integration. In this paper, we propose a multimodal survival prediction framework that incorporates hypergraph learning to effectively capture both contextual and hierarchical details from pathology images. Moreover, it employs a modality rebalance mechanism and an interactive alignment fusion strategy to dynamically reweight the contributions of the two modalities, thereby mitigating the pathology-genomics imbalance. Quantitative and qualitative experiments are conducted on five TCGA datasets, demonstrating that our model outperforms advanced methods by over 3.4% in C-Index performance.
zh

[CV-178] SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

【速读】:该论文试图解决从稀疏或单视角输入中重建逼真三维场景的问题(Novel view synthesis, NVS),现有技术依赖于密集多视角观测,限制了其应用范围。解决方案的关键在于引入SpatialCrafter框架,该框架利用视频扩散模型中的丰富知识生成合理的附加观测,缓解重建歧义,并通过可训练的相机编码器和极线注意力机制实现精确的相机控制与三维一致性,同时采用统一尺度估计策略处理数据集间的尺度差异。此外,通过将单目深度先验与视频潜在空间中的语义特征结合,直接回归三维高斯基元,并通过混合网络结构高效处理长序列特征。

链接: https://arxiv.org/abs/2505.11992
作者: Songchun Zhang,Huiyao Xu,Sitong Guo,Zhongwei Xie,Pengwei Liu,Hujun Bao,Weiwei Xu,Changqing Zou
机构: Zhejiang University (浙江大学); Zhejiang Lab (浙江省实验室); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 16 figures

点击查看摘要

Abstract:Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Existing techniques, though progressed, rely on dense multi-view observations, restricting their application. This work takes on the challenge of reconstructing photorealistic 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy to handle scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features using a hybrid network structure. Extensive experiments show our method enhances sparse view reconstruction and restores the realistic appearance of 3D scenes.
zh

[CV-179] Online Iterative Self-Alignment for Radiology Report Generation ACL2025

【速读】:该论文旨在解决放射学报告生成(Radiology Report Generation, RRG)中因依赖高质量标注数据而导致的过拟合和泛化能力不足的问题。其解决方案的关键在于提出一种新颖的在线迭代自对齐(Online Iterative Self-Alignment, OISA)方法,该方法通过四个阶段实现:自生成多样化数据、自评估多目标偏好数据、自对齐多目标优化以及自迭代进一步提升性能,从而在不依赖大量人工标注数据的情况下,持续优化模型生成报告的质量与多样性。

链接: https://arxiv.org/abs/2505.11983
作者: Ting Xiao,Lei Shi,Yang Zhang,HaoFeng Yang,Zhe Wang,Chenjia Bai
机构: East China University of Science and Technology (华东理工大学); Tsinghua University (清华大学); Institute of Artificial Intelligence (TeleAI) (中国电信人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025 Main

点击查看摘要

Abstract:Radiology Report Generation (RRG) is an important research topic for relieving radiologist’ heavy workload. Existing RRG models mainly rely on supervised fine-tuning (SFT) based on different model architectures using data pairs of radiological images and corresponding radiologist-annotated reports. Recent research has shifted focus to post-training improvements, aligning RRG model outputs with human preferences using reinforcement learning (RL). However, the limited data coverage of high-quality annotated data poses risks of overfitting and generalization. This paper proposes a novel Online Iterative Self-Alignment (OISA) method for RRG that consists of four stages: self-generation of diverse data, self-evaluation for multi-objective preference data,self-alignment for multi-objective optimization and self-iteration for further improvement. Our approach allows for generating varied reports tailored to specific clinical objectives, enhancing the overall performance of the RRG model iteratively. Unlike existing methods, our frame-work significantly increases data quality and optimizes performance through iterative multi-objective optimization. Experimental results demonstrate that our method surpasses previous approaches, achieving state-of-the-art performance across multiple evaluation metrics.
zh

[CV-180] AoP-SAM: Automation of Prompts for Efficient Segmentation AAAI2025

【速读】:该论文试图解决生成式 AI (Generative AI) 在图像分割任务中依赖手动提示(prompt)导致的效率低下和实用性不足的问题,特别是在需要快速提供提示和资源高效的应用场景中。解决方案的关键在于提出一种自动化提示生成方法——AoP-SAM,该方法通过一个轻量且高效的提示预测模型,自动检测图像中的关键实体并确定最佳提示位置,从而消除人工输入,提升SAM(Segment Anything Model)的效率和适用性。此外,引入了基于实例级别的测试时自适应采样与过滤机制,以粗到细的方式生成提示,进一步提高了提示和分割掩码生成的效率与准确性。

链接: https://arxiv.org/abs/2505.11980
作者: Yi Chen,Mu-Young Son,Chuanbo Hua,Joo-Young Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at AAAI 2025

点击查看摘要

Abstract:The Segment Anything Model (SAM) is a powerful foundation model for image segmentation, showing robust zero-shot generalization through prompt engineering. However, relying on manual prompts is impractical for real-world applications, particularly in scenarios where rapid prompt provision and resource efficiency are crucial. In this paper, we propose the Automation of Prompts for SAM (AoP-SAM), a novel approach that learns to generate essential prompts in optimal locations automatically. AoP-SAM enhances SAM’s efficiency and usability by eliminating manual input, making it better suited for real-world tasks. Our approach employs a lightweight yet efficient Prompt Predictor model that detects key entities across images and identifies the optimal regions for placing prompt candidates. This method leverages SAM’s image embeddings, preserving its zero-shot generalization capabilities without requiring fine-tuning. Additionally, we introduce a test-time instance-level Adaptive Sampling and Filtering mechanism that generates prompts in a coarse-to-fine manner. This notably enhances both prompt and mask generation efficiency by reducing computational overhead and minimizing redundant mask refinements. Evaluations of three datasets demonstrate that AoP-SAM substantially improves both prompt generation efficiency and mask generation accuracy, making SAM more effective for automated segmentation tasks.
zh

[CV-181] Advanced Integration of Discrete Line Segments in Digitized PID for Continuous Instrument Connectivity

【速读】:该论文试图解决传统手工绘制和映射管道与仪表图(Piping and Instrumentation Diagrams, PIDs)过程中耗时长、易出错以及依赖领域专家经验的问题。其解决方案的关键在于利用计算机视觉模型检测线条段,并通过合并这些检测到的线段来建立设备与线段之间的连接,从而实现PIDs的数字化,最终可将信息存储于知识图谱中以支持后续的高级分析任务。

链接: https://arxiv.org/abs/2505.11976
作者: Soumya Swarup Prusty,Astha Agarwal,Srinivasan Iyenger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 13 figures

点击查看摘要

Abstract:Piping and Instrumentation Diagrams (PIDs) constitute the foundational blueprint of a plant, depicting the interconnections among process equipment, instrumentation for process control, and the flow of fluids and control signals. In their existing setup, the manual mapping of information from PID sheets holds a significant challenge. This is a time-consuming process, taking around 3-6 months, and is susceptible to errors. It also depends on the expertise of the domain experts and often requires multiple rounds of review. The digitization of PIDs entails merging detected line segments, which is essential for linking various detected instruments, thereby creating a comprehensive digitized PID. This paper focuses on explaining how line segments which are detected using a computer vision model are merged and eventually building the connection between equipment and merged lines. Hence presenting a digitized form of information stating the interconnection between process equipment, instrumentation, flow of fluids and control signals. Eventually, which can be stored in a knowledge graph and that information along with the help of advanced algorithms can be leveraged for tasks like finding optimal routes, detecting system cycles, computing transitive closures, and more.
zh

[CV-182] op-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning

【速读】:该论文旨在解决视觉指令微调(Visual Instruction Tuning)中视觉到语言映射的有效性问题,尤其是现有方法在准确性和效率之间难以平衡的困境。其解决方案的关键在于提出了一种名为Top-Down Compression的新型压缩范式,通过策略性地压缩视觉标记而不损失核心信息,结合可训练的Flash Global Fusion模块,实现特征空间对齐并低成本地感知整体视觉上下文和指令偏好,同时采用局部到单个扫描方式增强视觉建模能力,并引入Visual-Native Selection机制以降低计算开销。

链接: https://arxiv.org/abs/2505.11945
作者: Bonan li,Zicheng Zhang,Songhua Liu,Weihao Yu,Xinchao Wang
机构: UCAS(中国科学院大学); NUS(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Visual instruction tuning aims to enable large language models to comprehend the visual world, with a pivotal challenge lying in establishing an effective vision-to-language projection. However, existing methods often grapple with the intractable trade-off between accuracy and efficiency. In this paper, we present LLaVA-Meteor, a novel approach designed to break this deadlock, equipped with a novel Top-Down Compression paradigm that strategically compresses visual tokens without compromising core information. Specifically, we construct a trainable Flash Global Fusion module based on efficient selective state space operators, which aligns the feature space while enabling each token to perceive holistic visual context and instruction preference at low cost. Furthermore, a local-to-single scanning manner is employed to effectively capture local dependencies, thereby enhancing the model’s capability in vision modeling. To alleviate computational overhead, we explore a Visual-Native Selection mechanism that independently assesses token significance by both the visual and native experts, followed by aggregation to retain the most critical subset. Extensive experiments show that our approach reduces visual tokens by 75–95% while achieving comparable or superior performance across 12 benchmarks, significantly improving efficiency.
zh

[CV-183] SegMan: Interactive Segment-and-Manipulate 3D Gaussians CVPR2025

【速读】:该论文旨在解决3D场景操作中区域控制不精确以及缺乏用户交互反馈的问题,现有方法在控制操作区域和提供实时交互反馈方面存在局限,导致结果不可预测。其解决方案的关键在于提出交互式Segment-and-Manipulate 3D Gaussians(iSegMan),该框架通过简单的2D用户交互实现高效的3D区域控制,核心创新包括Epipolar-guided Interaction Propagation(EIP)和Visibility-based Gaussian Voting(VGV),前者利用极线约束实现交互的跨视图传播,后者通过基于可见性的投票机制避免场景特定训练,从而提升操作效率与灵活性。

链接: https://arxiv.org/abs/2505.11934
作者: Yian Zhao,Wanshi Xu,Ruochong Zheng,Pengchong Qiao,Chang Liu,Jie Chen
机构: Peking University, Shenzhen, China (北京大学,深圳,中国); Pengcheng Laboratory, Shenzhen, China (鹏城实验室,深圳,中国); Peking University Shenzhen Graduate School, China (北京大学深圳研究生院,中国); Tsinghua University, Beijing, China (清华大学,北京,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:The efficient rendering and explicit nature of 3DGS promote the advancement of 3D scene manipulation. However, existing methods typically encounter challenges in controlling the manipulation region and are unable to furnish the user with interactive feedback, which inevitably leads to unexpected results. Intuitively, incorporating interactive 3D segmentation tools can compensate for this deficiency. Nevertheless, existing segmentation frameworks impose a pre-processing step of scene-specific parameter training, which limits the efficiency and flexibility of scene manipulation. To deliver a 3D region control module that is well-suited for scene manipulation with reliable efficiency, we propose interactive Segment-and-Manipulate 3D Gaussians (iSegMan), an interactive segmentation and manipulation framework that only requires simple 2D user interactions in any view. To propagate user interactions to other views, we propose Epipolar-guided Interaction Propagation (EIP), which innovatively exploits epipolar constraint for efficient and robust interaction matching. To avoid scene-specific training to maintain efficiency, we further propose the novel Visibility-based Gaussian Voting (VGV), which obtains 2D segmentations from SAM and models the region extraction as a voting game between 2D Pixels and 3D Gaussians based on Gaussian visibility. Taking advantage of the efficient and precise region control of EIP and VGV, we put forth a Manipulation Toolbox to implement various functions on selected regions, enhancing the controllability, flexibility and practicality of scene manipulation. Extensive results on 3D scene manipulation and segmentation tasks fully demonstrate the significant advantages of iSegMan. Project page is available at this https URL.
zh

[CV-184] SafeVid: Toward Safety Aligned Video Large Multimodal Models

【速读】:该论文旨在解决视频大模型(Video Large Multimodal Models, VLMMs)在动态视频场景中因静态安全对齐失效而导致的安全性问题,即“不匹配泛化”问题。其解决方案的关键在于提出SafeVid框架,通过将文本安全对齐能力迁移至视频领域,利用详细的文本视频描述作为解释性桥梁,实现基于大语言模型(LLM)的规则驱动安全推理。该框架包括三个核心部分:生成SafeVid-350K视频特定安全偏好数据集、使用直接偏好优化(DPO)对VLMM进行定向对齐,以及通过SafeVidBench基准进行全面评估。

链接: https://arxiv.org/abs/2505.11926
作者: Yixu Wang,Jiaxin Song,Yifeng Gao,Xin Wang,Yang Yao,Yan Teng,Xingjun Ma,Yingchun Wang,Yu-Gang Jiang
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Video Large Multimodal Models (VLMMs) rapidly advance, their inherent complexity introduces significant safety challenges, particularly the issue of mismatched generalization where static safety alignments fail to transfer to dynamic video contexts. We introduce SafeVid, a framework designed to instill video-specific safety principles in VLMMs. SafeVid uniquely transfers robust textual safety alignment capabilities to the video domain by employing detailed textual video descriptions as an interpretive bridge, facilitating LLM-based rule-driven safety reasoning. This is achieved through a closed-loop system comprising: 1) generation of SafeVid-350K, a novel 350,000-pair video-specific safety preference dataset; 2) targeted alignment of VLMMs using Direct Preference Optimization (DPO); and 3) comprehensive evaluation via our new SafeVidBench benchmark. Alignment with SafeVid-350K significantly enhances VLMM safety, with models like LLaVA-NeXT-Video demonstrating substantial improvements (e.g., up to 42.39%) on SafeVidBench. SafeVid provides critical resources and a structured approach, demonstrating that leveraging textual descriptions as a conduit for safety reasoning markedly improves the safety alignment of VLMMs. We have made SafeVid-350K dataset (this https URL) publicly available.
zh

[CV-185] DC-Seg: Disentangled Contrastive Learning for Brain Tumor Segmentation with Missing Modalities

【速读】:该论文旨在解决多模态脑图像分割中因部分模态数据缺失而导致的性能下降问题。传统方法通过将多模态信息编码到共享潜在空间中来应对这一挑战,但这种方法未能充分挖掘各模态的独特信息。论文提出的解决方案关键在于DC-Seg(Disentangled Contrastive Learning for Segmentation),该方法通过解耦学习分别获得模态不变的解剖表示和模态特定的表示,利用解剖对比学习和模态对比学习实现特征分离,从而提升模型对模态缺失的鲁棒性。

链接: https://arxiv.org/abs/2505.11921
作者: Haitao Li,Ziyu Li,Yiheng Mao,Zhengyao Ding,Zhengxing Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of brain images typically requires the integration of complementary information from multiple image modalities. However, clinical data for all modalities may not be available for every patient, creating a significant challenge. To address this, previous studies encode multiple modalities into a shared latent space. While somewhat effective, it remains suboptimal, as each modality contains distinct and valuable information. In this study, we propose DC-Seg (Disentangled Contrastive Learning for Segmentation), a new method that explicitly disentangles images into modality-invariant anatomical representation and modality-specific representation, by using anatomical contrastive learning and modality contrastive learning respectively. This solution improves the separation of anatomical and modality-specific features by considering the modality gaps, leading to more robust representations. Furthermore, we introduce a segmentation-based regularizer that enhances the model’s robustness to missing modalities. Extensive experiments on the BraTS 2020 and a private white matter hyperintensity(WMH) segmentation dataset demonstrate that DC-Seg outperforms state-of-the-art methods in handling incomplete multimodal brain tumor segmentation tasks with varying missing modalities, while also demonstrate strong generalizability in WMH segmentation. The code is available at this https URL.
zh

[CV-186] Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning ?

【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在全景视角(omnidirectional view)下的空间推理能力不足的问题。现有研究主要聚焦于标准针孔视角图像,而对全景场景的感知与推理尚未得到充分探索。解决方案的关键在于提出OSR-Bench,这是首个针对全景空间推理设计的基准测试平台,包含超过153,000个基于高保真全景室内场景地图的多样化问答对,并引入负采样策略以评估模型的幻觉和定位鲁棒性。此外,还设计了一个两阶段评估框架,结合旋转不变匹配及规则基础与LLM基础的度量标准,以全面评估认知地图生成与问答准确性。

链接: https://arxiv.org/abs/2505.11907
作者: Zihao Dongfang,Xu Zheng,Ziqiao Weng,Yuanhuiyi Lyu,Danda Pani Paudel,Luc Van Gool,Kailun Yang,Xuming Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: this https URL
zh

[CV-187] GTR: Gaussian Splatting Tracking and Reconstruction of Unknown Objects Based on Appearance and Geometric Complexity

【速读】:该论文旨在解决单目RGBD视频中6-DoF物体跟踪与高质量3D重建的问题,尤其针对具有对称性、复杂几何结构或复杂外观的物体所面临的挑战。其解决方案的关键在于引入一种自适应方法,结合了3D Gaussian Splatting(三维高斯点云渲染)、混合几何/外观跟踪以及关键帧选择,从而实现对多种物体的鲁棒跟踪与精确重建。

链接: https://arxiv.org/abs/2505.11905
作者: Takuya Ikeda,Sergey Zakharov,Muhammad Zubair Irshad,Istvan Balazs Opra,Shun Iwase,Dian Chen,Mark Tjersland,Robert Lee,Alexandre Dilly,Rares Ambrus,Koichi Nishiwaki
机构: Woven by Toyota, Inc. (丰田公司编织部门); Toyota Research Institute (丰田研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: main contains 10 pages, 9 figures. And supplementary material contains 10 pages, 27 figures

点击查看摘要

Abstract:We present a novel method for 6-DoF object tracking and high-quality 3D reconstruction from monocular RGBD video. Existing methods, while achieving impressive results, often struggle with complex objects, particularly those exhibiting symmetry, intricate geometry or complex appearance. To bridge these gaps, we introduce an adaptive method that combines 3D Gaussian Splatting, hybrid geometry/appearance tracking, and key frame selection to achieve robust tracking and accurate reconstructions across a diverse range of objects. Additionally, we present a benchmark covering these challenging object classes, providing high-quality annotations for evaluating both tracking and reconstruction performance. Our approach demonstrates strong capabilities in recovering high-fidelity object meshes, setting a new standard for single-sensor 3D reconstruction in open-world environments.
zh

[CV-188] FiGKD: Fine-Grained Knowledge Distillation via High-Frequency Detail Transfer

【速读】:该论文试图解决知识蒸馏(Knowledge Distillation, KD)在细粒度视觉识别任务中表现不佳的问题,其核心在于传统方法将教师模型的输出logits视为单一信号,未能区分其中对学生模型有益的细微语义决策模式与冗余内容,导致学生模型可能因信息过载而无法准确捕捉教师的决策边界。解决方案的关键在于提出一种基于频率感知的细粒度知识蒸馏(Fine-Grained Knowledge Distillation, FiGKD),通过离散小波变换(Discrete Wavelet Transform, DWT)将模型的logits分解为低频(内容)和高频(细节)成分,并仅选择性地传输编码教师语义决策模式的高频成分,从而提升知识迁移的效率与效果。

链接: https://arxiv.org/abs/2505.11897
作者: Seonghak Kim
机构: Agency for Defense Development (国防发展局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures. This work has been submitted to the Elsevier for possible publication

点击查看摘要

Abstract:Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from a high-capacity teacher model to a smaller student model by aligning their output distributions. However, existing methods often underperform in fine-grained visual recognition tasks, where distinguishing subtle differences between visually similar classes is essential. This performance gap stems from the fact that conventional approaches treat the teacher’s output logits as a single, undifferentiated signal-assuming all contained information is equally beneficial to the student. Consequently, student models may become overloaded with redundant signals and fail to capture the teacher’s nuanced decision boundaries. To address this issue, we propose Fine-Grained Knowledge Distillation (FiGKD), a novel frequency-aware framework that decomposes a model’s logits into low-frequency (content) and high-frequency (detail) components using the discrete wavelet transform (DWT). FiGKD selectively transfers only the high-frequency components, which encode the teacher’s semantic decision patterns, while discarding redundant low-frequency content already conveyed through ground-truth supervision. Our approach is simple, architecture-agnostic, and requires no access to intermediate feature maps. Extensive experiments on CIFAR-100, TinyImageNet, and multiple fine-grained recognition benchmarks show that FiGKD consistently outperforms state-of-the-art logit-based and feature-based distillation methods across a variety of teacher-student configurations. These findings confirm that frequency-aware logit decomposition enables more efficient and effective knowledge transfer, particularly in resource-constrained settings.
zh

[CV-189] Adversarial Robustness for Unified Multi-Modal Encoders via Efficient Calibration

【速读】:该论文试图解决统一多模态编码器在对抗性扰动下的鲁棒性问题(adversarial vulnerability),这一问题在安全敏感应用中至关重要。解决方案的关键在于提出一种高效的对抗校准框架,该框架通过引入仅基于对抗样本训练的模态特定投影头,而不修改预训练编码器或语义中心,从而提升模型在不同模态上的鲁棒性。该方法探索了三种训练目标,并采用正则化策略确保攻击下的模态一致性对齐,实验表明其在保持或提升干净数据零样本和检索性能的同时,显著提高了对抗鲁棒性。

链接: https://arxiv.org/abs/2505.11895
作者: Chih-Ting Liao,Bin Ren,Guofeng Mei,Xu Zheng
机构: Mill Research AI(米勒研究人工智能); HKUST(GZ)(香港科技大学(广州)); University of Pisa(比萨大学); University of Trento(特伦托大学); Fondazione Bruno Kessler(布鲁诺·凯塞尔基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent unified multi-modal encoders align a wide range of modalities into a shared representation space, enabling diverse cross-modal tasks. Despite their impressive capabilities, the robustness of these models under adversarial perturbations remains underexplored, which is a critical concern for safety-sensitive applications. In this work, we present the first comprehensive study of adversarial vulnerability in unified multi-modal encoders. We find that even mild adversarial perturbations lead to substantial performance drops across all modalities. Non-visual inputs, such as audio and point clouds, are especially fragile, while visual inputs like images and videos also degrade significantly. To address this, we propose an efficient adversarial calibration framework that improves robustness across modalities without modifying pretrained encoders or semantic centers, ensuring compatibility with existing foundation models. Our method introduces modality-specific projection heads trained solely on adversarial examples, while keeping the backbone and embeddings frozen. We explore three training objectives: fixed-center cross-entropy, clean-to-adversarial L2 alignment, and clean-adversarial InfoNCE, and we introduce a regularization strategy to ensure modality-consistent alignment under attack. Experiments on six modalities and three Bind-style models show that our method improves adversarial robustness by up to 47.3 percent at epsilon = 4/255, while preserving or even improving clean zero-shot and retrieval performance with less than 1 percent trainable parameters.
zh

[CV-190] Facial Recognition Leverag ing Generative Adversarial Networks

【速读】:该论文试图解决基于深度学习的人脸识别性能对大规模训练数据的高度依赖问题,这一问题在实际应用中往往难以克服。其解决方案的关键在于提出一种基于生成对抗网络(Generative Adversarial Network, GAN)的数据增强方法,该方法包含三个核心贡献:一是引入残差嵌入生成器以缓解梯度消失/爆炸问题,二是采用基于Inception ResNet-V1的FaceNet判别器以提升对抗训练效果,三是构建端到端框架联合优化数据生成与识别性能。

链接: https://arxiv.org/abs/2505.11884
作者: Zhongwen Li,Zongwei Li,Xiaoqi Li
机构: Hainan University (海南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Face recognition performance based on deep learning heavily relies on large-scale training data, which is often difficult to acquire in practical applications. To address this challenge, this paper proposes a GAN-based data augmentation method with three key contributions: (1) a residual-embedded generator to alleviate gradient vanishing/exploding problems, (2) an Inception ResNet-V1 based FaceNet discriminator for improved adversarial training, and (3) an end-to-end framework that jointly optimizes data generation and recognition performance. Experimental results demonstrate that our approach achieves stable training dynamics and significantly improves face recognition accuracy by 12.7% on the LFW benchmark compared to baseline methods, while maintaining good generalization capability with limited training samples.
zh

[CV-191] MINGLE: Mixtures of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging

【速读】:该论文旨在解决持续模型融合(continual model merging)中的关键挑战,即任务间的参数干扰导致的灾难性遗忘以及对动态测试分布适应能力有限的问题。其解决方案的核心是提出MINGLE框架,该框架通过利用当前任务的小规模无标签测试样本进行测试时适应,动态引导融合过程。MINGLE采用由参数高效、低秩专家组成的专家混合架构,以实现高效适应并增强对分布偏移的鲁棒性,同时引入Null-Space Constrained Gating和Adaptive Relaxation Strategy来抑制旧任务的激活并平衡模型的稳定性和适应性。

链接: https://arxiv.org/abs/2505.11883
作者: Zihuan Qiu,Yi Xu,Chiyuan He,Fanman Meng,Linfeng Xu,Qingbo Wu,Hongliang Li
机构: University of Electronic Science and Technology of China (中国电子科技大学); Dalian University of Technology (大连理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual model merging integrates independently fine-tuned models sequentially without access to original training data, providing a scalable and efficient solution to continual learning. However, current methods still face critical challenges, notably parameter interference among tasks and limited adaptability to evolving test distributions. The former causes catastrophic forgetting of integrated tasks, while the latter hinders effective adaptation to new tasks. To address these, we propose MINGLE, a novel framework for test-time continual model merging, which leverages test-time adaptation using a small set of unlabeled test samples from the current task to dynamically guide the merging process. MINGLE employs a mixture-of-experts architecture composed of parameter-efficient, low-rank experts, enabling efficient adaptation and improving robustness to distribution shifts. To mitigate catastrophic forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations. This suppresses activations on old task inputs and preserves model behavior on past tasks. To further balance stability and adaptability, we design an Adaptive Relaxation Strategy, which dynamically adjusts the constraint strength based on interference signals captured during test-time adaptation. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, reduces forgetting significantly, and consistently surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders.
zh

[CV-192] GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder ICML’25

【速读】:该论文旨在解决生成式零样本学习(Generative Zero-Shot Learning, GZSL)中生成性能不足和场景泛化能力有限的问题。现有方法依赖于专家标注的强语义向量生成视觉特征,导致生成效果不佳。其解决方案的关键在于提出一种归纳变分自编码器(Inductive Variational Autoencoder for Generative Zero-Shot Learning, GenZSL),通过从相似已见类别中归纳新类别样本,并利用目标类别名称提取的弱语义向量(如CLIP文本嵌入)进行引导,从而提升生成样本的质量与多样性。此外,GenZSL引入了类多样性促进策略和目标类别引导的信息增强准则,以优化模型性能。

链接: https://arxiv.org/abs/2505.11882
作者: Shiming Chen,Dingjie Fu,Salman Khan,Fahad Shahbaz Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICML’25

点击查看摘要

Abstract:Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7% performance gains and more than 60\times faster training speed on AWA2. Codes are available at this https URL.
zh

[CV-193] Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks

【速读】:该论文试图解决标准残差更新中模块输出直接叠加到输入流导致的特征学习受限问题,这可能使模块未能充分挖掘学习全新特征的能力。解决方案的关键在于引入正交残差更新:将模块输出相对于输入流进行分解,并仅添加与该流正交的分量,从而引导模块主要贡献新的表征方向,提升特征学习的丰富性和训练效率。

链接: https://arxiv.org/abs/2505.11881
作者: Giyeong Oh,Woohyun Cho,Siyeol Kim,Suhwan Choi,Younjae Yu
机构: Yonsei University (延世大学); Maum.AI (Maum.AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, WIP

点击查看摘要

Abstract:Residual connections are pivotal for deep neural networks, enabling greater depth by mitigating vanishing gradients. However, in standard residual updates, the module’s output is directly added to the input stream. This can lead to updates that predominantly reinforce or modulate the existing stream direction, potentially underutilizing the module’s capacity for learning entirely novel features. In this work, we introduce Orthogonal Residual Update: we decompose the module’s output relative to the input stream and add only the component orthogonal to this stream. This design aims to guide modules to contribute primarily new representational directions, fostering richer feature learning while promoting more efficient training. We demonstrate that our orthogonal update strategy improves generalization accuracy and training stability across diverse architectures (ResNetV2, Vision Transformers) and datasets (CIFARs, TinyImageNet, ImageNet-1k), achieving, for instance, a +4.3%p top-1 accuracy gain for ViT-B on ImageNet-1k.
zh

[CV-194] Experimental Study on Automatically Assembling Custom Catering Packages With a 3-DOF Delta Robot Using Deep Learning Methods

【速读】:该论文旨在解决餐饮包装中使用二指夹爪与三自由度Delta并联机器人实现自动化打包的问题,其关键在于应用深度学习方法进行物体检测与分割,并结合几何学方法计算抓取点。研究构建了一个包含1,500张图像的定制数据集,采用YOLOv5进行目标检测,FastSAM进行分割,随后通过分割掩码计算旋转角度并生成包围框,最终利用特征向量提出一种新的几何方法确定抓取点,从而实现高成功率的自主抓取与包装。

链接: https://arxiv.org/abs/2505.11879
作者: Reihaneh Yourdkhani,Arash Tavoosian,Navid Asadi Khomami,Mehdi Tale Masouleh
机构: University of Tehran (德黑兰大学); Electrical and Computer Engineering Department (电气与计算机工程系); Mechanical Engineering Department (机械工程系)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces a pioneering experimental study on the automated packing of a catering package using a two-fingered gripper affixed to a 3-degree-of-freedom Delta parallel robot. A distinctive contribution lies in the application of a deep learning approach to tackle this challenge. A custom dataset, comprising 1,500 images, is meticulously curated for this endeavor, representing a noteworthy initiative as the first dataset focusing on Persian-manufactured products. The study employs the YOLOV5 model for object detection, followed by segmentation using the FastSAM model. Subsequently, rotation angle calculation is facilitated with segmentation masks, and a rotated rectangle encapsulating the object is generated. This rectangle forms the basis for calculating two grasp points using a novel geometrical approach involving eigenvectors. An extensive experimental study validates the proposed model, where all pertinent information is seamlessly transmitted to the 3-DOF Delta parallel robot. The proposed algorithm ensures real-time detection, calibration, and the fully autonomous packing process of a catering package, boasting an impressive over 80% success rate in automatic grasping. This study marks a significant stride in advancing the capabilities of robotic systems for practical applications in packaging automation.
zh

[CV-195] PRS-Med: Position Reasoning Segmentation with Vision-Language Model in Medical Imaging

【速读】:该论文旨在解决医学影像分割中医生通过自然语言交互或需要位置推理时的挑战,即如何准确理解解剖结构与病灶之间的空间关系。解决方案的关键在于提出PRS-Med框架,该框架将视觉-语言模型与分割能力相结合,能够生成精确的分割掩码及对应的空间推理输出。同时,论文引入了MMRS数据集(Multimodal Medical in Positional Reasoning Segmentation),提供多样化的、基于空间的问答对,以弥补医学影像中位置推理数据的不足。

链接: https://arxiv.org/abs/2505.11872
作者: Quoc-Huy Trinh,Minh-Van Nguyen,Jung Peng,Ulas Bagci,Debesh Jha
机构: Aalto University (阿尔托大学); Denmark Technical University (丹麦技术大学); Northwestern University (西北大学); University of South Dakota (南达科他大学); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in prompt-based medical image segmentation have enabled clinicians to identify tumors using simple input like bounding boxes or text prompts. However, existing methods face challenges when doctors need to interact through natural language or when position reasoning is required - understanding spatial relationships between anatomical structures and pathologies. We present PRS-Med, a framework that integrates vision-language models with segmentation capabilities to generate both accurate segmentation masks and corresponding spatial reasoning outputs. Additionally, we introduce the MMRS dataset (Multimodal Medical in Positional Reasoning Segmentation), which provides diverse, spatially-grounded question-answer pairs to address the lack of position reasoning data in medical imaging. PRS-Med demonstrates superior performance across six imaging modalities (CT, MRI, X-ray, ultrasound, endoscopy, RGB), significantly outperforming state-of-the-art methods in both segmentation accuracy and position reasoning. Our approach enables intuitive doctor-system interaction through natural language, facilitating more efficient diagnoses. Our dataset pipeline, model, and codebase will be released to foster further research in spatially-aware multimodal reasoning for medical applications.
zh

[CV-196] MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos

【速读】:该论文旨在解决动态环境中准确分析运动部件及其运动属性的问题,特别是在传统方法依赖密集多视角图像或详细部件级标注的情况下所存在的局限性。其解决方案的关键在于提出一种创新框架,能够从单目视频中以零样本方式分析三维运动,该框架通过结合深度估计、光流分析和点云配准方法构建场景几何并初步分析运动部件及其运动属性,随后利用2D高斯点绘制进行场景表示,并引入端到端的动态场景优化算法,专门针对刚体对象进行优化,从而实现对旋转、平移及复合运动的精确解析,无需任何标注训练数据。

链接: https://arxiv.org/abs/2505.11868
作者: Hongyi Zhou,Xiaogang Wang,Yulan Guo,Kai Xu
机构: National University of Defense Technology (国防科技大学); Southwest University (西南大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurately analyzing the motion parts and their motion attributes in dynamic environments is crucial for advancing key areas such as embodied intelligence. Addressing the limitations of existing methods that rely on dense multi-view images or detailed part-level annotations, we propose an innovative framework that can analyze 3D mobility from monocular videos in a zero-shot manner. This framework can precisely parse motion parts and motion attributes only using a monocular video, completely eliminating the need for annotated training data. Specifically, our method first constructs the scene geometry and roughly analyzes the motion parts and their initial motion attributes combining depth estimation, optical flow analysis and point cloud registration method, then employs 2D Gaussian splatting for scene representation. Building on this, we introduce an end-to-end dynamic scene optimization algorithm specifically designed for articulated objects, refining the initial analysis results to ensure the system can handle ‘rotation’, ‘translation’, and even complex movements (‘rotation+translation’), demonstrating high flexibility and versatility. To validate the robustness and wide applicability of our method, we created a comprehensive dataset comprising both simulated and real-world scenarios. Experimental results show that our framework can effectively analyze articulated object motions in an annotation-free manner, showcasing its significant potential in future embodied intelligence applications.
zh

[CV-197] GLOVER: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation

【速读】:该论文旨在解决从人类演示视频中学习可泛化且可解释的机器人智能的问题,特别是针对可操作性(affordance)知识的迁移挑战。现有问题主要体现在两个方面:一是缺乏大规模且具有精确可操作性标注的数据集,二是对多样化的操作场景中可操作性的探索不足。解决方案的关键在于引入HOVA-500K数据集,这是一个包含500,000张图像、1,726个物体类别和675种动作的大规模可操作性标注数据集,并提出GLOVER++框架,该框架通过全局到局部的可操作性训练策略,有效将人类演示中的可操作性知识迁移到下游的开放词汇推理任务中。

链接: https://arxiv.org/abs/2505.11865
作者: Teli Ma,Jia Zheng,Zifan Wang,Ziyao Gao,Jiaming Zhou,Junwei Liang
机构: HKUST (GZ) (香港科技大学(广州)); HKUST (香港科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning manipulation skills from human demonstration videos offers a promising path toward generalizable and interpretable robotic intelligence-particularly through the lens of actionable affordances. However, transferring such knowledge remains challenging due to: 1) a lack of large-scale datasets with precise affordance annotations, and 2) insufficient exploration of affordances in diverse manipulation contexts. To address these gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset comprising 500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking suite for multi-modal affordance reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local affordance training framework that effectively transfers actionable affordance knowledge from human demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic manipulation tasks. By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as valuable resources for bridging the gap between human demonstrations and robotic manipulation capabilities.
zh

[CV-198] MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

【速读】:该论文旨在解决医学图像序列中细粒度语义对齐与上下文感知推理不足的问题,特别是在多模态大语言模型(MLLMs)中实现精准的病变定位与疾病进展的时间序列跟踪。现有医学视觉定位基准主要关注单图场景,而实际临床应用中常涉及序列图像,因此需要更精确的跨图像语义对齐能力。其解决方案的关键在于提出MedSG-Bench,首个针对医学图像序列定位的基准,包含八种VQA风格任务,分为图像差异定位和图像一致性定位两大类,并构建了大规模指令微调数据集MedSG-188K及专门设计的MedSeq-Grounder模型,以推动医学序列图像细粒度理解的研究。

链接: https://arxiv.org/abs/2505.11852
作者: Jingkun Yue,Siqi Zhang,Zinan Jia,Huihuan Xu,Zongbo Han,Xiaohong Liu,Guangyu Wang
机构: Beijing University of Posts and Telecommunications(北京邮电大学); Tianjin University(天津大学); South China Hospital, Medical School, Shenzhen University(华南医院,医学院,深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks, formulated into two paradigms of the grounding tasks, including 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detection of consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question-answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even the advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. The benchmark, dataset, and model are available at this https URL
zh

[CV-199] ElderFallGuard: Real-Time IoT and Computer Vision-Based Fall Detection System for Elderly Safety

【速读】:该论文旨在解决老年人跌倒带来的严重伤害和独立性丧失问题,提出了一种基于计算机视觉的物联网(IoT)解决方案——ElderFallGuard,用于实时检测跌倒并通知照护者。其关键在于利用MediaPipe进行人体姿态估计,并通过自定义数据集训练机器学习分类器,最终选用随机森林(Random Forest)模型实现高精度的跌倒检测,同时结合特定的检测逻辑与冷却机制,确保及时且有效的警报系统。

链接: https://arxiv.org/abs/2505.11845
作者: Tasrifur Riahi,Md. Azizul Hakim Bappy,Md. Mehedi Islam
机构: Institute of Information and Communicaton Technology, Bangladesh University of Engineering Technology, Dhaka, bangladesh; Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 page, 1 table, 5 figure

点击查看摘要

Abstract:For the elderly population, falls pose a serious and increasing risk of serious injury and loss of independence. In order to overcome this difficulty, we present ElderFallGuard: A Computer Vision Based IoT Solution for Elderly Fall Detection and Notification, a cutting-edge, non-invasive system intended for quick caregiver alerts and real-time fall detection. Our approach leverages the power of computer vision, utilizing MediaPipe for accurate human pose estimation from standard video streams. We developed a custom dataset comprising 7200 samples across 12 distinct human poses to train and evaluate various machine learning classifiers, with Random Forest ultimately selected for its superior performance. ElderFallGuard employs a specific detection logic, identifying a fall when a designated prone pose (“Pose6”) is held for over 3 seconds coupled with a significant drop in motion detected for more than 2 seconds. Upon confirmation, the system instantly dispatches an alert, including a snapshot of the event, to a designated Telegram group via a custom bot, incorporating cooldown logic to prevent notification overload. Rigorous testing on our dataset demonstrated exceptional results, achieving 100% accuracy, precision, recall, and F1-score. ElderFallGuard offers a promising, vision-based IoT solution to enhance elderly safety and provide peace of mind for caregivers through intelligent, timely alerts.
zh

[CV-200] RVTBench: A Benchmark for Visual Reasoning Tasks

【速读】:该论文旨在解决深度学习模型在视觉推理(Visual Reasoning)任务中面临的挑战,尤其是由于缺乏相关基准测试而导致的多步骤推理能力不足问题。现有研究主要集中在视觉分割任务上,而未能全面覆盖多样化的视觉语言推理问题。该论文提出了一种统一的视觉推理任务(Reasoning Visual Tasks, RVTs)框架,扩展了传统视频推理分割任务的范围,并支持多种输出形式。其解决方案的关键在于构建一个基于数字孪生(Digital Twin, DT)表示的自动化RVT基准测试生成管道,通过结构化中间表示连接感知与隐式文本查询生成,从而更准确地捕捉复杂时空关系和多步骤推理链,解决了依赖大型语言模型(LLMs)构建基准时存在的推理复杂度受限问题。

链接: https://arxiv.org/abs/2505.11838
作者: Yiqing Shen,Chenjia Li,Chenxiao Fan,Mathias Unberath
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual reasoning, the capability to interpret visual input in response to implicit text query through multi-step reasoning, remains a challenge for deep learning models due to the lack of relevant benchmarks. Previous work in visual reasoning has primarily focused on reasoning segmentation, where models aim to segment objects based on implicit text queries. This paper introduces reasoning visual tasks (RVTs), a unified formulation that extends beyond traditional video reasoning segmentation to a diverse family of visual language reasoning problems, which can therefore accommodate multiple output formats including bounding boxes, natural language descriptions, and question-answer pairs. Correspondingly, we identify the limitations in current benchmark construction methods that rely solely on large language models (LLMs), which inadequately capture complex spatial-temporal relationships and multi-step reasoning chains in video due to their reliance on token representation, resulting in benchmarks with artificially limited reasoning complexity. To address this limitation, we propose a novel automated RVT benchmark construction pipeline that leverages digital twin (DT) representations as structured intermediaries between perception and the generation of implicit text queries. Based on this method, we construct RVTBench, a RVT benchmark containing 3,896 queries of over 1.2 million tokens across four types of RVT (segmentation, grounding, VQA and summary), three reasoning categories (semantic, spatial, and temporal), and four increasing difficulty levels, derived from 200 video sequences. Finally, we propose RVTagent, an agent framework for RVT that allows for zero-shot generalization across various types of RVT without task-specific fine-tuning.
zh

[CV-201] CoT-Vid: Dynamic Chain-of-Thought Routing with Self Verification for Training-Free Video Reasoning

【速读】:该论文旨在解决当前复杂视频推理研究中的不足,特别是在视频领域中缺乏有效的系统性推理方法。其解决方案的关键在于提出一种无需训练的范式CoT-Vid,该范式通过多阶段复杂推理设计实现性能提升,核心组件包括动态推理路径路由、问题解耦策略和视频自洽性验证,相较于依赖感知能力的现有视频大语言模型(Large Language Models),CoT-Vid通过显式的推理机制取得了显著的性能提升。

链接: https://arxiv.org/abs/2505.11830
作者: Hongbo Jin,Ruyang Liu,Wenhao Zhang,Guibo Luo,Ge Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 7 figures

点击查看摘要

Abstract:System2 reasoning is developing rapidly these days with the emergence of Deep- Thinking Models and chain-of-thought technology, which has become a centralized discussion point in the AI community. However, there is a relative gap in the research on complex video reasoning at present. In this work, we propose CoT-Vid, a novel training-free paradigm for the video domain with a multistage complex reasoning design. Distinguishing from existing video LLMs, which rely heavily on perceptual abilities, it achieved surprising performance gain with explicit reasoning mechanism. The paradigm consists of three main components: dynamic inference path routing, problem decoupling strategy, and video self-consistency verification. In addition, we propose a new standard for categorization of video questions. CoT- Vid showed outstanding results on a wide range of benchmarks, and outperforms its base model by 9.3% on Egochema and 5.6% on VideoEspresso, rivalling or even surpassing larger and proprietary models, such as GPT-4V, GPT-4o and Gemini-1.5-flash. Our codebase will be publicly available soon.
zh

[CV-202] Bootstrapping Diffusion: Diffusion Model Training Leverag ing Partial and Corrupted Data

【速读】:该论文试图解决在训练扩散模型时,如何有效利用部分或损坏数据(如低分辨率图像、短视频、含字幕或水印的视频等)的问题。其解决方案的关键在于将每种互补数据视为常规数据的一种视图,并分别训练每个视图的扩散模型,随后训练一个用于预测残差得分函数(residual score function)的模型。通过引入适当的正则化,该方法能够实现较低的泛化误差,并证明残差得分函数的训练难度与部分数据视图未捕捉到的信号相关性成正比,从而实现了接近一阶最优的数据效率。

链接: https://arxiv.org/abs/2505.11825
作者: Xudong Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 1 figure

点击查看摘要

Abstract:Training diffusion models requires large datasets. However, acquiring large volumes of high-quality data can be challenging, for example, collecting large numbers of high-resolution images and long videos. On the other hand, there are many complementary data that are usually considered corrupted or partial, such as low-resolution images and short videos. Other examples of corrupted data include videos that contain subtitles, watermarks, and logos. In this study, we investigate the theoretical problem of whether the above partial data can be utilized to train conventional diffusion models. Motivated by our theoretical analysis in this study, we propose a straightforward approach of training diffusion models utilizing partial data views, where we consider each form of complementary data as a view of conventional data. Our proposed approach first trains one separate diffusion model for each individual view, and then trains a model for predicting the residual score function. We prove generalization error bounds, which show that the proposed diffusion model training approach can achieve lower generalization errors if proper regularizations are adopted in the residual score function training. In particular, we prove that the difficulty in training the residual score function scales proportionally with the signal correlations not captured by partial data views. Consequently, the proposed approach achieves near first-order optimal data efficiency.
zh

[CV-203] Robust Cross-View Geo-Localization via Content-Viewpoint Disentanglement

【速读】:该论文旨在解决跨视角地理定位(Cross-view geo-localization, CVGL)中的挑战,即在不同视角(如无人机和卫星图像)下匹配同一地理位置图像时,由于外观变化和空间失真导致的特征不一致性问题。现有方法通常假设跨视角图像可通过对比学习在共享特征空间中直接对齐,但这一假设忽略了视角差异引起的固有冲突,导致提取的特征包含不一致信息,影响定位精度。该研究的关键解决方案是提出一种新的CVGL框架CVD,其核心在于通过流形学习视角,将跨视角图像的特征空间建模为由内容和视角信息共同驱动的复合流形,并显式解耦内容与视角因素,通过引入内部视角独立性和外部视角重建约束,确保特征的语义一致性与可迁移性。

链接: https://arxiv.org/abs/2505.11822
作者: Ke Li,Di Wang,Xiaowei Wang,Zhihong Wu,Yiming Zhang,Yifeng Wang,Quan Wang
机构: Xidian University (西安电子科技大学); University of California San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-view geo-localization (CVGL) aims to match images of the same geographic location captured from different perspectives, such as drones and satellites. Despite recent advances, CVGL remains highly challenging due to significant appearance changes and spatial distortions caused by viewpoint variations. Existing methods typically assume that cross-view images can be directly aligned within a shared feature space by maximizing feature similarity through contrastive learning. Nonetheless, this assumption overlooks the inherent conflicts induced by viewpoint discrepancies, resulting in extracted features containing inconsistent information that hinders precise localization. In this study, we take a manifold learning perspective and model the feature space of cross-view images as a composite manifold jointly governed by content and viewpoint information. Building upon this insight, we propose \textbfCVD , a new CVGL framework that explicitly disentangles \textitcontent and \textitviewpoint factors. To promote effective disentanglement, we introduce two constraints: \textit(i) An intra-view independence constraint, which encourages statistical independence between the two factors by minimizing their mutual information. \textit(ii) An inter-view reconstruction constraint that reconstructs each view by cross-combining \textitcontent and \textitviewpoint from paired images, ensuring factor-specific semantics are preserved. As a plug-and-play module, CVD can be seamlessly integrated into existing geo-localization pipelines. Extensive experiments on four benchmarks, i.e., University-1652, SUES-200, CVUSA, and CVACT, demonstrate that CVD consistently improves both localization accuracy and generalization across multiple baselines.
zh

[CV-204] Continuous Subspace Optimization for Continual Learning

【速读】:该论文旨在解决持续学习(continual learning)中因学习新任务而导致的灾难性遗忘问题。现有方法通常通过低秩适应(low-rank adaptation)调整预训练模型,但这种限制优化空间的方式会降低模型的学习能力。论文提出的解决方案关键在于连续子空间优化(Continuous Subspace Optimization, CoSO),其通过梯度奇异值分解动态确定一系列子空间,并在这些子空间中进行模型微调,从而实现记忆高效的优化。此外,为缓解遗忘,每个任务的优化子空间被设置为与历史任务子空间正交,同时维护一个任务特定组件以捕获当前任务的关键更新方向,进而更新历史任务子空间,为后续学习提供基础。

链接: https://arxiv.org/abs/2505.11816
作者: Quan Cheng,Yuanyu Wan,Lingyu Wu,Chenping Hou,Lijun Zhang
机构: Nanjing University(南京大学); Zhejiang University(浙江大学); National University of Defense Technology(国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Continual learning aims to learn multiple tasks sequentially while preserving prior knowledge, but faces the challenge of catastrophic forgetting when acquiring new knowledge. Recently, approaches leveraging pre-trained models have gained increasing popularity to mitigate this issue, due to the strong generalization ability of foundation models. To adjust pre-trained models for new tasks, existing methods usually employ low-rank adaptation, which restricts parameter updates to a fixed low-rank subspace. However, constraining the optimization space inherently compromises the model’s learning capacity, resulting in inferior performance. To address the limitation, we propose Continuous Subspace Optimization for Continual Learning (CoSO) to fine-tune the model in a series of subspaces rather than a single one. These sequential subspaces are dynamically determined through the singular value decomposition of gradients. CoSO updates the model by projecting gradients into these subspaces, ensuring memory-efficient optimization. To mitigate forgetting, the optimization subspaces of each task are set to be orthogonal to the historical task subspace. During task learning, CoSO maintains a task-specific component that captures the critical update directions associated with the current task. Upon completing a task, this component is used to update the historical task subspace, laying the groundwork for subsequent learning. Extensive experiments on multiple datasets demonstrate that CoSO significantly outperforms state-of-the-art methods, especially in challenging scenarios with long task sequences.
zh

[CV-205] UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings

【速读】:该论文试图解决现有视觉-语言模型在处理多样化的模态组合(如文本与图像到文本、文本与图像到文本和图像、文本到文本和图像等)时所面临的对齐问题,这些问题导致模型在推理阶段性能下降。解决方案的关键在于提出一种名为UniMoCo的新架构,其核心是引入一个模态补全模块,该模块能够从文本输入生成视觉特征,从而确保查询和目标的模态完整性;同时,采用一种专门的训练策略,以对齐原始输入和模态补全后的输入的嵌入表示,确保嵌入空间内的一致性。

链接: https://arxiv.org/abs/2505.11815
作者: Jiajun Qin,Yuan Pu,Zhuolun He,Seunggeun Kim,David Z. Pan,Bei Yu
机构: Zhejiang University (浙江大学); The Chinese University of Hong Kong (香港中文大学); ChatEDA Tech; University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current research has explored vision-language models for multi-modal embedding tasks, such as information retrieval, visual grounding, and classification. However, real-world scenarios often involve diverse modality combinations between queries and targets, such as text and image to text, text and image to text and image, and text to text and image. These diverse combinations pose significant challenges for existing models, as they struggle to align all modality combinations within a unified embedding space during training, which degrades performance at inference. To address this limitation, we propose UniMoCo, a novel vision-language model architecture designed for multi-modal embedding tasks. UniMoCo introduces a modality-completion module that generates visual features from textual inputs, ensuring modality completeness for both queries and targets. Additionally, we develop a specialized training strategy to align embeddings from both original and modality-completed inputs, ensuring consistency within the embedding space. This enables the model to robustly handle a wide range of modality combinations across embedding tasks. Experiments show that UniMoCo outperforms previous methods while demonstrating consistent robustness across diverse settings. More importantly, we identify and quantify the inherent bias in conventional approaches caused by imbalance of modality combinations in training data, which can be mitigated through our modality-completion paradigm. The code is available at this https URL.
zh

[CV-206] SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation

【速读】:该论文试图解决领域特定图像分类任务中数据增强难以同时兼顾生成数据的多样性(diversity)、真实性(faithfulness)和标签清晰度(label clarity)的问题,从而导致下游任务性能不佳。其解决方案的关键在于提出一种新颖框架,该框架将多样性、真实性与标签清晰度显式地整合到增强过程中,通过显著性引导的混合方法和微调的扩散模型来保持前景语义、丰富背景多样性并确保标签一致性,同时缓解扩散模型的局限性。

链接: https://arxiv.org/abs/2505.11813
作者: Yixuan Dong,Fang-Yi Su,Jung-Hsien Chiang
机构: National Cheng Kung University (国立成功大学); Harvard Medical School (哈佛医学院); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Data augmentation for domain-specific image classification tasks often struggles to simultaneously address diversity, faithfulness, and label clarity of generated data, leading to suboptimal performance in downstream tasks. While existing generative diffusion model-based methods aim to enhance augmentation, they fail to cohesively tackle these three critical aspects and often overlook intrinsic challenges of diffusion models, such as sensitivity to model characteristics and stochasticity under strong transformations. In this paper, we propose a novel framework that explicitly integrates diversity, faithfulness, and label clarity into the augmentation process. Our approach employs saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency, while mitigating diffusion model limitations. Extensive experiments across fine-grained, long-tail, few-shot, and background robustness tasks demonstrate our method’s superior performance over state-of-the-art approaches.
zh

[CV-207] Image-based Visibility Analysis Replacing Line-of-Sight Simulation: An Urban Landmark Perspective

【速读】:该论文试图解决传统基于视线(Line-of-Sight, LoS)的可视性分析方法在评估城市地标等命名物体可视性时,无法充分捕捉真实世界中上下文和感知维度的问题。解决方案的关键在于引入一种基于图像的可视性分析方法,利用视觉语言模型(Vision Language Model, VLM)在方向缩放的街景图像(Street View Image, SVI)中检测目标物体,从而表征其在对应位置的可视性,并构建异构可视性图以揭示观察者与目标物体之间的复杂交互关系。

链接: https://arxiv.org/abs/2505.11809
作者: Zicheng Fan,Kunihiko Fujiwara,Pengyuan Liu,Fan Zhang,Filip Biljecki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visibility analysis is one of the fundamental analytics methods in urban planning and landscape research, traditionally conducted through computational simulations based on the Line-of-Sight (LoS) principle. However, when assessing the visibility of named urban objects such as landmarks, geometric intersection alone fails to capture the contextual and perceptual dimensions of visibility as experienced in the real world. The study challenges the traditional LoS-based approaches by introducing a new, image-based visibility analysis method. Specifically, a Vision Language Model (VLM) is applied to detect the target object within a direction-zoomed Street View Image (SVI). Successful detection represents the object’s visibility at the corresponding SVI location. Further, a heterogeneous visibility graph is constructed to address the complex interaction between observers and target objects. In the first case study, the method proves its reliability in detecting the visibility of six tall landmark constructions in global cities, with an overall accuracy of 87%. Furthermore, it reveals broader contextual differences when the landmarks are perceived and experienced. In the second case, the proposed visibility graph uncovers the form and strength of connections for multiple landmarks along the River Thames in London, as well as the places where these connections occur. Notably, bridges on the River Thames account for approximately 30% of total connections. Our method complements and enhances traditional LoS-based visibility analysis, and showcases the possibility of revealing the prevalent connection of any visual objects in the urban environment. It opens up new research perspectives for urban planning, heritage conservation, and computational social science.
zh

[CV-208] Are vision language models robust to uncertain inputs?

【速读】:该论文试图解决深度学习模型在面对不确定和模糊输入时的鲁棒性问题(robustness against uncertain and ambiguous inputs)。其解决方案的关键在于通过提示模型在不确定时进行回避(abstain from uncertain predictions)以提升可靠性,特别是在自然图像任务中,这一方法能够显著提高模型的鲁棒性。此外,论文还提出了一种基于描述多样性(caption diversity)的新机制,用于揭示模型内部的不确定性,从而帮助从业者预测模型是否能在无需标注数据的情况下成功回避错误预测。

链接: https://arxiv.org/abs/2505.11804
作者: Xi Wang,Eric Nalisnick
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advancements in large scale vision language models (VLMs, e.g. GPT4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. Testing models using two classic uncertainty quantification tasks, anomaly detection and classification under inherently ambiguous conditions, we find that newer and larger VLMs indeed exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions, often causing them to hallucinate confident responses even when faced with unclear or anomalous inputs. Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from uncertain predictions enables significant reliability gains, achieving near-perfect robustness in several settings. However, for domain-specific tasks such as galaxy morphology classification, a lack of specialized knowledge prevents reliable uncertainty estimation. Finally, we propose a novel mechanism based on caption diversity to reveal a model’s internal uncertainty, enabling practitioners to predict when models will successfully abstain without relying on labeled data.
zh

[CV-209] Self-Learning Hyperspectral and Multispectral Image Fusion via Adaptive Residual Guided Subspace Diffusion Model CVPR

【速读】:该论文旨在解决高光谱与多光谱图像(HSI-MSI)融合中因缺乏大量标注的高光谱数据而导致的监督学习方法难以应用的问题。其解决方案的关键在于提出一种自学习的自适应残差引导子空间扩散模型(ARGS-Diff),该模型仅利用观测到的图像进行训练,无需额外的数据。该方法通过设计两个轻量级的光谱和空间扩散模型分别学习光谱和空间分布,并在反向扩散过程中通过自适应残差引导模块(ARGM)对低维成分进行优化,从而实现高效且高质量的高分辨率高光谱图像重建。

链接: https://arxiv.org/abs/2505.11800
作者: Jian Zhu,He Wang,Yang Xu,Zebin Wu,Zhihui Wei
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: cvpr

点击查看摘要

Abstract:Hyperspectral and multispectral image (HSI-MSI) fusion involves combining a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI) to generate a high-resolution hyperspectral image (HR-HSI). Most deep learning-based methods for HSI-MSI fusion rely on large amounts of hyperspectral data for supervised training, which is often scarce in practical applications. In this paper, we propose a self-learning Adaptive Residual Guided Subspace Diffusion Model (ARGS-Diff), which only utilizes the observed images without any extra training data. Specifically, as the LR-HSI contains spectral information and the HR-MSI contains spatial information, we design two lightweight spectral and spatial diffusion models to separately learn the spectral and spatial distributions from them. Then, we use these two models to reconstruct HR-HSI from two low-dimensional components, i.e, the spectral basis and the reduced coefficient, during the reverse diffusion process. Furthermore, we introduce an Adaptive Residual Guided Module (ARGM), which refines the two components through a residual guided function at each sampling step, thereby stabilizing the sampling process. Extensive experimental results demonstrate that ARGS-Diff outperforms existing state-of-the-art methods in terms of both performance and computational efficiency in the field of HSI-MSI fusion. Code is available at this https URL.
zh

[CV-210] CL-BioGAN: Biologically-Inspired Cross-Domain Continual Learning for Hyperspectral Anomaly Detection

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中记忆稳定性与学习灵活性的矛盾问题,特别是在跨场景高光谱异常检测(HAD)任务中的应用。其解决方案的关键在于提出一种受生物学启发的持续学习生成对抗网络(CL-BioGAN),通过引入生物启发的持续学习损失(CL-Bio Loss)和自注意力生成对抗网络(BioGAN),实现对历史知识的主动遗忘以及回放策略的整合,从而在参数释放与任务间增强方面从贝叶斯视角优化模型性能。

链接: https://arxiv.org/abs/2505.11796
作者: Jianing Wang,Zheng Hua,Wan Zhang,Shengjia Hao,Yuqiong Yao,Maoguo Gong
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Memory stability and learning flexibility in continual learning (CL) is a core challenge for cross-scene Hyperspectral Anomaly Detection (HAD) task. Biological neural networks can actively forget history knowledge that conflicts with the learning of new experiences by regulating learning-triggered synaptic expansion and synaptic convergence. Inspired by this phenomenon, we propose a novel Biologically-Inspired Continual Learning Generative Adversarial Network (CL-BioGAN) for augmenting continuous distribution fitting ability for cross-domain HAD task, where Continual Learning Bio-inspired Loss (CL-Bio Loss) and self-attention Generative Adversarial Network (BioGAN) are incorporated to realize forgetting history knowledge as well as involving replay strategy in the proposed BioGAN. Specifically, a novel Bio-Inspired Loss composed with an Active Forgetting Loss (AF Loss) and a CL loss is designed to realize parameters releasing and enhancing between new task and history tasks from a Bayesian perspective. Meanwhile, BioGAN loss with L2-Norm enhances self-attention (SA) to further balance the stability and flexibility for better fitting background distribution for open scenario HAD (OHAD) tasks. Experiment results underscore that the proposed CL-BioGAN can achieve more robust and satisfying accuracy for cross-domain HAD with fewer parameters and computation cost. This dual contribution not only elevates CL performance but also offers new insights into neural adaptation mechanisms in OHAD task.
zh

[CV-211] CL-CaGAN: Capsule differential adversarial continuous learning for cross-domain hyperspectral anomaly detection

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)中异常检测(Anomaly Detection, AD)任务在跨场景应用时面临的先验信息不足和灾难性遗忘(Catastrophic Forgetting)问题。其关键解决方案是提出一种基于持续学习的胶囊差异生成对抗网络(Continual Learning-based Capsule Differential Generative Adversarial Network, CL-CaGAN),通过改进的胶囊结构结合对抗学习以估计背景分布,同时采用基于聚类的样本回放策略与自蒸馏正则化来缓解灾难性遗忘,并通过可微增强提升生成性能,从而实现跨场景下的高效学习与稳定检测。

链接: https://arxiv.org/abs/2505.11793
作者: Jianing Wang,Siying Guo,Zheng Hua,Runhu Huang,Jinyu Hu,Maoguo Gong
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Anomaly detection (AD) has attracted remarkable attention in hyperspectral image (HSI) processing fields, and most existing deep learning (DL)-based algorithms indicate dramatic potential for detecting anomaly samples through specific training process under current scenario. However, the limited prior information and the catastrophic forgetting problem indicate crucial challenges for existing DL structure in open scenarios cross-domain detection. In order to improve the detection performance, a novel continual learning-based capsule differential generative adversarial network (CL-CaGAN) is proposed to elevate the cross-scenario learning performance for facilitating the real application of DL-based structure in hyperspectral AD (HAD) task. First, a modified capsule structure with adversarial learning network is constructed to estimate the background distribution for surmounting the deficiency of prior information. To mitigate the catastrophic forgetting phenomenon, clustering-based sample replay strategy and a designed extra self-distillation regularization are integrated for merging the history and future knowledge in continual AD task, while the discriminative learning ability from previous detection scenario to current scenario is retained by the elaborately designed structure with continual learning (CL) strategy. In addition, the differentiable enhancement is enforced to augment the generation performance of the training data. This further stabilizes the training process with better convergence and efficiently consolidates the reconstruction ability of background samples. To verify the effectiveness of our proposed CL-CaGAN, we conduct experiments on several real HSIs, and the results indicate that the proposed CL-CaGAN demonstrates higher detection performance and continuous learning capacity for mitigating the catastrophic forgetting under cross-domain scenarios.
zh

[CV-212] Self-NPO: Negative Preference Optimization of Diffusion Models by Simply Learning from Itself without Explicit Preference Annotations

【速读】:该论文试图解决现有偏好优化(Preference Optimization, PO)方法在生成式人工智能(Generative AI)模型中对负面偏好优化(Negative Preference Optimization, NPO)的不足,即现有方法依赖于昂贵且脆弱的显式偏好标注过程,如人工成对标注或奖励模型训练,限制了其在数据稀缺或难以获取领域的实用性。解决方案的关键在于提出Self-NPO,一种仅依赖模型自身进行负向偏好优化的方法,无需人工数据标注或奖励模型训练,同时具备高效性且不需大量数据采样,能够无缝集成到多种扩散模型中,提升生成质量和与人类偏好的对齐度。

链接: https://arxiv.org/abs/2505.11777
作者: Fu-Yun Wang,Keqiang Sun,Yao Teng,Xihui Liu,Jiaming Song,Hongsheng Li
机构: MMLab, CUHK(多媒体实验室,香港中文大学); MMLab, HKU(多媒体实验室,香港大学); Luma AI(卢马人工智能); CPII under InnoHK(创新科技研究院,香港创新科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have demonstrated remarkable success in various visual generation tasks, including image, video, and 3D content generation. Preference optimization (PO) is a prominent and growing area of research that aims to align these models with human preferences. While existing PO methods primarily concentrate on producing favorable outputs, they often overlook the significance of classifier-free guidance (CFG) in mitigating undesirable results. Diffusion-NPO addresses this gap by introducing negative preference optimization (NPO), training models to generate outputs opposite to human preferences and thereby steering them away from unfavorable outcomes. However, prior NPO approaches, including Diffusion-NPO, rely on costly and fragile procedures for obtaining explicit preference annotations (e.g., manual pairwise labeling or reward model training), limiting their practicality in domains where such data are scarce or difficult to acquire. In this work, we introduce Self-NPO, a Negative Preference Optimization approach that learns exclusively from the model itself, thereby eliminating the need for manual data labeling or reward model training. Moreover, our method is highly efficient and does not require exhaustive data sampling. We demonstrate that Self-NPO integrates seamlessly into widely used diffusion models, including SD1.5, SDXL, and CogVideoX, as well as models already optimized for human preferences, consistently enhancing both their generation quality and alignment with human preferences.
zh

[CV-213] chnical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Boosting Off-Road Segmentation via Photometric Distortion and Exponential Moving Averag e ICRA

【速读】:该论文旨在解决非结构化越野环境下的语义分割问题,具体针对GOOSE 2D语义分割挑战。其解决方案的关键在于采用高容量的语义分割流水线,使用FlashInternImage-B作为主干网络并结合UPerNet解码器,通过适应已有的技术而非设计新方法来应对越野场景的独特条件。此外,训练策略中结合了强光度失真增强以模拟户外地形的广泛光照变化,并利用权重的指数移动平均(EMA)以提高模型的泛化能力。

链接: https://arxiv.org/abs/2505.11769
作者: Wonjune Kim,Lae-kyoung Lee,Su-Yong An
机构: Electronics and Telecommunications Research Institute (ETRI) (电子通信研究院); Daegu-Gyeongbuk Research Center, Electronics and Telecommunications Research Institute (ETRI) (大邱庆北研究中心,电子通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Winners of the GOOSE 2D Semantic Segmentation Challenge at the IEEE ICRA Workshop on Field Robotics 2025

点击查看摘要

Abstract:We report on the application of a high-capacity semantic segmentation pipeline to the GOOSE 2D Semantic Segmentation Challenge for unstructured off-road environments. Using a FlashInternImage-B backbone together with a UPerNet decoder, we adapt established techniques, rather than designing new ones, to the distinctive conditions of off-road scenes. Our training recipe couples strong photometric distortion augmentation (to emulate the wide lighting variations of outdoor terrain) with an Exponential Moving Average (EMA) of weights for better generalization. Using only the GOOSE training dataset, we achieve 88.8% mIoU on the validation set.
zh

[CV-214] Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在有限监督和噪声支持样本下的少样本适应问题。其解决方案的关键在于提出PromptFuseNL框架,通过结合预测性提示微调与双分支正负学习,利用任务条件残差、多阶段跨模态协调以及语义硬负样本挖掘来优化类别原型,并引入无监督实例重加权策略以降低不可靠支持样本的影响,同时通过轻量级模块融合视觉和文本线索,实现高效且具有判别性的预测。

链接: https://arxiv.org/abs/2505.11758
作者: Sriram Mandalika
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Few-shot adaptation remains a core challenge for vision-language models (VLMs), especially under limited supervision and noisy support samples. We propose PromptFuseNL, a unified framework that enhances few-shot generalization by combining predictive prompt tuning with dual-branch positive and negative learning. The method refines class prototypes through task-conditioned residuals, multi-stage cross-modal coordination, and semantic hard negative mining. To address label noise, we introduce an unsupervised instance reweighting strategy that downweights unreliable support examples without requiring additional labels or structural changes. PromptFuseNL fuses visual and textual cues through lightweight modules for efficient and discriminative prediction. Evaluated across 15 benchmarks, it consistently surpasses existing prompt- and adapter-based methods in all shot settings while remaining highly efficient, achieving up to 300x faster training and 1000x lower FLOPs compared to full prompt tuning, achieving a new state-of-the-art for robust and scalable few-shot vision-language adaptation.
zh

[CV-215] X-Edit: Detecting and Localizing Edits in Images Altered by Text-Guided Diffusion Models CVPR

【速读】:该论文试图解决基于文本引导的扩散模型对图像进行局部篡改后,难以检测和定位这些细微深度伪造修改的问题。解决方案的关键在于提出一种名为X-Edit的新方法,通过使用预训练的扩散模型对图像进行逆向生成,再将得到的特征输入分割网络,利用通道和空间注意力机制显式预测编辑区域,并通过结合分割损失和相关性损失进行微调,以提高对低频区域的关注并减少高频伪影,从而实现对扩散模型生成的局部篡改区域的准确定位。

链接: https://arxiv.org/abs/2505.11753
作者: Valentina Bazyleva,Nicolo Bonettini,Gaurav Bharaj
机构: Reality Defender(现实防御者)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR (XAI4CV) 2025

点击查看摘要

Abstract:Text-guided diffusion models have significantly advanced image editing, enabling highly realistic and local modifications based on textual prompts. While these developments expand creative possibilities, their malicious use poses substantial challenges for detection of such subtle deepfake edits. To this end, we introduce Explain Edit (X-Edit), a novel method for localizing diffusion-based edits in images. To localize the edits for an image, we invert the image using a pretrained diffusion model, then use these inverted features as input to a segmentation network that explicitly predicts the edited masked regions via channel and spatial attention. Further, we finetune the model using a combined segmentation and relevance loss. The segmentation loss ensures accurate mask prediction by balancing pixel-wise errors and perceptual similarity, while the relevance loss guides the model to focus on low-frequency regions and mitigate high-frequency artifacts, enhancing the localization of subtle edits. To the best of our knowledge, we are the first to address and model the problem of localizing diffusion-based modified regions in images. We additionally contribute a new dataset of paired original and edited images addressing the current lack of resources for this task. Experimental results demonstrate that X-Edit accurately localizes edits in images altered by text-guided diffusion models, outperforming baselines in PSNR and SSIM metrics. This highlights X-Edit’s potential as a robust forensic tool for detecting and pinpointing manipulations introduced by advanced image editing techniques.
zh

[CV-216] Semantically-Aware Game Image Quality Assessment

【速读】:该论文旨在解决视频游戏图形视觉质量评估的问题,特别是在缺乏参考图像的情况下,如何有效检测和量化游戏特有的失真类型,如走样、纹理模糊和几何细节层次(LOD)问题。现有无参考图像和视频质量评估(NR-IQA/VQA)方法因主要针对压缩伪影等常见失真而无法泛化到游戏环境。该研究的关键解决方案是提出一种语义感知的无参考IQA模型,其核心包括通过知识蒸馏训练的游戏失真特征提取器(GDFE),用于检测和量化游戏特定失真,并结合CLIP嵌入实现语义门控,以动态调整特征重要性。该方法在不同图形质量预设下的游戏数据上进行训练,使质量评分与人类感知一致,从而实现了对未知游戏场景的有效质量评估。

链接: https://arxiv.org/abs/2505.11724
作者: Kai Zhu,Vignesh Edithal,Le Zhang,Ilia Blank,Imran Junejo
机构: AMD(超威半导体)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 16 pages, 12 figures

点击查看摘要

Abstract:Assessing the visual quality of video game graphics presents unique challenges due to the absence of reference images and the distinct types of distortions, such as aliasing, texture blur, and geometry level of detail (LOD) issues, which differ from those in natural images or user-generated content. Existing no-reference image and video quality assessment (NR-IQA/VQA) methods fail to generalize to gaming environments as they are primarily designed for distortions like compression artifacts. This study introduces a semantically-aware NR-IQA model tailored to gaming. The model employs a knowledge-distilled Game distortion feature extractor (GDFE) to detect and quantify game-specific distortions, while integrating semantic gating via CLIP embeddings to dynamically weight feature importance based on scene content. Training on gameplay data recorded across graphical quality presets enables the model to produce quality scores that align with human perception. Our results demonstrate that the GDFE, trained through knowledge distillation from binary classifiers, generalizes effectively to intermediate distortion levels unseen during training. Semantic gating further improves contextual relevance and reduces prediction variance. In the absence of in-domain NR-IQA baselines, our model outperforms out-of-domain methods and exhibits robust, monotonic quality trends across unseen games in the same genre. This work establishes a foundation for automated graphical quality assessment in gaming, advancing NR-IQA methods in this domain.
zh

[CV-217] UGoDIT: Unsupervised Group Deep Image Prior Via Transferable Weights

【速读】:该论文旨在解决在低数据条件下,传统生成式AI(Generative AI)模型和基于深度图像先验(Deep Image Prior, DIP)的方法在逆成像问题中的局限性。具体而言,现有方法要么需要大量干净的训练数据,要么在没有干净真值图像的情况下容易出现噪声过拟合且计算成本高。该论文提出了一种名为UGoDIT的无监督组DIP方法,其关键在于通过优化共享编码器和M个解耦解码器来学习可迁移的权重,在测试时利用这些已学习的权重与部分参数优化相结合,以实现对未见过的退化图像的重建,从而在减少数据依赖的同时提升重建质量和收敛速度。

链接: https://arxiv.org/abs/2505.11720
作者: Shijun Liang,Ismail R. Alkhouri,Siddhant Gautam,Qing Qu,Saiprasad Ravishankar
机构: Michigan State University (密歇根州立大学); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Recent advances in data-centric deep generative models have led to significant progress in solving inverse imaging problems. However, these models (e.g., diffusion models (DMs)) typically require large amounts of fully sampled (clean) training data, which is often impractical in medical and scientific settings such as dynamic imaging. On the other hand, training-data-free approaches like the Deep Image Prior (DIP) do not require clean ground-truth images but suffer from noise overfitting and can be computationally expensive as the network parameters need to be optimized for each measurement set independently. Moreover, DIP-based methods often overlook the potential of learning a prior using a small number of sub-sampled measurements (or degraded images) available during training. In this paper, we propose UGoDIT, an Unsupervised Group DIP via Transferable weights, designed for the low-data regime where only a very small number, M, of sub-sampled measurement vectors are available during training. Our method learns a set of transferable weights by optimizing a shared encoder and M disentangled decoders. At test time, we reconstruct the unseen degraded image using a DIP network, where part of the parameters are fixed to the learned weights, while the remaining are optimized to enforce measurement consistency. We evaluate UGoDIT on both medical (multi-coil MRI) and natural (super resolution and non-linear deblurring) image recovery tasks under various settings. Compared to recent standalone DIP methods, UGoDIT provides accelerated convergence and notable improvement in reconstruction quality. Furthermore, our method achieves performance competitive with SOTA DM-based and supervised approaches, despite not requiring large amounts of clean training data. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV) Cite as: arXiv:2505.11720 [cs.CV] (or arXiv:2505.11720v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2505.11720 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-218] EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

【速读】:该论文旨在解决操作任务中模仿学习的样本稀缺问题(imitation learning for manipulation data scarcity problem),这一问题在具身操作领域尤为突出,因为缺乏类似自然语言和2D计算机视觉中的大规模数据集。解决方案的关键在于利用第一视角人类视频作为可扩展的数据源,并通过Apple Vision Pro采集了目前最大且最多样化的具身操作数据集EgoDex,该数据集包含829小时的第一视角视频及同步的3D手部和手指追踪数据,覆盖了194种不同的桌面操作任务,为模仿学习策略的训练与评估提供了基础。

链接: https://arxiv.org/abs/2505.11709
作者: Ryan Hoque,Peide Huang,David J. Yoon,Mouli Sivapurapu,Jian Zhang
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models.
zh

[CV-219] Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration

【速读】:该论文旨在解决扩散变压器(Diffusion Transformers)在视觉生成任务中计算成本过高的问题。现有方法通过共享相似标记的去噪过程来压缩模型,但忽略了扩散模型的去噪先验,导致加速效果不佳且图像质量下降。该研究提出的关键解决方案是:关注扩散过程未注意区域中的特征冗余,并基于“结构-细节”去噪先验分析特征冗余的位置和程度,进而引入SDTM(Structure-Then-Detail Token Merging)方法,通过动态压缩特征冗余、调整压缩比和提示重加权等策略实现高效优化。

链接: https://arxiv.org/abs/2505.11707
作者: Haipeng Fang,Sheng Tang,Juan Cao,Enshuo Zhang,Fan Tang,Tong-Yee Lee
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); National Cheng-Kung University (国立成功大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Comments: 14 pages, 14 figures. Accepted by the Proceedings of the 42nd IEEE/CVF Conference on Computer Vision and Pattern Recognition

点击查看摘要

Abstract:Diffusion transformers have shown exceptional performance in visual generation but incur high computational costs. Token reduction techniques that compress models by sharing the denoising process among similar tokens have been introduced. However, existing approaches neglect the denoising priors of the diffusion models, leading to suboptimal acceleration and diminished image quality. This study proposes a novel concept: attend to prune feature redundancies in areas not attended by the diffusion process. We analyze the location and degree of feature redundancies based on the structure-then-detail denoising priors. Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. Specifically, we design dynamic visual token merging, compression ratio adjusting, and prompt reweighting for different stages. Served in a post-training way, the proposed method can be integrated seamlessly into any DiT architecture. Extensive experiments across various backbones, schedulers, and datasets showcase the superiority of our method, for example, it achieves 1.55 times acceleration with negligible impact on image quality. Project page: this https URL.
zh

[CV-220] LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance

【速读】:该论文试图解决合成数据在监督学习中难以显著提升模型性能的问题,主要原因是现有合成数据集未能准确再现真实数据的分布,缺乏必要的保真度或多样性。论文提出的解决方案关键在于引入一种名为LoFT(LoRA-Fused Training-data Generation with Few-shot Guidance)的新型数据集生成框架,该方法通过在单个真实图像上微调LoRA权重并在推理时进行融合,生成结合真实图像特征的合成图像,从而提升生成数据的多样性和保真度。

链接: https://arxiv.org/abs/2505.11703
作者: Jae Myung Kim,Stephan Alaniz,Cordelia Schmid,Zeynep Akata
机构: University of Tübingen (图宾根大学); Helmholtz Munich (赫尔姆霍兹慕尼黑); Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); LTCI, Télécom Paris, Institut Polytechnique de Paris, France (LTCI,巴黎电信,巴黎综合理工学院,法国); Inria, Ecole normale supérieure, CNRS, PSL Research University (Inria,巴黎高等师范学院,法国国家科学研究中心,PSL研究大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite recent advances in text-to-image generation, using synthetically generated data seldom brings a significant boost in performance for supervised learning. Oftentimes, synthetic datasets do not faithfully recreate the data distribution of real data, i.e., they lack the fidelity or diversity needed for effective downstream model training. While previous work has employed few-shot guidance to address this issue, existing methods still fail to capture and generate features unique to specific real images. In this paper, we introduce a novel dataset generation framework named LoFT, LoRA-Fused Training-data Generation with Few-shot Guidance. Our method fine-tunes LoRA weights on individual real images and fuses them at inference time, producing synthetic images that combine the features of real images for improved diversity and fidelity of generated data. We evaluate the synthetic data produced by LoFT on 10 datasets, using 8 to 64 real images per class as guidance and scaling up to 1000 images per class. Our experiments show that training on LoFT-generated data consistently outperforms other synthetic dataset methods, significantly increasing accuracy as the dataset size increases. Additionally, our analysis demonstrates that LoFT generates datasets with high fidelity and sufficient diversity, which contribute to the performance improvement. The code is available at this https URL.
zh

[CV-221] DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation CVPR2025

【速读】:该论文旨在解决开放词汇语义分割中图像与文本嵌入之间的固有领域差异问题,以及仅依赖深度文本对齐特征导致的浅层特征指导不足问题,这些问题限制了小物体和细粒度特征的检测能力,从而影响分割精度。其解决方案的关键在于提出一种双提示框架(DPSeg),通过结合双提示成本体积生成、成本体积引导解码器和语义引导提示优化策略,利用双提示机制缓解视觉提示生成中的对齐问题,同时引入视觉嵌入以缩小文本与图像嵌入之间的领域差距,并通过浅层特征提供多层级指导。

链接: https://arxiv.org/abs/2505.11676
作者: Ziyu Zhao,Xiaoguang Li,Linjia Shi,Nasrin Imanpour,Song Wang
机构: University of South Carolina (南卡罗来纳大学); Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2025

点击查看摘要

Abstract:Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.
zh

[CV-222] MIRACL-VISION: A Large multilingual visual document retrieval benchmark

【速读】:该论文旨在解决多语言视觉文档检索(multilingual visual document retrieval)评估基准不足的问题,当前的评估基准在语言覆盖范围、问题生成方式和语料库规模方面存在局限。其解决方案的关键在于引入MIRACL-VISION,这是一个涵盖18种语言的多语言视觉文档检索评估基准,基于MIRACL数据集构建,并通过人工密集标注流程确保问题质量。同时,为降低语料库规模以提升计算效率,设计了一种去除“简单负样本”的方法,从而在保持数据集挑战性的同时提高评估的可行性。

链接: https://arxiv.org/abs/2505.11651
作者: Radek Osmulsk,Gabriel de Souza P. Moreira,Ronay Ak,Mengyao Xu,Benedikt Schifferer,Even Oldridge
机构: The Thørväld Group (The Thørväld Group); Institute for Clarity in Documentation (Institute for Clarity in Documentation); Inria Paris-Rocquencourt (Inria Paris-Rocquencourt); Rajiv Gandhi University (Rajiv Gandhi University); Tsinghua University (Tsinghua University); Palmer Research Laboratories (Palmer Research Laboratories); The Kumquat Consortium (The Kumquat Consortium)
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document retrieval is an important task for search and Retrieval-Augmented Generation (RAG) applications. Large Language Models (LLMs) have contributed to improving the accuracy of text-based document retrieval. However, documents with complex layout and visual elements like tables, charts and infographics are not perfectly represented in textual format. Recently, image-based document retrieval pipelines have become popular, which use visual large language models (VLMs) to retrieve relevant page images given a query. Current evaluation benchmarks on visual document retrieval are limited, as they primarily focus only English language, rely on synthetically generated questions and offer a small corpus size. Therefore, we introduce MIRACL-VISION, a multilingual visual document retrieval evaluation benchmark. MIRACL-VISION covers 18 languages, and is an extension of the MIRACL dataset, a popular benchmark to evaluate text-based multilingual retrieval pipelines. MIRACL was built using a human-intensive annotation process to generate high-quality questions. In order to reduce MIRACL-VISION corpus size to make evaluation more compute friendly while keeping the datasets challenging, we have designed a method for eliminating the “easy” negatives from the corpus. We conducted extensive experiments comparing MIRACL-VISION with other benchmarks, using popular public text and image models. We observe a gap in state-of-the-art VLM-based embedding models on multilingual capabilities, with up to 59.7% lower retrieval accuracy than a text-based retrieval models. Even for the English language, the visual models retrieval accuracy is 12.1% lower compared to text-based models. MIRACL-VISION is a challenging, representative, multilingual evaluation benchmark for visual retrieval pipelines and will help the community build robust models for document retrieval.
zh

[CV-223] Urban Representation Learning for Fine-grained Economic Mapping: A Semi-supervised Graph-based Approach

【速读】:该论文旨在解决在数据稀缺场景下,现有经济映射方法对半监督学习的忽视以及缺乏统一多任务框架以实现全面行业经济分析的问题。其解决方案的关键在于提出SemiGTX,一个可解释的半监督图学习框架,该框架通过专用融合编码模块整合多种地理空间数据模态,并引入结合空间自监督与局部掩码监督回归的半信息损失函数,从而生成更具信息量和有效性的区域表示,同时通过多任务学习在统一模型中同步映射第一、第二和第三产业的GDP。

链接: https://arxiv.org/abs/2505.11645
作者: Jinzhou Cao,Xiangxu Wang,Jiashi Chen,Wei Tu,Zhenhui Li,Xindong Yang,Tianhong Zhao,Qingquan Li
机构: Shenzhen Technology University (深圳技术大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in International Society Journal of Photogrammetry and Remote Sensing (ISPRS). 70 pages, 10 Figures, 15 Tables

点击查看摘要

Abstract:Fine-grained economic mapping through urban representation learning has emerged as a crucial tool for evidence-based economic decisions. While existing methods primarily rely on supervised or unsupervised approaches, they often overlook semi-supervised learning in data-scarce scenarios and lack unified multi-task frameworks for comprehensive sectoral economic analysis. To address these gaps, we propose SemiGTX, an explainable semi-supervised graph learning framework for sectoral economic mapping. The framework is designed with dedicated fusion encoding modules for various geospatial data modalities, seamlessly integrating them into a cohesive graph structure. It introduces a semi-information loss function that combines spatial self-supervision with locally masked supervised regression, enabling more informative and effective region representations. Through multi-task learning, SemiGTX concurrently maps GDP across primary, secondary, and tertiary sectors within a unified model. Extensive experiments conducted in the Pearl River Delta region of China demonstrate the model’s superior performance compared to existing methods, achieving R2 scores of 0.93, 0.96, and 0.94 for the primary, secondary and tertiary sectors, respectively. Cross-regional experiments in Beijing and Chengdu further illustrate its generality. Systematic analysis reveals how different data modalities influence model predictions, enhancing explainability while providing valuable insights for regional development planning. This representation learning framework advances regional economic monitoring through diverse urban data integration, providing a robust foundation for precise economic forecasting.
zh

[CV-224] BandRC: Band Shifted Raised Cosine Activated Implicit Neural Representations ICCV2025

【速读】:该论文旨在解决隐式神经表示(Implicit Neural Representations, INRs)在信号表示能力上的局限性,包括谱偏差、对噪声的鲁棒性不足以及同时捕捉局部与全局特征的困难,此外还存在手动参数调优的需求。其解决方案的关键在于引入一种新的激活函数——带移位升余弦激活的隐式神经网络(Band Shifted Raised Cosine Activated Implicit Neural Networks, BandRC),并通过从信号中提取的深度先验知识调整激活函数,以提升信号表示能力。

链接: https://arxiv.org/abs/2505.11640
作者: Pandula Thennakoon,Avishka Ranasinghe,Mario De Silva,Buwaneka Epakanda,Roshan Godaliyadda,Parakrama Ekanayake,Vijitha Herath
机构: University of Peradeniya (佩拉德尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted as a conference paper to ICCV 2025

点击查看摘要

Abstract:In recent years, implicit neural representations(INRs) have gained popularity in the computer vision community. This is mainly due to the strong performance of INRs in many computer vision tasks. These networks can extract a continuous signal representation given a discrete signal representation. In previous studies, it has been repeatedly shown that INR performance has a strong correlation with the activation functions used in its multilayer perceptrons. Although numerous activation functions have been proposed that are competitive with one another, they share some common set of challenges such as spectral bias(Lack of sensitivity to high-frequency content in signals), limited robustness to signal noise and difficulties in simultaneous capturing both local and global features. and furthermore, the requirement for manual parameter tuning. To address these issues, we introduce a novel activation function, Band Shifted Raised Cosine Activated Implicit Neural Networks \textbf(BandRC) tailored to enhance signal representation capacity further. We also incorporate deep prior knowledge extracted from the signal to adjust the activation functions through a task-specific model. Through a mathematical analysis and a series of experiments which include image reconstruction (with a +8.93 dB PSNR improvement over the nearest counterpart), denoising (with a +0.46 dB increase in PSNR), super-resolution (with a +1.03 dB improvement over the nearest State-Of-The-Art (SOTA) method for 6X super-resolution), inpainting, and 3D shape reconstruction we demonstrate the dominance of BandRC over existing state of the art activation functions.
zh

[CV-225] Improved Bag-of-Words Image Retrieval with Geometric Constraints for Ground Texture Localization ICRA2025

【速读】:该论文旨在解决在动态环境中实现高精度、低成本的地面纹理定位问题,特别是在同时定位与地图构建(SLAM)中的全局定位和回环检测任务。其解决方案的关键在于对基于词袋(Bag-of-Words, BoW)图像检索系统的显著改进,包括采用近似k-均值(Approximate k-Means, AKM)词汇表结合软分配机制,并利用地面纹理定位中固有的方向一致性与尺度恒定性约束。通过区分全局定位与回环检测的不同需求,提出了高精度与高速度两种版本的算法,从而提升了定位与回环检测的准确性和召回率。

链接: https://arxiv.org/abs/2505.11620
作者: Aaron Wilhelm,Nils Napp
机构: Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to ICRA 2025

点击查看摘要

Abstract:Ground texture localization using a downward-facing camera offers a low-cost, high-precision localization solution that is robust to dynamic environments and requires no environmental modification. We present a significantly improved bag-of-words (BoW) image retrieval system for ground texture localization, achieving substantially higher accuracy for global localization and higher precision and recall for loop closure detection in SLAM. Our approach leverages an approximate k -means (AKM) vocabulary with soft assignment, and exploits the consistent orientation and constant scale constraints inherent to ground texture localization. Identifying the different needs of global localization vs. loop closure detection for SLAM, we present both high-accuracy and high-speed versions of our algorithm. We test the effect of each of our proposed improvements through an ablation study and demonstrate our method’s effectiveness for both global localization and loop closure detection. With numerous ground texture localization systems already using BoW, our method can readily replace other generic BoW systems in their pipeline and immediately improve their results.
zh

[CV-226] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

【速读】:该论文旨在解决注意力机制在计算效率上的瓶颈问题,特别是其二次时间复杂度带来的性能限制。解决方案的关键在于两个方面的创新:首先,利用Blackwell GPU中的新型FP4 Tensor Cores加速注意力计算,实现了比RTX5090上最快的FlashAttention快5倍的1038 TOPS算力;其次,首次将低比特注意力应用于训练任务,设计了一种适用于前向和反向传播的8-bit注意力机制,实验证明其在微调任务中可实现无损性能,但在预训练任务中收敛速度较慢。

链接: https://arxiv.org/abs/2505.11594
作者: Jintao Zhang,Jia Wei,Pengle Zhang,Xiaoming Xu,Haofeng Huang,Haoxu Wang,Kai Jiang,Jun Zhu,Jianfei Chen
机构: Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
备注:

点击查看摘要

Abstract:The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at this https URL.
zh

[CV-227] Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis

【速读】:该论文试图解决的问题是:在深度学习模型中,更好的性能是否必然意味着更优的内部表示(internal representations)。论文的解决方案关键在于通过对比通过开放式搜索过程进化得到的神经网络与传统随机梯度下降(stochastic gradient descent, SGD)训练的网络,在生成单张图像这一简单任务中的表现,揭示两者在内部表示上的差异。研究发现,SGD训练的网络表现出一种称为断裂纠缠表示(fractured entangled representation, FER)的无序状态,而进化得到的网络则显著减少了FER,接近统一因子化表示(unified factored representation, UFR)。这一结果表明,优化内部表示可能对提升模型的泛化、创造力和持续学习能力具有重要意义。

链接: https://arxiv.org/abs/2505.11581
作者: Akarsh Kumar,Jeff Clune,Joel Lehman,Kenneth O. Stanley
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 43 pages, 25 figures

点击查看摘要

Abstract:Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representations? While the representational optimist assumes it must, this position paper challenges that view. We compare neural networks evolved through an open-ended search process to networks trained via conventional stochastic gradient descent (SGD) on the simple task of generating a single image. This minimal setup offers a unique advantage: each hidden neuron’s full functional behavior can be easily visualized as an image, thus revealing how the network’s output behavior is internally constructed neuron by neuron. The result is striking: while both networks produce the same output behavior, their internal representations differ dramatically. The SGD-trained networks exhibit a form of disorganization that we term fractured entangled representation (FER). Interestingly, the evolved networks largely lack FER, even approaching a unified factored representation (UFR). In large models, FER may be degrading core model capacities like generalization, creativity, and (continual) learning. Therefore, understanding and mitigating FER could be critical to the future of representation learning.
zh

[CV-228] Concept-Guided Interpretability via Neural Chunking

【速读】:该论文试图解决神经网络作为“黑箱”系统难以解释其内部工作机制的问题,其核心挑战在于理解神经网络如何表征和处理信息。解决方案的关键在于提出“反射假说”(Reflection Hypothesis),即神经网络的原始群体活动模式反映了训练数据中的规律性,并通过受认知启发的分块方法(chunking)将高维神经群体动态分解为可解释的单元,这些单元对应于潜在的概念。该方法包括三种互补的技术:离散序列分块(DSC)、群体平均(PA)和无监督分块发现(UCD),能够根据标签可用性和维度进行灵活应用,从而有效提取实体并揭示神经网络内部的计算机制。

链接: https://arxiv.org/abs/2505.11576
作者: Shuchen Wu,Stephan Alaniz,Shyamgopal Karthik,Peter Dayan,Eric Schulz,Zeynep Akata
机构: Allen Institute (艾伦研究所); University of Washington (华盛顿大学); Télécom Paris, Institut Polytechnique de Paris (巴黎高等电信学院,巴黎综合理工学院); Institute of Explainable Machine Learning, Helmholtz Munich (可解释机器学习研究所,慕尼黑亥姆霍兹研究中心); Department of Computational Neuroscience, Max Planck Institute for Biological Cybernetics (计算神经科学系,马克斯·普朗克生物控制论研究所); Institute for Human-Centered AI, Helmholtz Munich (以人为本人工智能研究所,慕尼黑亥姆霍兹研究中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 32 figures. arXiv admin note: text overlap with arXiv:2502.01803

点击查看摘要

Abstract:Neural networks are often black boxes, reflecting the significant challenge of understanding their internal workings. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage cognitively-inspired methods of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract these emerging entities, complementing each other based on label availability and dimensionality. Discrete sequence chunking (DSC) creates a dictionary of entities; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting entities across varying model sizes, ranging from inducing compositionality in RNNs to uncovering recurring neural population states in large models with diverse architectures, and illustrate their advantage over other methods. Throughout, we observe a robust correspondence between the extracted entities and concrete or abstract concepts. Artificially inducing the extracted entities in neural populations effectively alters the network’s generation of associated concepts. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.
zh

[CV-229] Bridging Human Oversight and Black-box Driver Assistance: Vision-Language Models for Predictive Alerting in Lane Keeping Assist Systems

【速读】:该论文试图解决当前车道保持辅助系统(Lane Keeping Assist, LKA)在实际应用中因黑箱特性导致的不可预测故障问题,从而影响驾驶员的预判与信任。解决方案的关键在于提出LKAlert,一个基于视觉语言模型(Vision-Language Model, VLM)的监督警报系统,通过融合车载摄像头视频和CAN数据,并利用可解释模型生成的替代车道分割特征作为自动化引导注意力机制,实现对LKA潜在风险的提前1-3秒预测。此外,LKAlert不仅发出预测性警报,还提供简洁的自然语言解释,以增强驾驶员的情境意识和信任度。

链接: https://arxiv.org/abs/2505.11535
作者: Yuhang Wang,Hao Zhou
机构: University of South Florida (南佛罗里达大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Lane Keeping Assist systems, while increasingly prevalent, often suffer from unpredictable real-world failures, largely due to their opaque, black-box nature, which limits driver anticipation and trust. To bridge the gap between automated assistance and effective human oversight, we present LKAlert, a novel supervisory alert system that leverages VLM to forecast potential LKA risk 1-3 seconds in advance. LKAlert processes dash-cam video and CAN data, integrating surrogate lane segmentation features from a parallel interpretable model as automated guiding attention. Unlike traditional binary classifiers, LKAlert issues both predictive alert and concise natural language explanation, enhancing driver situational awareness and trust. To support the development and evaluation of such systems, we introduce OpenLKA-Alert, the first benchmark dataset designed for predictive and explainable LKA failure warnings. It contains synchronized multimodal inputs and human-authored justifications across annotated temporal windows. We further contribute a generalizable methodological framework for VLM-based black-box behavior prediction, combining surrogate feature guidance with LoRA. This framework enables VLM to reason over structured visual context without altering its vision backbone, making it broadly applicable to other complex, opaque systems requiring interpretable oversight. Empirical results correctly predicts upcoming LKA failures with 69.8% accuracy and a 58.6% F1-score. The system also generates high-quality textual explanations for drivers (71.7 ROUGE-L) and operates efficiently at approximately 2 Hz, confirming its suitability for real-time, in-vehicle use. Our findings establish LKAlert as a practical solution for enhancing the safety and usability of current ADAS and offer a scalable paradigm for applying VLMs to human-centered supervision of black-box automation.
zh

[CV-230] Improving Open-Set Semantic Segmentation in 3D Point Clouds by Conditional Channel Capacity Maximization: Preliminary Results

【速读】:该论文旨在解决点云语义分割中的开放集问题(Open-Set Semantic Segmentation, O3S),即模型在面对训练中未出现的新类别时,难以正确识别或分割这些对象。其解决方案的关键在于提出一种可插拔的框架,通过将分割流程建模为条件马尔可夫链,推导出一种名为条件通道容量最大化(Conditional Channel Capacity Maximization, 3CM)的新正则化项,该方法通过最大化特征与预测之间在每个类别条件下的互信息,促使编码器保留更丰富的、与标签相关的特征,从而提升网络区分和分割未见过类别的能力。

链接: https://arxiv.org/abs/2505.11521
作者: Wang Fang,Shirin Rahimi,Olivia Bennett,Sophie Carter,Mitra Hassani,Xu Lan,Omid Javadi,Lucas Mitchell
机构: University of Toronto(多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Point-cloud semantic segmentation underpins a wide range of critical applications. Although recent deep architectures and large-scale datasets have driven impressive closed-set performance, these models struggle to recognize or properly segment objects outside their training classes. This gap has sparked interest in Open-Set Semantic Segmentation (O3S), where models must both correctly label known categories and detect novel, unseen classes. In this paper, we propose a plug and play framework for O3S. By modeling the segmentation pipeline as a conditional Markov chain, we derive a novel regularizer term dubbed Conditional Channel Capacity Maximization (3CM), that maximizes the mutual information between features and predictions conditioned on each class. When incorporated into standard loss functions, 3CM encourages the encoder to retain richer, label-dependent features, thereby enhancing the network’s ability to distinguish and segment previously unseen categories. Experimental results demonstrate effectiveness of proposed method on detecting unseen objects. We further outline future directions for dynamic open-world adaptation and efficient information-theoretic estimation.
zh

[CV-231] Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition

【速读】:该论文试图解决视频场景识别(video scene recognition)问题,即学习一种高层次的视频表示以对视频中的场景进行分类。现有方法通常仅从视觉或文本信息的时序角度识别场景,忽略了单帧中隐藏的有价值信息,而早期研究则仅从非时序角度识别单独图像的场景。论文认为时序与非时序视角对于该任务均具有重要意义且互为补充,同时外部知识的引入也有助于提升视频理解。解决方案的关键在于提出一种新颖的双流框架,从时序和非时序两个视角建模视频表示,并通过自蒸馏(self-distillation)在端到端过程中融合两者;此外,还设计了一种知识增强的特征融合与标签预测方法,以自然地将知识引入视频场景识别任务。

链接: https://arxiv.org/abs/2401.04354
作者: Xuzheng Yu,Chen Jiang,Wei Zhang,Tian Gan,Linlin Chao,Jianan Zhao,Yuan Cheng,Qingpei Guo,Wei Chu
机构: Shandong University (山东大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:With the explosive growth of video data in real-world applications, a comprehensive representation of videos becomes increasingly important. In this paper, we address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos. Due to the diversity and complexity of video contents in realistic scenarios, this task remains a challenge. Most existing works identify scenes for videos only from visual or textual information in a temporal perspective, ignoring the valuable information hidden in single frames, while several earlier studies only recognize scenes for separate images in a non-temporal perspective. We argue that these two perspectives are both meaningful for this task and complementary to each other, meanwhile, externally introduced knowledge can also promote the comprehension of videos. We propose a novel two-stream framework to model video representations from multiple perspectives, i.e. temporal and non-temporal perspectives, and integrate the two perspectives in an end-to-end manner by self-distillation. Besides, we design a knowledge-enhanced feature fusion and label prediction method that contributes to naturally introducing knowledge into the task of video scene recognition. Experiments conducted on a real-world dataset demonstrate the effectiveness of our proposed method.
zh

[CV-232] Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval ACM-MM2021

【速读】:该论文旨在解决大规模视频检索中段级内容相似性匹配(Segment-level Content-Based Video Retrieval, S-CBVR)的挑战,即如何在保证高时间对齐精度的同时实现高效的计算和低存储消耗。其解决方案的关键在于提出了一种端到端训练的段相似性与对齐网络(Segment Similarity and Alignment Network, SSAN),该网络包含两个核心模块:一种高效的自监督关键帧提取(Self-supervised Keyframe Extraction, SKE)模块,用于减少冗余帧特征;以及一种鲁棒的时间相似模式检测(Similarity Pattern Detection, SPD)模块,用于提升时间对齐的准确性与效率。

链接: https://arxiv.org/abs/2309.11091
作者: Chen Jiang,Kaiming Huang,Sifeng He,Xudong Yang,Wei Zhang,Xiaobo Zhang,Yuan Cheng,Lei Yang,Qing Wang,Furong Xu,Tan Pan,Wei Chu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted by ACM MM 2021

点击查看摘要

Abstract:With the explosive growth of web videos in recent years, large-scale Content-Based Video Retrieval (CBVR) becomes increasingly essential in video filtering, recommendation, and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end time of similar segments in finer granularity, which is beneficial for user browsing efficiency and infringement detection especially in long video scenarios. The challenge of S-CBVR task is how to achieve high temporal alignment accuracy with efficient computation and low storage consumption. In this paper, we propose a Segment Similarity and Alignment Network (SSAN) in dealing with the challenge which is firstly trained end-to-end in S-CBVR. SSAN is based on two newly proposed modules in video retrieval: (1) An efficient Self-supervised Keyframe Extraction (SKE) module to reduce redundant frame features, (2) A robust Similarity Pattern Detection (SPD) module for temporal alignment. In comparison with uniform frame extraction, SKE not only saves feature storage and search time, but also introduces comparable accuracy and limited extra computation time. In terms of temporal alignment, SPD localizes similar segments with higher accuracy and efficiency than existing deep learning methods. Furthermore, we jointly train SSAN with SKE and SPD and achieve an end-to-end improvement. Meanwhile, the two key modules SKE and SPD can also be effectively inserted into other video retrieval pipelines and gain considerable performance improvements. Experimental results on public datasets show that SSAN can obtain higher alignment accuracy while saving storage and online query computational cost compared to existing methods.
zh

[CV-233] GuidedMorph: Two-Stage Deformable Registration for Breast MRI

【速读】:该论文旨在解决乳腺MRI图像在不同时间点配准中的难题,特别是在密集组织和高度非刚性变形下的 anatomical structures 对齐与肿瘤进展跟踪问题。传统配准方法主要关注一般结构的对齐,而忽略了内部细节的精确匹配。其解决方案的关键在于提出一种名为 \textbf{GuidedMorph} 的两阶段配准框架,该框架结合了单尺度网络进行全局结构对齐,并引入基于密集组织信息的运动追踪机制,同时通过双空间变换网络(Dual Spatial Transformer Network, DSTN)融合学习到的变换场,以提高整体配准精度。此外,还提出了一种基于欧几里得距离变换(Euclidean Distance Transform, EDT)的新型变形方法,以准确地对齐密集组织和乳腺掩膜,从而在变形过程中保留细结构细节。

链接: https://arxiv.org/abs/2505.13414
作者: Yaqian Chen,Hanxue Gu,Haoyu Dong,Qihang Li,Yuwen Chen,Nicholas Konz,Lin Li,Maciej A. Mazurowski
机构: Duke University (杜克大学); University of Florida (佛罗里达大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurately registering breast MR images from different time points enables the alignment of anatomical structures and tracking of tumor progression, supporting more effective breast cancer detection, diagnosis, and treatment planning. However, the complexity of dense tissue and its highly non-rigid nature pose challenges for conventional registration methods, which primarily focus on aligning general structures while overlooking intricate internal details. To address this, we propose \textbfGuidedMorph, a novel two-stage registration framework designed to better align dense tissue. In addition to a single-scale network for global structure alignment, we introduce a framework that utilizes dense tissue information to track breast movement. The learned transformation fields are fused by introducing the Dual Spatial Transformer Network (DSTN), improving overall alignment accuracy. A novel warping method based on the Euclidean distance transform (EDT) is also proposed to accurately warp the registered dense tissue and breast masks, preserving fine structural details during deformation. The framework supports paradigms that require external segmentation models and with image data only. It also operates effectively with the VoxelMorph and TransMorph backbones, offering a versatile solution for breast registration. We validate our method on ISPY2 and internal dataset, demonstrating superior performance in dense tissue, overall breast alignment, and breast structural similarity index measure (SSIM), with notable improvements by over 13.01% in dense tissue Dice, 3.13% in breast Dice, and 1.21% in breast SSIM compared to the best learning-based baseline.
zh

[CV-234] Higher fidelity perceptual image and video compression with a latent conditioned residual denoising diffusion model ECCV2024

【速读】:该论文试图解决扩散模型在感知图像压缩中虽能生成高质量图像但相较于传统或学习型压缩方案在保真度(如PSNR)上表现较差的问题。解决方案的关键在于提出一种混合压缩方案,通过引入解码器网络生成初始图像以优化失真指标,并利用条件扩散模型预测残差来进一步提升感知质量,从而在保持与CDC相当的LPIPS和FID指标的同时显著提高PSNR性能。

链接: https://arxiv.org/abs/2505.13152
作者: Jonas Brenig,Radu Timofte
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AIM Workshop 2024 at ECCV 2024

点击查看摘要

Abstract:Denoising diffusion models achieved impressive results on several image generation tasks often outperforming GAN based models. Recently, the generative capabilities of diffusion models have been employed for perceptual image compression, such as in CDC. A major drawback of these diffusion-based methods is that, while producing impressive perceptual quality images they are dropping in fidelity/increasing the distortion to the original uncompressed images when compared with other traditional or learned image compression schemes aiming for fidelity. In this paper, we propose a hybrid compression scheme optimized for perceptual quality, extending the approach of the CDC model with a decoder network in order to reduce the impact on distortion metrics such as PSNR. After using the decoder network to generate an initial image, optimized for distortion, the latent conditioned diffusion model refines the reconstruction for perceptual quality by predicting the residual. On standard benchmarks, we achieve up to +2dB PSNR fidelity improvements while maintaining comparable LPIPS and FID perceptual scores when compared with CDC. Additionally, the approach is easily extensible to video compression, where we achieve similar results.
zh

[CV-235] A generalisable head MRI defacing pipeline: Evaluation on 2566 meningioma scans

【速读】:该论文旨在解决医学影像研究中患者隐私保护与脑部解剖结构保留之间的矛盾问题,具体是通过可靠的MRI去标识化(defacing)技术来实现这一目标。其解决方案的关键在于提出了一种鲁棒且可泛化的去标识化流程,该流程结合了基于图谱的配准(atlas-based registration)与脑部掩码(brain masking),从而在确保患者隐私的同时保持脑组织区域的高质量完整性。

链接: https://arxiv.org/abs/2505.12999
作者: Lorena Garcia-Foncillas Macias(1),Aaron Kujawa(1),Aya Elshalakany(1,2),Jonathan Shapey(1,2),Tom Vercauteren(1) ((1) School of Biomedical Engineering and Imaging Sciences, Kings College London (2) Department of Neurosurgery, Kings College Hospital NHS Foundation Trust)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable MRI defacing techniques to safeguard patient privacy while preserving brain anatomy are critical for research collaboration. Existing methods often struggle with incomplete defacing or degradation of brain tissue regions. We present a robust, generalisable defacing pipeline for high-resolution MRI that integrates atlas-based registration with brain masking. Our method was evaluated on 2,566 heterogeneous clinical scans for meningioma and achieved a 99.92 per cent success rate (2,564/2,566) upon visual inspection. Excellent anatomical preservation is demonstrated with a Dice similarity coefficient of 0.9975 plus or minus 0.0023 between brain masks automatically extracted from the original and defaced volumes. Source code is available at this https URL.
zh

[CV-236] Enhancing Diffusion-Weighted Images (DWI) for Diffusion MRI: Is it Enough without Non-Diffusion-Weighted B=0 Reference?

【速读】:该论文旨在解决高分辨率扩散磁共振成像(dMRI)中由于采集时间和信噪比(SNR)之间的固有权衡而导致的成像难题。传统方法仅优化扩散加权图像(DWIs),而未考虑其与非扩散加权(b=0)参考图像之间的关系,这影响了扩散指标如表观扩散系数(ADC)、分数各向异性(FA)和平均扩散率(MD)的准确性。论文提出的关键解决方案是引入一种新的比例损失(ratio loss),即预测值与真实值的DWI/b=0比值对数之间的均方误差(MSE)损失,以有效减少生成DWIs与b=0图像之间比值的误差,从而提升dMRI超分辨率及扩散指标的保真度。

链接: https://arxiv.org/abs/2505.12978
作者: Yinzhe Wu,Jiahao Huang,Fanwen Wang,Mengze Gao,Congyu Liao,Guang Yang,Kawin Setsompop
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE ISBI 2025

点击查看摘要

Abstract:Diffusion MRI (dMRI) is essential for studying brain microstructure, but high-resolution imaging remains challenging due to the inherent trade-offs between acquisition time and signal-to-noise ratio (SNR). Conventional methods often optimize only the diffusion-weighted images (DWIs) without considering their relationship with the non-diffusion-weighted (b=0) reference images. However, calculating diffusion metrics, such as the apparent diffusion coefficient (ADC) and diffusion tensor with its derived metrics like fractional anisotropy (FA) and mean diffusivity (MD), relies on the ratio between each DWI and the b=0 image, which is crucial for clinical observation and diagnostics. In this study, we demonstrate that solely enhancing DWIs using a conventional pixel-wise mean squared error (MSE) loss is insufficient, as the error in ratio between generated DWIs and b=0 diverges. We propose a novel ratio loss, defined as the MSE loss between the predicted and ground-truth log of DWI/b=0 ratios. Our results show that incorporating the ratio loss significantly improves the convergence of this ratio error, achieving lower ratio MSE and slightly enhancing the peak signal-to-noise ratio (PSNR) of generated DWIs. This leads to improved dMRI super-resolution and better preservation of b=0 ratio-based features for the derivation of diffusion metrics.
zh

[CV-237] Segmentation of temporomandibular joint structures on mri images using neural networks for diagnosis of pathologies

【速读】:该论文旨在解决颞下颌关节(Temporomandibular Joint, TMJ)病理诊断中对关节盘进行精准分割的问题,尤其是在MRI图像分析中的挑战。现有解决方案(如Diagnocat和MandSeg)由于侧重于骨结构,无法有效用于关节盘的分析。研究的关键在于构建一个包含94张图像的原始数据集,并通过数据增强方法扩展数据量,随后训练并比较U-Net、YOLOv8n、YOLOv11n和Roboflow等模型,最终验证了Roboflow模型在TMJ分割任务中的有效性。

链接: https://arxiv.org/abs/2505.12963
作者: Maksim I. Ivanov,Olga E. Mendybaeva,Yuri E. Karyakin,Igor N. Glukhikh,Aleksey V. Lebedev
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 10 figures

点击查看摘要

Abstract:This article explores the use of artificial intelligence for the diagnosis of pathologies of the temporomandibular joint (TMJ), in particular, for the segmentation of the articular disc on MRI images. The relevance of the work is due to the high prevalence of TMJ pathologies, as well as the need to improve the accuracy and speed of diagnosis in medical institutions. During the study, the existing solutions (Diagnocat, MandSeg) were analyzed, which, as a result, are not suitable for studying the articular disc due to the orientation towards bone structures. To solve the problem, an original dataset was collected from 94 images with the classes “temporomandibular joint” and “jaw”. To increase the amount of data, augmentation methods were used. After that, the models of U-Net, YOLOv8n, YOLOv11n and Roboflow neural networks were trained and compared. The evaluation was carried out according to the Dice Score, Precision, Sensitivity, Specificity, and Mean Average Precision metrics. The results confirm the potential of using the Roboflow model for segmentation of the temporomandibular joint. In the future, it is planned to develop an algorithm for measuring the distance between the jaws and determining the position of the articular disc, which will improve the diagnosis of TMJ pathologies.
zh

[CV-238] RetinaLogos: Fine-Grained Synthesis of High-Resolution Retinal Images Through Captions

【速读】:该论文旨在解决眼科领域中高质量标注视网膜影像数据稀缺的问题,这一问题严重制约了机器学习模型的开发与应用。其解决方案的关键在于提出了一种创新的流水线,生成了一个大规模的合成Caption-CFP数据集(RetinaLogos-1400k),该数据集包含140万条样本,并利用大语言模型(LLM)对视网膜病变和关键结构进行描述。此外,基于该数据集,研究者还设计了一种三步训练框架(RetinaLogos),实现了对视网膜图像的细粒度语义控制,能够准确捕捉疾病进展的不同阶段、细微的解剖变异及特定病灶类型。

链接: https://arxiv.org/abs/2505.12887
作者: Junzhi Ning,Cheng Tang,Kaijin Zhou,Diping Song,Lihao Liu,Ming Hu,Wei Li,Yanzhou Su,Tianbing Li,Jiyao Liu,Yejin,Sheng Zhang,Yuanfeng Ji,Junjun He
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The scarcity of high-quality, labelled retinal imaging data, which presents a significant challenge in the development of machine learning models for ophthalmology, hinders progress in the field. To synthesise Colour Fundus Photographs (CFPs), existing methods primarily relying on predefined disease labels face significant limitations. However, current methods remain limited, thus failing to generate images for broader categories with diverse and fine-grained anatomical structures. To overcome these challenges, we first introduce an innovative pipeline that creates a large-scale, synthetic Caption-CFP dataset comprising 1.4 million entries, called RetinaLogos-1400k. Specifically, RetinaLogos-1400k uses large language models (LLMs) to describe retinal conditions and key structures, such as optic disc configuration, vascular distribution, nerve fibre layers, and pathological features. Furthermore, based on this dataset, we employ a novel three-step training framework, called RetinaLogos, which enables fine-grained semantic control over retinal images and accurately captures different stages of disease progression, subtle anatomical variations, and specific lesion types. Extensive experiments demonstrate state-of-the-art performance across multiple datasets, with 62.07% of text-driven synthetic images indistinguishable from real ones by ophthalmologists. Moreover, the synthetic data improves accuracy by 10%-25% in diabetic retinopathy grading and glaucoma detection, thereby providing a scalable solution to augment ophthalmic datasets.
zh

[CV-239] he Gaussian Latent Machine: Efficient Prior and Posterior Sampling for Inverse Problems

【速读】:该论文试图解决从一种包含许多标准先验和后验分布的乘积专家模型中进行采样的问题(product-of-experts-type model)。解决方案的关键在于将该模型提升为一种新的潜在变量模型,称为高斯潜在机(Gaussian latent machine),从而提出一种统一且推广性强的采样方法。该方法在一般情况下可实现高效有效的两块吉布斯采样策略,同时在特定情况下可退化为直接采样算法。

链接: https://arxiv.org/abs/2505.12836
作者: Muhamed Kuric,Martin Zach,Andreas Habring,Michael Unser,Thomas Pock
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We consider the problem of sampling from a product-of-experts-type model that encompasses many standard prior and posterior distributions commonly found in Bayesian imaging. We show that this model can be easily lifted into a novel latent variable model, which we refer to as a Gaussian latent machine. This leads to a general sampling approach that unifies and generalizes many existing sampling algorithms in the literature. Most notably, it yields a highly efficient and effective two-block Gibbs sampling approach in the general case, while also specializing to direct sampling algorithms in particular cases. Finally, we present detailed numerical experiments that demonstrate the efficiency and effectiveness of our proposed sampling approach across a wide range of prior and posterior sampling problems from Bayesian imaging.
zh

[CV-240] FreqSelect: Frequency-Aware fMRI-to-Image Reconstruction

【速读】:该论文旨在解决从功能性磁共振成像(fMRI)数据中重建自然图像这一核心挑战,其主要问题在于视觉刺激的丰富性与fMRI信号的噪声大、分辨率低之间的不匹配。现有两阶段模型虽然结合了深度变分自编码器(VAE)与扩散模型以提升任务性能,但它们对输入的所有空间频率成分一视同仁,导致模型在提取有意义特征的同时需抑制无关噪声,从而限制了效果。该论文提出的解决方案关键在于引入FreqSelect模块,这是一个轻量级、自适应的频段选择机制,能够在编码前动态强调对脑活动最具预测性的频率并抑制无信息频率,从而作为图像特征与自然数据之间的内容感知门控,提升重建质量并增强可解释性。

链接: https://arxiv.org/abs/2505.12552
作者: Junliang Ye,Lei Wang,Md Zakir Hossain
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Research report

点击查看摘要

Abstract:Reconstructing natural images from functional magnetic resonance imaging (fMRI) data remains a core challenge in natural decoding due to the mismatch between the richness of visual stimuli and the noisy, low resolution nature of fMRI signals. While recent two-stage models, combining deep variational autoencoders (VAEs) with diffusion models, have advanced this task, they treat all spatial-frequency components of the input equally. This uniform treatment forces the model to extract meaning features and suppress irrelevant noise simultaneously, limiting its effectiveness. We introduce FreqSelect, a lightweight, adaptive module that selectively filters spatial-frequency bands before encoding. By dynamically emphasizing frequencies that are most predictive of brain activity and suppressing those that are uninformative, FreqSelect acts as a content-aware gate between image features and natural data. It integrates seamlessly into standard very deep VAE-diffusion pipelines and requires no additional supervision. Evaluated on the Natural Scenes dataset, FreqSelect consistently improves reconstruction quality across both low- and high-level metrics. Beyond performance gains, the learned frequency-selection patterns offer interpretable insights into how different visual frequencies are represented in the brain. Our method generalizes across subjects and scenes, and holds promise for extension to other neuroimaging modalities, offering a principled approach to enhancing both decoding accuracy and neuroscientific interpretability.
zh

[CV-241] Mutual Evidential Deep Learning for Medical Image Segmentation

【速读】:该论文旨在解决半监督医学分割中因低质量伪标签导致的模型性能下降问题,特别是现有协同学习框架由于伪标签集成策略的平均性质而无法有效评估不同来源伪标签可靠性的问题。其解决方案的关键在于提出一种互证深度学习(Mutual Evidential Deep Learning, MEDL)框架,从两个方面提升伪标签生成的可靠性:首先,通过不同架构的网络为未标记样本生成互补证据,并采用改进的类感知互证融合策略引导来自不同架构网络的证据预测的可信合成;其次,利用融合证据中的不确定性,设计了一种基于渐近费舍尔信息的互证学习策略,使模型能够优先关注可靠性较高的伪标签样本,并在数据不确定性高的情况下避免对误标类别的过度惩罚。

链接: https://arxiv.org/abs/2505.12418
作者: Yuanpeng He,Yali Bi,Lijian Li,Chi-Man Pun,Wenpin Jiao,Zhi Jin
机构: Peking University (北京大学); Southwest University (西南大学); University of Macau (澳门大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing semi-supervised medical segmentation co-learning frameworks have realized that model performance can be diminished by the biases in model recognition caused by low-quality pseudo-labels. Due to the averaging nature of their pseudo-label integration strategy, they fail to explore the reliability of pseudo-labels from different sources. In this paper, we propose a mutual evidential deep learning (MEDL) framework that offers a potentially viable solution for pseudo-label generation in semi-supervised learning from two perspectives. First, we introduce networks with different architectures to generate complementary evidence for unlabeled samples and adopt an improved class-aware evidential fusion to guide the confident synthesis of evidential predictions sourced from diverse architectural networks. Second, utilizing the uncertainty in the fused evidence, we design an asymptotic Fisher information-based evidential learning strategy. This strategy enables the model to initially focus on unlabeled samples with more reliable pseudo-labels, gradually shifting attention to samples with lower-quality pseudo-labels while avoiding over-penalization of mislabeled classes in high data uncertainty samples. Additionally, for labeled data, we continue to adopt an uncertainty-driven asymptotic learning strategy, gradually guiding the model to focus on challenging voxels. Extensive experiments on five mainstream datasets have demonstrated that MEDL achieves state-of-the-art performance.
zh

[CV-242] Attention-Enhanced U-Net for Accurate Segmentation of COVID-19 Infected Lung Regions in CT Scans

【速读】:该论文旨在解决COVID-19患者肺部感染区域在CT图像中的自动分割问题(automatic segmentation of infected lung regions in COVID-19 CT scans)。其关键解决方案是采用一种改进的U-Net架构,结合注意力机制(attention mechanisms)、数据增强(data augmentation)和后处理技术(postprocessing techniques),从而提高了分割的准确性和鲁棒性。

链接: https://arxiv.org/abs/2505.12298
作者: Amal Lahchim(University of Kragujevac),Lazar Davic(University of Kragujevac)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures, created using Google Colab and PyTorch. Compares segmentation models for COVID-19 CT data

点击查看摘要

Abstract:In this study, we propose a robust methodology for automatic segmentation of infected lung regions in COVID-19 CT scans using convolutional neural networks. The approach is based on a modified U-Net architecture enhanced with attention mechanisms, data augmentation, and postprocessing techniques. It achieved a Dice coefficient of 0.8658 and mean IoU of 0.8316, outperforming other methods. The dataset was sourced from public repositories and augmented for diversity. Results demonstrate superior segmentation performance. Future work includes expanding the dataset, exploring 3D segmentation, and preparing the model for clinical deployment.
zh

[CV-243] OpenPros: A Large-Scale Dataset for Limited View Prostate Ultrasound Computed Tomography

【速读】:该论文旨在解决前列腺癌早期检测中传统经直肠超声(TRUS)方法灵敏度低以及超声计算断层扫描(USCT)在临床实施中面临的技术挑战,尤其是在解剖结构受限的有限视角采集条件下。其解决方案的关键在于构建了OpenPros数据集,这是首个专为有限视角前列腺USCT设计的大规模基准数据集,包含超过280,000对由真实临床MRI/CT扫描和离体超声测量生成的、具有解剖学准确性的三维数字前列腺模型所驱动的二维声速(SOS)幻影及其对应的超声全波形数据,并通过先进的有限差分时域和龙格-库塔声波求解器进行临床现实配置下的仿真。

链接: https://arxiv.org/abs/2505.12261
作者: Hanchen Wang,Yixuan Wu,Yinan Feng,Peng Jin,Shihang Feng,Yiming Mao,James Wiskin,Baris Turkbey,Peter A. Pinto,Bradford J. Wood,Songting Luo,Yinpeng Chen,Emad Boctor,Youzuo Lin
机构: University of North Carolina at Chapel Hill; Johns Hopkins University; The Pennsylvania State University; QT Imaging, Inc.; National Institutes of Health; Iowa State University; Google DeepMind
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prostate cancer is one of the most common and lethal cancers among men, making its early detection critically important. Although ultrasound imaging offers greater accessibility and cost-effectiveness compared to MRI, traditional transrectal ultrasound methods suffer from low sensitivity, especially in detecting anteriorly located tumors. Ultrasound computed tomography provides quantitative tissue characterization, but its clinical implementation faces significant challenges, particularly under anatomically constrained limited-angle acquisition conditions specific to prostate imaging. To address these unmet needs, we introduce OpenPros, the first large-scale benchmark dataset explicitly developed for limited-view prostate USCT. Our dataset includes over 280,000 paired samples of realistic 2D speed-of-sound (SOS) phantoms and corresponding ultrasound full-waveform data, generated from anatomically accurate 3D digital prostate models derived from real clinical MRI/CT scans and ex vivo ultrasound measurements, annotated by medical experts. Simulations are conducted under clinically realistic configurations using advanced finite-difference time-domain and Runge-Kutta acoustic wave solvers, both provided as open-source components. Through comprehensive baseline experiments, we demonstrate that state-of-the-art deep learning methods surpass traditional physics-based approaches in both inference efficiency and reconstruction accuracy. Nevertheless, current deep learning models still fall short of delivering clinically acceptable high-resolution images with sufficient accuracy. By publicly releasing OpenPros, we aim to encourage the development of advanced machine learning algorithms capable of bridging this performance gap and producing clinically usable, high-resolution, and highly accurate prostate ultrasound images. The dataset is publicly accessible at this https URL.
zh

[CV-244] PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning MICCAI2025

【速读】:该论文旨在解决在缺乏大量标注数据的情况下,如何提升视网膜图像分析的泛化能力和诊断性能的问题。传统方法依赖临床报告进行监督学习,但获取此类报告成本高且困难,而患者元数据(metadata)则广泛可用,具有潜在价值。该研究提出PRETI,其关键在于将元数据感知学习与鲁棒的自监督表示学习相结合,引入可学习元数据嵌入(Learnable Metadata Embedding, LME)以动态优化元数据表示,并构建基于患者级别的图像对以增强对非临床变异的鲁棒性;此外,通过视网膜感知自适应掩码(Retina-Aware Adaptive Masking, RAAM)策略,选择性地在视网膜区域内应用掩码并动态调整掩码比例,从而提升视网膜图像表征的质量。

链接: https://arxiv.org/abs/2505.12233
作者: Yeonkyung Lee,Woojung Han,Youngjun Jun,Hyeonmin Kim,Jungkyung Cho,Seong Jae Hwang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI2025 early accept

点击查看摘要

Abstract:Retinal foundation models have significantly advanced retinal image analysis by leveraging self-supervised learning to reduce dependence on labeled data while achieving strong generalization. Many recent approaches enhance retinal image understanding using report supervision, but obtaining clinical reports is often costly and challenging. In contrast, metadata (e.g., age, gender) is widely available and serves as a valuable resource for analyzing disease progression. To effectively incorporate patient-specific information, we propose PRETI, a retinal foundation model that integrates metadata-aware learning with robust self-supervised representation learning. We introduce Learnable Metadata Embedding (LME), which dynamically refines metadata representations. Additionally, we construct patient-level data pairs, associating images from the same individual to improve robustness against non-clinical variations. To further optimize retinal image representation, we propose Retina-Aware Adaptive Masking (RAAM), a strategy that selectively applies masking within the retinal region and dynamically adjusts the masking ratio during training. PRETI captures both global structures and fine-grained pathological details, resulting in superior diagnostic performance. Extensive experiments demonstrate that PRETI achieves state-of-the-art results across diverse diseases and biomarker predictions using in-house and public data, indicating the importance of metadata-guided foundation models in retinal disease analysis. Our code and pretrained model are available at this https URL
zh

[CV-245] CTLformer: A Hybrid Denoising Model Combining Convolutional Layers and Self-Attention for Enhanced CT Image Reconstruction

【速读】:该论文旨在解决低剂量CT(LDCT)图像中显著噪声对图像质量和后续诊断准确性产生的负面影响,特别是针对多尺度特征融合和多样化的噪声分布模式问题。其解决方案的关键在于提出一种名为CTLformer的创新模型,该模型结合了卷积结构与Transformer架构,通过两个核心创新:多尺度注意力机制和动态注意力控制机制,实现了对不同尺度细节与全局结构的有效捕捉,并根据输入图像的噪声特性自适应调整注意力分布,从而提升去噪性能与鲁棒性。

链接: https://arxiv.org/abs/2505.12203
作者: Zhiting Zheng,Shuqi Wu,Wen Ding
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-dose CT (LDCT) images are often accompanied by significant noise, which negatively impacts image quality and subsequent diagnostic accuracy. To address the challenges of multi-scale feature fusion and diverse noise distribution patterns in LDCT denoising, this paper introduces an innovative model, CTLformer, which combines convolutional structures with transformer architecture. Two key innovations are proposed: a multi-scale attention mechanism and a dynamic attention control mechanism. The multi-scale attention mechanism, implemented through the Token2Token mechanism and self-attention interaction modules, effectively captures both fine details and global structures at different scales, enhancing relevant features and suppressing noise. The dynamic attention control mechanism adapts the attention distribution based on the noise characteristics of the input image, focusing on high-noise regions while preserving details in low-noise areas, thereby enhancing robustness and improving denoising performance. Furthermore, CTLformer integrates convolutional layers for efficient feature extraction and uses overlapping inference to mitigate boundary artifacts, further strengthening its denoising capability. Experimental results on the 2016 National Institutes of Health AAPM Mayo Clinic LDCT Challenge dataset demonstrate that CTLformer significantly outperforms existing methods in both denoising performance and model efficiency, greatly improving the quality of LDCT images. The proposed CTLformer not only provides an efficient solution for LDCT denoising but also shows broad potential in medical image analysis, especially for clinical applications dealing with complex noise patterns.
zh

[CV-246] HISTAI: An Open-Source Large-Scale Whole Slide Image Dataset for Computational Pathology

【速读】:该论文试图解决当前公开的Whole Slide Image (WSI)数据集在规模、组织多样性及全面临床元数据方面的不足,这些问题限制了人工智能模型的鲁棒性和泛化能力。解决方案的关键在于引入HISTAI数据集,这是一个大规模、多模态、开源的WSI集合,包含来自多种组织类型的超过60,000张切片,并为每个病例提供了丰富的临床元数据,包括诊断、人口统计信息、详细的病理标注和标准化诊断编码。

链接: https://arxiv.org/abs/2505.12120
作者: Dmitry Nechaev,Alexey Pchelnikov,Ekaterina Ivanova
机构: HistAI
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in Digital Pathology (DP), particularly through artificial intelligence and Foundation Models, have underscored the importance of large-scale, diverse, and richly annotated datasets. Despite their critical role, publicly available Whole Slide Image (WSI) datasets often lack sufficient scale, tissue diversity, and comprehensive clinical metadata, limiting the robustness and generalizability of AI models. In response, we introduce the HISTAI dataset, a large, multimodal, open-access WSI collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding. The dataset aims to fill gaps identified in existing resources, promoting innovation, reproducibility, and the development of clinically relevant computational pathology solutions. The dataset can be accessed at this https URL.
zh

[CV-247] NTIRE 2025 Challenge on Efficient Burst HDR and Restoration: Datasets Methods and Results

【速读】:该论文旨在解决高效多帧高动态范围(HDR)成像与修复问题,通过开发能够在严格效率约束下融合多帧RAW图像的算法来提升图像质量。解决方案的关键在于设计轻量级模型架构,在参数数量少于30百万和计算预算低于4.0万亿浮点运算(FLOPs)的前提下,有效处理噪声和未对齐的多帧RAW数据,从而实现高质量的HDR重建。

链接: https://arxiv.org/abs/2505.12089
作者: Sangmin Lee,Eunpil Park,Angel Canelo,Hyunhee Park,Youngjo Kim,Hyung-Ju Chun,Xin Jin,Chongyi Li,Chun-Le Guo,Radu Timofte,Qi Wu,Tianheng Qiu,Yuchun Dong,Shenglin Ding,Guanghua Pan,Weiyu Zhou,Tao Hu,Yixu Feng,Duwei Dai,Yu Cao,Peng Wu,Wei Dong,Yanning Zhang,Qingsen Yan,Simon J. Larsen,Ruixuan Jiang,Senyan Xu,Xingbo Wang,Xin Lu,Marcos V. Conde,Javier Abad-Hernandez,Alvaro Garcıa-Lara,Daniel Feijoo,Alvaro Garcıa,Zeyu Xiao,Zhuoyuan Li
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper reviews the NTIRE 2025 Efficient Burst HDR and Restoration Challenge, which aims to advance efficient multi-frame high dynamic range (HDR) and restoration techniques. The challenge is based on a novel RAW multi-frame fusion dataset, comprising nine noisy and misaligned RAW frames with various exposure levels per scene. Participants were tasked with developing solutions capable of effectively fusing these frames while adhering to strict efficiency constraints: fewer than 30 million model parameters and a computational budget under 4.0 trillion FLOPs. A total of 217 participants registered, with six teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 43.22 dB, showcasing the potential of novel methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers and practitioners in efficient burst HDR and restoration.
zh

[CV-248] Bayesian Deep Learning Approaches for Uncertainty-Aware Retinal OCT Image Segmentation for Multiple Sclerosis

【速读】:该论文试图解决光学相干断层扫描(Optical Coherence Tomography, OCT)中视网膜层分割的自动化问题,该过程因耗时且易受人为偏差影响而限制了诊断的准确性和可靠性。传统深度学习方法在临床和统计学界的采纳率较低,主要由于缺乏不确定性估计,导致模型可能产生“自信错误”的预测。解决方案的关键在于应用贝叶斯卷积神经网络(Bayesian Convolutional Neural Networks, BCNNs),通过生成不确定性图来识别推理过程中存在记录伪影(如噪声或校准错误)的样本,并提供对重要次级测量值(如层厚度)的不确定性估计,从而提升临床适用性、统计稳健性和分割性能。

链接: https://arxiv.org/abs/2505.12061
作者: Samuel T. M. Ball
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Optical Coherence Tomography (OCT) provides valuable insights in ophthalmology, cardiology, and neurology due to high-resolution, cross-sectional images of the retina. One critical task for ophthalmologists using OCT is delineation of retinal layers within scans. This process is time-consuming and prone to human bias, affecting the accuracy and reliability of diagnoses. Previous efforts to automate delineation using deep learning face challenges in uptake from clinicians and statisticians due to the absence of uncertainty estimation, leading to “confidently wrong” models via hallucinations. In this study, we address these challenges by applying Bayesian convolutional neural networks (BCNNs) to segment an openly available OCT imaging dataset containing 35 human retina OCTs split between healthy controls and patients with multiple sclerosis. Our findings demonstrate that Bayesian models can be used to provide uncertainty maps of the segmentation, which can further be used to identify highly uncertain samples that exhibit recording artefacts such as noise or miscalibration at inference time. Our method also allows for uncertainty-estimation for important secondary measurements such as layer thicknesses, that are medically relevant for patients. We show that these features come in addition to greater performance compared to similar work over all delineations; with an overall Dice score of 95.65%. Our work brings greater clinical applicability, statistical robustness, and performance to retinal OCT segmentation.
zh

[CV-249] Joint Manifold Learning and Optimal Transport for Dynamic Imaging

【速读】:该论文旨在解决动态成像中由于时间序列数据和时间点数量有限而导致难以学习有意义模式的问题。其解决方案的关键在于将图像流形的低维性假设与最优传输(Optimal Transport, OT)正则化相结合,通过引入潜在模型表示来促进图像流形表示、时间序列数据和OT先验之间的一致性。这种方法不仅利用了低维性假设缓解样本稀缺问题,还通过OT先验增强了时间演化图像的建模能力。

链接: https://arxiv.org/abs/2505.11913
作者: Sven Dummer,Puru Vaish,Christoph Brune
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dynamic imaging is critical for understanding and visualizing dynamic biological processes in medicine and cell biology. These applications often encounter the challenge of a limited amount of time series data and time points, which hinders learning meaningful patterns. Regularization methods provide valuable prior knowledge to address this challenge, enabling the extraction of relevant information despite the scarcity of time-series data and time points. In particular, low-dimensionality assumptions on the image manifold address sample scarcity, while time progression models, such as optimal transport (OT), provide priors on image development to mitigate the lack of time points. Existing approaches using low-dimensionality assumptions disregard a temporal prior but leverage information from multiple time series. OT-prior methods, however, incorporate the temporal prior but regularize only individual time series, ignoring information from other time series of the same image modality. In this work, we investigate the effect of integrating a low-dimensionality assumption of the underlying image manifold with an OT regularizer for time-evolving images. In particular, we propose a latent model representation of the underlying image manifold and promote consistency between this representation, the time series data, and the OT prior on the time-evolving images. We discuss the advantages of enriching OT interpolations with latent models and integrating OT priors into latent models.
zh

[CV-250] Bridging the Inter-Domain Gap through Low-Level Features for Cross-Modal Medical Image Segmentation

【速读】:该论文试图解决跨模态医学图像分割中的无监督域适应(Unsupervised Domain Adaptation, UDA)问题。其解决方案的关键在于提出了一种模型无关的UDA框架LowBridge,该框架基于一个简单的观察:跨模态图像在描述相同结构时共享一些相似的低级特征(如边缘)。具体而言,首先训练一个生成模型从边缘特征中恢复源域图像,随后在生成的源域图像上训练分割模型;在测试阶段,将目标域图像的边缘特征输入预训练的生成模型以生成源域风格的目标域图像,再通过预训练的分割网络进行分割。

链接: https://arxiv.org/abs/2505.11909
作者: Pengfei Lyu,Pak-Hei Yeung,Xiaosheng Yu,Jing Xia,Jianning Chi,Chengdong Wu,Jagath C. Rajapakse
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:This paper addresses the task of cross-modal medical image segmentation by exploring unsupervised domain adaptation (UDA) approaches. We propose a model-agnostic UDA framework, LowBridge, which builds on a simple observation that cross-modal images share some similar low-level features (e.g., edges) as they are depicting the same structures. Specifically, we first train a generative model to recover the source images from their edge features, followed by training a segmentation model on the generated source images, separately. At test time, edge features from the target images are input to the pretrained generative model to generate source-style target domain images, which are then segmented using the pretrained segmentation network. Despite its simplicity, extensive experiments on various publicly available datasets demonstrate that \proposed achieves state-of-the-art performance, outperforming eleven existing UDA approaches under different settings. Notably, further ablation studies show that \proposed is agnostic to different types of generative and segmentation models, suggesting its potential to be seamlessly plugged with the most advanced models to achieve even more outstanding results in the future. The code is available at this https URL.
zh

[CV-251] Patient-Specific Autoregressive Models for Organ Motion Prediction in Radiotherapy

【速读】:该论文试图解决放射治疗中由于呼吸和其他生理因素导致的器官运动预测问题,现有方法主要依赖于主成分分析(PCA)进行变形分析,但其对配准质量高度依赖,并难以捕捉周期性时间动态。解决方案的关键在于将器官运动预测重新建模为自回归过程,这一过程在自然语言处理(NLP)中广泛应用,能够基于先前输入预测后续阶段,从而更有效地捕捉患者特定的运动模式。通过使用4D CT扫描数据训练自回归模型,该方法在肺和心脏运动预测上优于现有基准,展示了其在CT图像中捕捉运动动态的有效性。

链接: https://arxiv.org/abs/2505.11832
作者: Yuxiang Lai,Jike Zhong,Vanessa Su,Xiaofeng Yang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Radiotherapy often involves a prolonged treatment period. During this time, patients may experience organ motion due to breathing and other physiological factors. Predicting and modeling this motion before treatment is crucial for ensuring precise radiation delivery. However, existing pre-treatment organ motion prediction methods primarily rely on deformation analysis using principal component analysis (PCA), which is highly dependent on registration quality and struggles to capture periodic temporal dynamics for motion this http URL this paper, we observe that organ motion prediction closely resembles an autoregressive process, a technique widely used in natural language processing (NLP). Autoregressive models predict the next token based on previous inputs, naturally aligning with our objective of predicting future organ motion phases. Building on this insight, we reformulate organ motion prediction as an autoregressive process to better capture patient-specific motion patterns. Specifically, we acquire 4D CT scans for each patient before treatment, with each sequence comprising multiple 3D CT phases. These phases are fed into the autoregressive model to predict future phases based on prior phase motion patterns. We evaluate our method on a real-world test set of 4D CT scans from 50 patients who underwent radiotherapy at our institution and a public dataset containing 4D CT scans from 20 patients (some with multiple scans), totaling over 1,300 3D CT phases. The performance in predicting the motion of the lung and heart surpasses existing benchmarks, demonstrating its effectiveness in capturing motion dynamics from CT images. These results highlight the potential of our method to improve pre-treatment planning in radiotherapy, enabling more precise and adaptive radiation delivery.
zh

[CV-252] MedVKAN: Efficient Feature Extraction with Mamba and KAN for Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中传统卷积神经网络(CNN)感受野有限以及基于Transformer的模型因二次计算复杂度导致的可扩展性问题。其解决方案的关键在于引入状态空间模型Mamba和Kolmogorov-Arnold Network(KAN),利用Mamba的近线性复杂度与长程依赖捕捉能力,以及KAN通过可学习激活函数提升非线性表达能力,构建了MedVKAN模型,其中包含EFC-KAN模块和VKAN模块,以提高特征提取效率与性能。

链接: https://arxiv.org/abs/2505.11797
作者: Hancan Zhu,Jinhao Chen,Guanghua He
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image segmentation relies heavily on convolutional neural networks (CNNs) and Transformer-based models. However, CNNs are constrained by limited receptive fields, while Transformers suffer from scalability challenges due to their quadratic computational complexity. To address these limitations, recent advances have explored alternative architectures. The state-space model Mamba offers near-linear complexity while capturing long-range dependencies, and the Kolmogorov-Arnold Network (KAN) enhances nonlinear expressiveness by replacing fixed activation functions with learnable ones. Building on these strengths, we propose MedVKAN, an efficient feature extraction model integrating Mamba and KAN. Specifically, we introduce the EFC-KAN module, which enhances KAN with convolutional operations to improve local pixel interaction. We further design the VKAN module, integrating Mamba with EFC-KAN as a replacement for Transformer modules, significantly improving feature extraction. Extensive experiments on five public medical image segmentation datasets show that MedVKAN achieves state-of-the-art performance on four datasets and ranks second on the remaining one. These results validate the potential of Mamba and KAN for medical image segmentation while introducing an innovative and computationally efficient feature extraction framework. The code is available at: this https URL.
zh

[CV-253] BrainNetMLP: An Efficient and Effective Baseline for Functional Brain Network Classification

【速读】:该论文试图解决功能性脑网络分类中模型复杂度增加但性能提升不显著的问题,即是否增加模型复杂度必然带来更高的分类准确率。其解决方案的关键在于重新审视最简单的深度学习架构——多层感知机(Multi-Layer Perceptron, MLP),并提出一种纯MLP-based的方法BrainNetMLP,该方法通过引入双分支结构同时捕捉空间连通性和频谱信息,实现精确的时空特征融合,从而在保持高效计算和较少参数的同时达到先进的分类性能。

链接: https://arxiv.org/abs/2505.11538
作者: Jiacheng Hou,Zhenjie Song,Ercan Engin Kuruoglu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: V1.0

点击查看摘要

Abstract:Recent studies have made great progress in functional brain network classification by modeling the brain as a network of Regions of Interest (ROIs) and leveraging their connections to understand brain functionality and diagnose mental disorders. Various deep learning architectures, including Convolutional Neural Networks, Graph Neural Networks, and the recent Transformer, have been developed. However, despite the increasing complexity of these models, the performance gain has not been as salient. This raises a question: Does increasing model complexity necessarily lead to higher classification accuracy? In this paper, we revisit the simplest deep learning architecture, the Multi-Layer Perceptron (MLP), and propose a pure MLP-based method, named BrainNetMLP, for functional brain network classification, which capitalizes on the advantages of MLP, including efficient computation and fewer parameters. Moreover, BrainNetMLP incorporates a dual-branch structure to jointly capture both spatial connectivity and spectral information, enabling precise spatiotemporal feature fusion. We evaluate our proposed BrainNetMLP on two public and popular brain network classification datasets, the Human Connectome Project (HCP) and the Autism Brain Imaging Data Exchange (ABIDE). Experimental results demonstrate pure MLP-based methods can achieve state-of-the-art performance, revealing the potential of MLP-based models as more efficient yet effective alternatives in functional brain network classification. The code will be available at this https URL.
zh

[CV-254] Deep Unrolled Meta-Learning for Multi-Coil and Multi-Modality MRI with Adaptive Optimization

【速读】:该论文旨在解决加速磁共振成像(MRI)中多线圈重建与跨模态合成的问题,特别是针对欠采样数据和缺失模态的挑战。其解决方案的关键在于提出一种统一的深度元学习框架,将可证明收敛的优化算法展开为结构化的神经网络架构,每一阶段模拟带有外推的自适应前向-后向方案,从而在合理框架下融合数据保真度与非凸正则化。此外,通过集成元学习,模型能够利用任务特定的元知识快速适应未见过的采样模式和模态组合,提升泛化能力。

链接: https://arxiv.org/abs/2505.11518
作者: Merham Fouladvand,Peuroly Batra
机构: 未知
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a unified deep meta-learning framework for accelerated magnetic resonance imaging (MRI) that jointly addresses multi-coil reconstruction and cross-modality synthesis. Motivated by the limitations of conventional methods in handling undersampled data and missing modalities, our approach unrolls a provably convergent optimization algorithm into a structured neural network architecture. Each phase of the network mimics a step of an adaptive forward-backward scheme with extrapolation, enabling the model to incorporate both data fidelity and nonconvex regularization in a principled manner. To enhance generalization across different acquisition settings, we integrate meta-learning, which enables the model to rapidly adapt to unseen sampling patterns and modality combinations using task-specific meta-knowledge. The proposed method is evaluated on the open source datasets, showing significant improvements in PSNR and SSIM over conventional supervised learning, especially under aggressive undersampling and domain shifts. Our results demonstrate the synergy of unrolled optimization, task-aware meta-learning, and modality fusion, offering a scalable and generalizable solution for real-world clinical MRI reconstruction.
zh

人工智能

[AI-0] Learnware of Language Models: Specialized Small Language Models Can Do Big

【速读】:该论文试图解决在专业领域中,由于数据稀缺、隐私问题和高计算成本导致大型语言模型(Large Language Models, LLMs)性能受限的问题,同时探索如何有效利用专门的小型语言模型(Specialized Small Language Models, SLMs)。解决方案的关键在于引入学习软件(Learnware)范式,通过模型规格(Specifications)来识别和复用最适合特定任务的SLMs,从而在保护用户数据隐私的前提下实现模型的协同使用与最大化利用。

链接: https://arxiv.org/abs/2505.13425
作者: Zhi-Hao Tan,Zi-Chen Zhao,Hao-Yu Shi,Xin-Yu Zhang,Peng Tan,Yang Yu,Zhi-Hua Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The learnware paradigm offers a novel approach to machine learning by enabling users to reuse a set of well-trained models for tasks beyond the models’ original purposes. It eliminates the need to build models from scratch, instead relying on specifications (representations of a model’s capabilities) to identify and leverage the most suitable models for new tasks. While learnware has proven effective in many scenarios, its application to language models has remained largely unexplored. At the same time, large language models (LLMs) have demonstrated remarkable universal question-answering abilities, yet they face challenges in specialized scenarios due to data scarcity, privacy concerns, and high computational costs, thus more and more specialized small language models (SLMs) are being trained for specific domains. To address these limitations systematically, the learnware paradigm provides a promising solution by enabling maximum utilization of specialized SLMs, and allowing users to identify and reuse them in a collaborative and privacy-preserving manner. This paper presents a preliminary attempt to apply the learnware paradigm to language models. We simulated a learnware system comprising approximately 100 learnwares of specialized SLMs with 8B parameters, fine-tuned across finance, healthcare, and mathematics domains. Each learnware contains an SLM and a specification, which enables users to identify the most relevant models without exposing their own data. Experimental results demonstrate promising performance: by selecting one suitable learnware for each task-specific inference, the system outperforms the base SLMs on all benchmarks. Compared to LLMs, the system outperforms Qwen1.5-110B, Qwen2.5-72B, and Llama3.1-70B-Instruct by at least 14% in finance domain tasks, and surpasses Flan-PaLM-540B (ranked 7th on the Open Medical LLM Leaderboard) in medical domain tasks. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.13425 [cs.LG] (or arXiv:2505.13425v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.13425 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-1] AutoMathKG: The automated mathematical knowledge graph based on LLM and vector database

【速读】:该论文旨在解决数学知识图谱(Mathematical Knowledge Graph, KG)构建过程中存在的两个主要问题:一是现有工作受限于语料库的完整性,常需舍弃或人工补充不完整知识;二是难以实现多种知识源的全自动整合。其解决方案的关键在于提出AutoMathKG,该系统将数学视为由定义(Definition)、定理(Theorem)和问题(Problem)实体构成的有向图,并通过大语言模型(LLM)结合上下文学习进行数据增强,以提升实体与关系的质量。同时,引入MathVD向量数据库和两种自动更新机制——知识补全机制与知识融合机制,实现了数学知识图谱的高质量、广覆盖与自动更新能力。

链接: https://arxiv.org/abs/2505.13406
作者: Rong Bian,Yu Geng,Zijian Yang,Bing Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A mathematical knowledge graph (KG) presents knowledge within the field of mathematics in a structured manner. Constructing a math KG using natural language is an essential but challenging task. There are two major limitations of existing works: first, they are constrained by corpus completeness, often discarding or manually supplementing incomplete knowledge; second, they typically fail to fully automate the integration of diverse knowledge sources. This paper proposes AutoMathKG, a high-quality, wide-coverage, and multi-dimensional math KG capable of automatic updates. AutoMathKG regards mathematics as a vast directed graph composed of Definition, Theorem, and Problem entities, with their reference relationships as edges. It integrates knowledge from ProofWiki, textbooks, arXiv papers, and TheoremQA, enhancing entities and relationships with large language models (LLMs) via in-context learning for data augmentation. To search for similar entities, MathVD, a vector database, is built through two designed embedding strategies using SBERT. To automatically update, two mechanisms are proposed. For knowledge completion mechanism, Math LLM is developed to interact with AutoMathKG, providing missing proofs or solutions. For knowledge fusion mechanism, MathVD is used to retrieve similar entities, and LLM is used to determine whether to merge with a candidate or add as a new entity. A wide range of experiments demonstrate the advanced performance and broad applicability of the AutoMathKG system, including superior reachability query results in MathVD compared to five baselines and robust mathematical reasoning capability in Math LLM.
zh

[AI-2] Robin: A multi-agent system for automating scientific discovery

【速读】:该论文试图解决科学发现过程中各阶段(背景研究、假设生成、实验设计与数据分析)尚未被完全自动化的难题,旨在实现科学探索的半自主化。解决方案的关键在于引入Robin系统,这是一个多智能体系统,能够整合文献检索代理与数据分析代理,从而实现假设生成、实验设计、结果解释及假设更新的全流程自动化。通过该系统,研究人员成功识别出一种治疗干性年龄相关性黄斑变性(dAMD)的新疗法,并验证了其潜在机制。

链接: https://arxiv.org/abs/2505.13400
作者: Ali Essam Ghareeb,Benjamin Chang,Ludovico Mitchener,Angela Yiu,Caralyn J. Szostkiewicz,Jon M. Laurent,Muhammed T. Razzak,Andrew D. White,Michaela M. Hinks,Samuel G. Rodriques
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Scientific discovery is driven by the iterative process of background research, hypothesis generation, experimentation, and data analysis. Despite recent advancements in applying artificial intelligence to scientific discovery, no system has yet automated all of these stages in a single workflow. Here, we introduce Robin, the first multi-agent system capable of fully automating the key intellectual steps of the scientific process. By integrating literature search agents with data analysis agents, Robin can generate hypotheses, propose experiments, interpret experimental results, and generate updated hypotheses, achieving a semi-autonomous approach to scientific discovery. By applying this system, we were able to identify a novel treatment for dry age-related macular degeneration (dAMD), the major cause of blindness in the developed world. Robin proposed enhancing retinal pigment epithelium phagocytosis as a therapeutic strategy, and identified and validated a promising therapeutic candidate, ripasudil. Ripasudil is a clinically-used rho kinase (ROCK) inhibitor that has never previously been proposed for treating dAMD. To elucidate the mechanism of ripasudil-induced upregulation of phagocytosis, Robin then proposed and analyzed a follow-up RNA-seq experiment, which revealed upregulation of ABCA1, a critical lipid efflux pump and possible novel target. All hypotheses, experimental plans, data analyses, and data figures in the main text of this report were produced by Robin. As the first AI system to autonomously discover and validate a novel therapeutic candidate within an iterative lab-in-the-loop framework, Robin establishes a new paradigm for AI-driven scientific discovery.
zh

[AI-3] How Adding Metacognitive Requirements in Support of AI Feedback in Practice Exams Transforms Student Learning Behaviors

【速读】:该论文试图解决在大规模本科STEM课程中提供个性化、详细反馈的持续性挑战。其解决方案的关键在于构建一个整合AI生成反馈与针对性教材引用的实践考试系统,通过要求学生解释答案并声明信心水平来促进元认知行为,并利用OpenAI的GPT-4o生成个性化反馈,同时引导学生查阅相关教材内容。

链接: https://arxiv.org/abs/2505.13381
作者: Mak Ahmad,Prerna Ravi,David Karger,Marc Facciotti
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, to appear in Proceedings of the Twelfth ACM Conference on Learning @ Scale (L@S 2025), July 2025, Palermo, Italy

点击查看摘要

Abstract:Providing personalized, detailed feedback at scale in large undergraduate STEM courses remains a persistent challenge. We present an empirically evaluated practice exam system that integrates AI generated feedback with targeted textbook references, deployed in a large introductory biology course. Our system encourages metacognitive behavior by asking students to explain their answers and declare their confidence. It uses OpenAI’s GPT-4o to generate personalized feedback based on this information, while directing them to relevant textbook sections. Through interaction logs from consenting participants across three midterms (541, 342, and 413 students respectively), totaling 28,313 question-student interactions across 146 learning objectives, along with 279 surveys and 23 interviews, we examined the system’s impact on learning outcomes and engagement. Across all midterms, feedback types showed no statistically significant performance differences, though some trends suggested potential benefits. The most substantial impact came from the required confidence ratings and explanations, which students reported transferring to their actual exam strategies. About 40 percent of students engaged with textbook references when prompted by feedback – far higher than traditional reading rates. Survey data revealed high satisfaction (mean rating 4.1 of 5), with 82.1 percent reporting increased confidence on practiced midterm topics, and 73.4 percent indicating they could recall and apply specific concepts. Our findings suggest that embedding structured reflection requirements may be more impactful than sophisticated feedback mechanisms.
zh

[AI-4] Exploiting Symbolic Heuristics for the Synthesis of Domain-Specific Temporal Planning Guidance using Reinforcement Learning

【速读】:该论文旨在解决在固定领域下,利用强化学习(Reinforcement Learning, RL)生成启发式指导以提升时间规划器性能的问题,特别是在训练问题集已知的情况下。其关键在于通过符号启发式(symbolic heuristic)信息的引入,优化强化学习与规划过程中的策略合成与执行。具体而言,解决方案包括:形式化不同的奖励机制以缓解因处理潜在无限状态马尔可夫决策过程(MDP)而产生的轨迹截断问题;学习现有符号启发式的残差作为“修正项”而非从头学习整个启发式;以及结合符号启发式与学习到的启发式,采用多队列规划方法平衡系统性搜索与不完美学习信息之间的关系。

链接: https://arxiv.org/abs/2505.13372
作者: Irene Brugnara,Alessandro Valentini,Andrea Micheli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work investigated the use of Reinforcement Learning (RL) for the synthesis of heuristic guidance to improve the performance of temporal planners when a domain is fixed and a set of training problems (not plans) is given. The idea is to extract a heuristic from the value function of a particular (possibly infinite-state) MDP constructed over the training problems. In this paper, we propose an evolution of this learning and planning framework that focuses on exploiting the information provided by symbolic heuristics during both the RL and planning phases. First, we formalize different reward schemata for the synthesis and use symbolic heuristics to mitigate the problems caused by the truncation of episodes needed to deal with the potentially infinite MDP. Second, we propose learning a residual of an existing symbolic heuristic, which is a “correction” of the heuristic value, instead of eagerly learning the whole heuristic from scratch. Finally, we use the learned heuristic in combination with a symbolic heuristic using a multiple-queue planning approach to balance systematic search with imperfect learned information. We experimentally compare all the approaches, highlighting their strengths and weaknesses and significantly advancing the state of the art for this planning and learning schema. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2505.13372 [cs.AI] (or arXiv:2505.13372v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.13372 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-5] One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling

【速读】:该论文旨在解决基于扩散的生成模型在迭代采样过程中计算成本过高的问题,其解决方案的关键在于提出一种基于Koopman理论的离线蒸馏方法——Koopman Distillation Model (KDM)。KDM通过将噪声输入编码到嵌入空间,并利用学习到的线性算子进行前向传播,再通过解码器重建干净样本,从而实现单步生成,同时保持语义一致性。该方法的核心创新在于利用Koopman理论对非线性动力系统进行线性表示,从而提升生成效率并保证生成质量。

链接: https://arxiv.org/abs/2505.13358
作者: Nimrod Berman,Ilan Naiman,Moshe Eliasof,Hedi Zisling,Omri Azencot
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model KDM, a novel offline distillation approach grounded in Koopman theory-a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. Empirically, KDM achieves state-of-the-art performance across standard offline distillation benchmarks, improving FID scores by up to 40% in a single generation step. All implementation details and code for the experimental setups are provided in our GitHub - this https URL, or in our project page - this https URL.
zh

[AI-6] Multi-Armed Bandits Meet Large Language Models

【速读】:该论文试图解决如何将强化学习中的多臂老虎机算法(Bandit algorithms)与大语言模型(Large Language Models, LLMs)进行有效结合,以提升决策制定和自然语言处理任务的性能问题。其解决方案的关键在于利用老虎机算法在探索与利用之间的平衡能力优化LLMs的微调、提示工程和自适应响应生成,并通过LLMs的上下文理解、动态适应能力和自然语言推理来增强老虎机算法的策略选择与决策效率。

链接: https://arxiv.org/abs/2505.13355
作者: Djallel Bouneffouf,Raphael Feraud
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bandit algorithms and Large Language Models (LLMs) have emerged as powerful tools in artificial intelligence, each addressing distinct yet complementary challenges in decision-making and natural language processing. This survey explores the synergistic potential between these two fields, highlighting how bandit algorithms can enhance the performance of LLMs and how LLMs, in turn, can provide novel insights for improving bandit-based decision-making. We first examine the role of bandit algorithms in optimizing LLM fine-tuning, prompt engineering, and adaptive response generation, focusing on their ability to balance exploration and exploitation in large-scale learning tasks. Subsequently, we explore how LLMs can augment bandit algorithms through advanced contextual understanding, dynamic adaptation, and improved policy selection using natural language reasoning. By providing a comprehensive review of existing research and identifying key challenges and opportunities, this survey aims to bridge the gap between bandit algorithms and LLMs, paving the way for innovative applications and interdisciplinary research in AI.
zh

[AI-7] OPA-Pack: Object-Property-Aware Robotic Bin Packing

【速读】:该论文旨在解决传统机器人分拣包装中仅关注物体形状以优化包装紧凑性,而忽视物体属性(如易碎性、可食用性和化学性质)的问题。其关键解决方案是提出OPA-Pack框架,该框架通过引入物体属性感知的包装规划方法,结合检索增强生成与思维链推理的物体属性识别方案,以及基于属性嵌入层、易碎性高度图和避让高度图的OPA-Net网络结构,实现对不相容物体对的分离和对易碎物体压力的降低,同时保持良好的包装紧凑性。

链接: https://arxiv.org/abs/2505.13339
作者: Jia-Hui Pan,Yeok Tatt Cheah,Zhengzhe Liu,Ka-Hei Hui,Xiaojie Gao,Pheng-Ann Heng,Yun-Hui Liu,Chi-Wing Fu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Transactions on Robotics (TRO) on Feb. 10, 2025

点击查看摘要

Abstract:Robotic bin packing aids in a wide range of real-world scenarios such as e-commerce and warehouses. Yet, existing works focus mainly on considering the shape of objects to optimize packing compactness and neglect object properties such as fragility, edibility, and chemistry that humans typically consider when packing objects. This paper presents OPA-Pack (Object-Property-Aware Packing framework), the first framework that equips the robot with object property considerations in planning the object packing. Technical-wise, we develop a novel object property recognition scheme with retrieval-augmented generation and chain-of-thought reasoning, and build a dataset with object property annotations for 1,032 everyday objects. Also, we formulate OPA-Net, aiming to jointly separate incompatible object pairs and reduce pressure on fragile objects, while compacting the packing. Further, OPA-Net consists of a property embedding layer to encode the property of candidate objects to be packed, together with a fragility heightmap and an avoidance heightmap to keep track of the packed objects. Then, we design a reward function and adopt a deep Q-learning scheme to train OPA-Net. Experimental results manifest that OPA-Pack greatly improves the accuracy of separating incompatible object pairs (from 52% to 95%) and largely reduces pressure on fragile objects (by 29.4%), while maintaining good packing compactness. Besides, we demonstrate the effectiveness of OPA-Pack on a real packing platform, showcasing its practicality in real-world scenarios.
zh

[AI-8] Recommender Systems for Democracy: Toward Adversarial Robustness in Voting Advice Applications IJCAI2025

【速读】:该论文试图解决生成式 AI (Generative AI) 驱动的投票建议应用(Voting Advice Applications, VAAs)在面临恶意实体攻击时可能对民主进程造成的潜在风险。研究揭示了11种操纵策略,并通过瑞士主要VAAs——Smartvote的数据评估了其影响,发现调整应用参数、选择性地使用问卷项目以及微调政党或候选人的回答均可显著改变推荐频率。解决方案的关键在于提出VAAs应满足的对抗鲁棒性属性,引入用于评估不同匹配方法抗 manipulation 能力的实证指标,并探索缓解操纵效应的研究方向,从而为未来安全可靠的AI驱动VAAs提供保障。

链接: https://arxiv.org/abs/2505.13329
作者: Frédéric Berdoz,Dustin Brunner,Yann Vonlanthen,Roger Wattenhofer
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: This is the extended version of the paper, accepted at IJCAI 2025

点击查看摘要

Abstract:Voting advice applications (VAAs) help millions of voters understand which political parties or candidates best align with their views. This paper explores the potential risks these applications pose to the democratic process when targeted by adversarial entities. In particular, we expose 11 manipulation strategies and measure their impact using data from Switzerland’s primary VAA, Smartvote, collected during the last two national elections. We find that altering application parameters, such as the matching method, can shift a party’s recommendation frequency by up to 105%. Cherry-picking questionnaire items can increase party recommendation frequency by over 261%, while subtle changes to parties’ or candidates’ responses can lead to a 248% increase. To address these vulnerabilities, we propose adversarial robustness properties VAAs should satisfy, introduce empirical metrics for assessing the resilience of various matching methods, and suggest possible avenues for research toward mitigating the effect of manipulation. Our framework is key to ensuring secure and reliable AI-based VAAs poised to emerge in the near future.
zh

[AI-9] KHRONOS: a Kernel-Based Neural Architecture for Rapid Resource-Efficient Scientific Computation

【速读】:该论文旨在解决高维物理系统建模中面临的维度灾难(curse of dimensionality)以及对密集数据的依赖问题。其解决方案的关键在于提出一种名为KHRONOS(Kernel Expansion Hierarchy for Reduced Order, Neural Optimized Surrogates)的人工智能框架,该框架通过分维度核展开的分层组合构建连续可微的目标场,并将其张量化为模式后叠加,从而实现高效、精确的模型构建与预测。

链接: https://arxiv.org/abs/2505.13315
作者: Reza T. Batley,Sourav Saha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS)
备注:

点击查看摘要

Abstract:Contemporary models of high dimensional physical systems are constrained by the curse of dimensionality and a reliance on dense data. We introduce KHRONOS (Kernel Expansion Hierarchy for Reduced Order, Neural Optimized Surrogates), an AI framework for model based, model free and model inversion tasks. KHRONOS constructs continuously differentiable target fields with a hierarchical composition of per-dimension kernel expansions, which are tensorized into modes and then superposed. We evaluate KHRONOS on a canonical 2D, Poisson equation benchmark: across 16 to 512 degrees of freedom (DoFs), it obtained L2 square errors of 5e-4 down to 6e-10. This represents a 100 time gain over Kolmogorov Arnold Networks (which itself reports a 100 times improvement on MLPs/PINNs with 100 times fewer parameters) when controlling for the number of parameters. This also represents a 1e4 times improvement in L2 square error compared to standard linear FEM at comparable DoFs. Inference complexity is dominated by inner products, yielding sub-millisecond full-field predictions that scale to an arbitrary resolution. For inverse problems, KHRONOS facilitates rapid, iterative level set recovery in only a few forward evaluations, with sub-microsecond per sample latency. KHRONOS scalability, expressivity, and interpretability open new avenues in constrained edge computing, online control, computer vision, and beyond.
zh

[AI-10] Cross-Cloud Data Privacy Protection: Optimizing Collaborative Mechanisms of AI Systems by Integrating Federated Learning and LLM s

【速读】:该论文旨在解决在云计算环境中共享敏感数据时的数据隐私保护问题,以及如何优化跨云环境的协作机制。其解决方案的关键在于将联邦学习(Federated Learning)与大规模语言模型相结合,构建一种跨云架构,在不暴露原始数据的前提下通过聚合分布式节点的模型更新来优化AI系统的协作机制,同时利用大规模语言模型强大的上下文和语义理解能力提升模型训练效率和决策能力。此外,引入安全通信层以保障模型更新和训练数据的隐私与完整性,从而实现跨云环境下的持续模型适应与微调。

链接: https://arxiv.org/abs/2505.13292
作者: Huaiying Luo,Cheng Ji
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted by 2025 IEEE 7th International Conference on Communications, Information System and Computer Engineering

点击查看摘要

Abstract:In the age of cloud computing, data privacy protection has become a major challenge, especially when sharing sensitive data across cloud environments. However, how to optimize collaboration across cloud environments remains an unresolved problem. In this paper, we combine federated learning with large-scale language models to optimize the collaborative mechanism of AI systems. Based on the existing federated learning framework, we introduce a cross-cloud architecture in which federated learning works by aggregating model updates from decentralized nodes without exposing the original data. At the same time, combined with large-scale language models, its powerful context and semantic understanding capabilities are used to improve model training efficiency and decision-making ability. We’ve further innovated by introducing a secure communication layer to ensure the privacy and integrity of model updates and training data. The model enables continuous model adaptation and fine-tuning across different cloud environments while protecting sensitive data. Experimental results show that the proposed method is significantly better than the traditional federated learning model in terms of accuracy, convergence speed and data privacy protection.
zh

[AI-11] meSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents

【速读】:该论文试图解决现有基准测试在可扩展性、任务多样性以及研究成果评估维度上的不足,这些问题限制了AI代理在时间序列机器学习工程实践中的有效评估。解决方案的关键在于构建一个可扩展的基准测试框架——TimeSeriesGym,该框架通过两个核心维度提升评估的相关性:一是整合来自多个领域和任务的多样化挑战,以评估AI代理的综合能力;二是实现对多种研究成果(如提交文件、代码和模型)的评估机制,结合精确数值指标与基于大语言模型(LLM)的灵活评估方法,从而平衡客观评价与情境判断。

链接: https://arxiv.org/abs/2505.13291
作者: Yifu Cai,Xinyu Li,Mononito Goswami,Michał Wiliński,Gus Welter,Artur Dubrawski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Open source code available at this https URL . YC, XL, MG and MW contributed equally, and should be considered joint first authors

点击查看摘要

Abstract:We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents on time series machine learning engineering challenges. Existing benchmarks lack scalability, focus narrowly on model building in well-defined settings, and evaluate only a limited set of research artifacts (e.g., CSV submission files). To make AI agent benchmarking more relevant to the practice of machine learning engineering, our framework scales along two critical dimensions. First, recognizing that effective ML engineering requires a range of diverse skills, TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks. We design challenges to evaluate both isolated capabilities (including data handling, understanding research repositories, and code translation) and their combinations, and rather than addressing each challenge independently, we develop tools that support designing multiple challenges at scale. Second, we implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models, using both precise numeric measures and more flexible LLM-based evaluation approaches. This dual strategy balances objective assessment with contextual judgment. Although our initial focus is on time series applications, our framework can be readily extended to other data modalities, broadly enhancing the comprehensiveness and practical utility of agentic AI evaluation. We open-source our benchmarking framework to facilitate future research on the ML engineering capabilities of AI agents.
zh

[AI-12] Level Generation with Quantum Reservoir Computing

【速读】:该论文试图解决如何将量子储备计算(Quantum Reservoir Computing)从生成音乐作品变体的初始设计,适应到实时生成《超级马里奥兄弟》(Super Mario Bros.)关卡的问题。其解决方案的关键在于利用超导量子比特硬件实现课程的实时生成,并在此过程中探索此类实时生成所面临的约束。

链接: https://arxiv.org/abs/2505.13287
作者: João S. Ferreira,Pierre Fromholz,Hari Shaji,James R. Wootton
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Reservoir computing is a form of machine learning particularly suited for time series analysis, including forecasting predictions. We take an implementation of \emphquantum reservoir computing that was initially designed to generate variants of musical scores and adapt it to create levels of Super Mario Bros. Motivated by our analysis of these levels, we develop a new Roblox \textitobby where the courses can be generated in real time on superconducting qubit hardware, and investigate some of the constraints placed by such real-time generation.
zh

[AI-13] FlowPure: Continuous Normalizing Flows for Adversarial Purification

【速读】:该论文试图解决机器学习系统中对抗鲁棒性不足的问题,特别是在推理阶段去除对抗扰动的挑战。其解决方案的关键在于提出一种基于连续归一化流(Continuous Normalizing Flows, CNFs)的新型净化方法FlowPure,该方法通过条件流匹配(Conditional Flow Matching, CFM)学习从对抗样本到干净样本的映射,相较于以往依赖固定噪声过程的扩散模型方法,FlowPure能够利用特定攻击知识提升已知威胁下的鲁棒性,并支持在缺乏攻击知识时使用高斯扰动训练的通用随机变体。

链接: https://arxiv.org/abs/2505.13280
作者: Elias Collaert,Abel Rodríguez,Sander Joos,Lieven Desmet,Vera Rimmer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Despite significant advancements in the area, adversarial robustness remains a critical challenge in systems employing machine learning models. The removal of adversarial perturbations at inference time, known as adversarial purification, has emerged as a promising defense strategy. To achieve this, state-of-the-art methods leverage diffusion models that inject Gaussian noise during a forward process to dilute adversarial perturbations, followed by a denoising step to restore clean samples before classification. In this work, we propose FlowPure, a novel purification method based on Continuous Normalizing Flows (CNFs) trained with Conditional Flow Matching (CFM) to learn mappings from adversarial examples to their clean counterparts. Unlike prior diffusion-based approaches that rely on fixed noise processes, FlowPure can leverage specific attack knowledge to improve robustness under known threats, while also supporting a more general stochastic variant trained on Gaussian perturbations for settings where such knowledge is unavailable. Experiments on CIFAR-10 and CIFAR-100 demonstrate that our method outperforms state-of-the-art purification-based defenses in preprocessor-blind and white-box scenarios, and can do so while fully preserving benign accuracy in the former. Moreover, our results show that not only is FlowPure a highly effective purifier but it also holds a strong potential for adversarial detection, identifying preprocessor-blind PGD samples with near-perfect accuracy.
zh

[AI-14] Seeing the Unseen: How EMoE Unveils Bias in Text-to-Image Diffusion Models

【速读】:该论文试图解决文本到图像扩散模型中不确定性估计的问题,这一问题由于模型参数量大(通常超过1亿)以及在高维空间中的复杂操作而尤为困难。解决方案的关键在于提出一种名为Epistemic Mixture of Experts (EMoE)的新框架,该框架通过利用预训练网络而不需额外训练,实现从提示中直接估计认知不确定性,并在扩散过程中的潜在空间中更好地捕捉认知不确定性。

链接: https://arxiv.org/abs/2505.13273
作者: Lucas Berry,Axel Brando,Wei-Di Chang,Juan Camilo Gamboa Higuera,David Meger
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Estimating uncertainty in text-to-image diffusion models is challenging because of their large parameter counts (often exceeding 100 million) and operation in complex, high-dimensional spaces with virtually infinite input possibilities. In this paper, we propose Epistemic Mixture of Experts (EMoE), a novel framework for efficiently estimating epistemic uncertainty in diffusion models. EMoE leverages pre-trained networks without requiring additional training, enabling direct uncertainty estimation from a prompt. We leverage a latent space within the diffusion process that captures epistemic uncertainty better than existing methods. Experimental results on the COCO dataset demonstrate EMoE’s effectiveness, showing a strong correlation between uncertainty and image quality. Additionally, EMoE identifies under-sampled languages and regions with higher uncertainty, revealing hidden biases in the training set. This capability demonstrates the relevance of EMoE as a tool for addressing fairness and accountability in AI-generated content.
zh

[AI-15] Net-Zero: A Comparative Study on Neural Network Design for Climate-Economic PDEs Under Uncertainty

【速读】:该论文旨在解决在不确定性下进行气候-经济建模所面临的显著计算挑战,这些挑战可能限制政策制定者有效应对气候变化的能力。其解决方案的关键在于采用基于神经网络的方法来求解高维最优控制问题,这些问题源于包含风险规避因素的气候减缓决策模型。通过构建一个考虑多种减缓路径(包括无排放资本和碳强度降低)的连续时间内生增长经济模型,并利用神经网络架构对传统有限差分方法生成的解进行基准测试,研究展示了合适的神经网络架构选择对于提高模型在不确定性下的求解精度和计算效率具有重要影响。

链接: https://arxiv.org/abs/2505.13264
作者: Carlos Rodriguez-Pardo,Louis Daumas,Leonardo Chiani,Massimo Tavoni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Performance (cs.PF); Analysis of PDEs (math.AP)
备注: Under review

点击查看摘要

Abstract:Climate-economic modeling under uncertainty presents significant computational challenges that may limit policymakers’ ability to address climate change effectively. This paper explores neural network-based approaches for solving high-dimensional optimal control problems arising from models that incorporate ambiguity aversion in climate mitigation decisions. We develop a continuous-time endogenous-growth economic model that accounts for multiple mitigation pathways, including emission-free capital and carbon intensity reductions. Given the inherent complexity and high dimensionality of these models, traditional numerical methods become computationally intractable. We benchmark several neural network architectures against finite-difference generated solutions, evaluating their ability to capture the dynamic interactions between uncertainty, technology transitions, and optimal climate policy. Our findings demonstrate that appropriate neural architecture selection significantly impacts both solution accuracy and computational efficiency when modeling climate-economic systems under uncertainty. These methodological advances enable more sophisticated modeling of climate policy decisions, allowing for better representation of technology transitions and uncertainty-critical elements for developing effective mitigation strategies in the face of climate change.
zh

[AI-16] Composing Dextrous Grasping and In-hand Manipulation via Scoring with a Reinforcement Learning Critic

【速读】:该论文试图解决机器人在手操作(in-hand manipulation)中,由于依赖人工放置物体至合适初始抓取状态而导致的现实场景应用受限问题。其核心挑战在于如何找到既稳定又能促进所需操作目标的初始抓取状态。解决方案的关键在于利用为在手操作训练的强化学习智能体的评论家网络(critic network),对初始抓取进行评分和选择,从而提升操作成功率,且无需额外训练。

链接: https://arxiv.org/abs/2505.13253
作者: Lennart Röstel,Dominik Winkelbauer,Johannes Pitz,Leon Sievers,Berthold Bäuml
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In-hand manipulation and grasping are fundamental yet often separately addressed tasks in robotics. For deriving in-hand manipulation policies, reinforcement learning has recently shown great success. However, the derived controllers are not yet useful in real-world scenarios because they often require a human operator to place the objects in suitable initial (grasping) states. Finding stable grasps that also promote the desired in-hand manipulation goal is an open problem. In this work, we propose a method for bridging this gap by leveraging the critic network of a reinforcement learning agent trained for in-hand manipulation to score and select initial grasps. Our experiments show that this method significantly increases the success rate of in-hand manipulation without requiring additional training. We also present an implementation of a full grasp manipulation pipeline on a real-world system, enabling autonomous grasping and reorientation even of unwieldy objects.
zh

[AI-17] Agent ic Publications: An LLM -Driven Framework for Interactive Scientific Publishing Supplementing Traditional Papers with AI-Powered Knowledge Systems

【速读】:该论文试图解决科学文献指数级增长所带来的研究者在复杂知识体系中导航的挑战,其解决方案的关键在于提出“Agentic Publications”框架,该框架基于大语言模型(LLM)驱动,将传统论文转化为交互式知识系统。该架构通过检索增强生成和多智能体验证,实现结构化数据与非结构化内容的整合,并提供面向人类和机器的接口,结合叙述性解释与机器可读输出,同时通过自动化验证和透明治理解决伦理问题。

链接: https://arxiv.org/abs/2505.13246
作者: Roberto Pugliese,George Kourousias,Francesco Venier,Grazia Garlatti Costa
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The exponential growth of scientific literature presents significant challenges for researchers navigating the complex knowledge landscape. We propose “Agentic Publications”, a novel LLM-driven framework complementing traditional publishing by transforming papers into interactive knowledge systems. Our architecture integrates structured data with unstructured content through retrieval-augmented generation and multi-agent verification. The framework offers interfaces for both humans and machines, combining narrative explanations with machine-readable outputs while addressing ethical considerations through automated validation and transparent governance. Key features include continuous knowledge updates, automatic integration of new findings, and customizable detail levels. Our proof-of-concept demonstrates multilingual interaction, API accessibility, and structured knowledge representation through vector databases, knowledge graphs, and verification agents. This approach enhances scientific communication across disciplines, improving efficiency and collaboration while preserving traditional publishing pathways, particularly valuable for interdisciplinary fields where knowledge integration remains challenging.
zh

[AI-18] A Physics-Inspired Optimizer: Velocity Regularized Adam

【速读】:该论文试图解决深度神经网络训练过程中优化算法在稳定性边缘运行导致的快速振荡和损失收敛速度变慢的问题。解决方案的关键在于引入一种基于速度的正则化项,即Velocity-Regularized Adam (VRAdam),该方法通过在学习率上施加高阶惩罚,使得当权重更新较大时算法自动减小学习率,从而有效抑制振荡并允许在必要时采用更激进的基础步长而不引发发散。

链接: https://arxiv.org/abs/2505.13196
作者: Pranav Vaidhyanathan,Lucas Schorling,Natalia Ares,Michael A. Osborne
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注: L. Schorling and P. Vaidhyanathan contributed equally to this work. 20 pages, 13 figures

点击查看摘要

Abstract:We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so called adaptive edge of stability regime during training leading to rapid oscillations and slowed convergence of loss. However, VRAdam adds a higher order penalty on the learning rate based on the velocity such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations and allowing for a more aggressive base step size when necessary without divergence. By combining this velocity-based regularizer for global damping with per-parameter scaling of Adam to create a hybrid optimizer, we demonstrate that VRAdam consistently exceeds the performance against standard optimizers including AdamW. We benchmark various tasks such as image classification, language modeling, image generation and generative modeling using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.
zh

[AI-19] Adversarial Testing in LLM s: Insights into Decision-Making Vulnerabilities

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在现实决策系统中行为脆弱性的问题,特别是在面对对抗性操控或动态环境时的适应性和策略灵活性不足。解决方案的关键在于提出一种对抗性评估框架,该框架通过认知心理学和博弈论的方法,系统地对LLMs的决策过程进行压力测试,以揭示其在探索与利用权衡、社会合作及策略灵活性方面的表现,从而为可信AI的部署提供诊断决策弱点的方法论支持。

链接: https://arxiv.org/abs/2505.13195
作者: Lili Zhang,Haomiaomiao Wang,Long Cheng,Libao Deng,Tomas Ward
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) become increasingly integrated into real-world decision-making systems, understanding their behavioural vulnerabilities remains a critical challenge for AI safety and alignment. While existing evaluation metrics focus primarily on reasoning accuracy or factual correctness, they often overlook whether LLMs are robust to adversarial manipulation or capable of using adaptive strategy in dynamic environments. This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of LLMs under interactive and adversarial conditions. Drawing on methodologies from cognitive psychology and game theory, our framework probes how models respond in two canonical tasks: the two-armed bandit task and the Multi-Round Trust Task. These tasks capture key aspects of exploration-exploitation trade-offs, social cooperation, and strategic flexibility. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3, revealing model-specific susceptibilities to manipulation and rigidity in strategy adaptation. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment. Rather than offering a performance benchmark, this work proposes a methodology for diagnosing decision-making weaknesses in LLM-based agents, providing actionable insights for alignment and safety research.
zh

[AI-20] rue Zero-Shot Inference of Dynamical Systems Preserving Long-Term Statistics

【速读】:该论文试图解决动态系统(Dynamical Systems, DS)重构(DSR)中现有方法需要针对新系统进行专门训练的问题,缺乏零样本和上下文推理能力。其解决方案的关键在于提出DynaMix,一种基于多变量ALRNN的专家混合架构,该架构通过预训练实现了对未见领域动态系统的零样本泛化能力,能够在不进行微调的情况下准确预测新型动态系统的长期演化,且在参数量和推理速度上具有显著优势。

链接: https://arxiv.org/abs/2505.13192
作者: Christoph Jürgen Hemmer,Daniel Durstewitz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)
备注:

点击查看摘要

Abstract:Complex, temporally evolving phenomena, from climate to brain activity, are governed by dynamical systems (DS). DS reconstruction (DSR) seeks to infer generative surrogate models of these from observed data, reproducing their long-term behavior. Existing DSR approaches require purpose-training for any new system observed, lacking the zero-shot and in-context inference capabilities known from LLMs. Here we introduce DynaMix, a novel multivariate ALRNN-based mixture-of-experts architecture pre-trained for DSR, the first DSR model able to generalize zero-shot to out-of-domain DS. Just from a provided context signal, without any re-training, DynaMix faithfully forecasts the long-term evolution of novel DS where existing time series (TS) foundation models, like Chronos, fail – at a fraction of the number of parameters and orders of magnitude faster inference times. DynaMix outperforms TS foundation models in terms of long-term statistics, and often also short-term forecasts, even on real-world time series, like traffic or weather data, typically used for training and evaluating TS models, but not at all part of DynaMix’ training corpus. We illustrate some of the failure modes of TS models for DSR problems, and conclude that models built on DS principles may bear a huge potential also for advancing the TS prediction field.
zh

[AI-21] When a Reinforcement Learning Agent Encounters Unknown Unknowns

【速读】:该论文试图解决强化学习中智能体在遇到未知未知状态(unknown unknown)时无法有效处理的问题,即当智能体基于定义在已知领域上的价值函数Q和V采取行动后,意外进入未意识到的状态。解决方案的关键在于提出一种带有不断扩展意识的事件性马尔可夫决策过程(episodic Markov decision process with growing awareness, EMDP-GA)模型,并采用非信息性价值扩展(noninformative value expansion, NIVE)方法,通过将新发现状态的价值函数初始化为已知领域的平均值,以尊重该状态完全缺乏知识的特性。

链接: https://arxiv.org/abs/2505.13188
作者: Juntian Zhu,Miguel de Carvalho,Zhouwang Yang,Fengxiang He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:An AI agent might surprisingly find she has reached an unknown state which she has never been aware of – an unknown unknown. We mathematically ground this scenario in reinforcement learning: an agent, after taking an action calculated from value functions Q and V defined on the \it aware domain, reaches a state out of the domain. To enable the agent to handle this scenario, we propose an \it episodic Markov decision process with growing awareness (EMDP-GA) model, taking a new \it noninformative value expansion (NIVE) approach to expand value functions to newly aware areas: when an agent arrives at an unknown unknown, value functions Q and V whereon are initialised by noninformative beliefs – the averaged values on the aware domain. This design is out of respect for the complete absence of knowledge in the newly discovered state. The upper confidence bound momentum Q-learning is then adapted to the growing awareness for training the EMDP-GA model. We prove that (1) the regret of our approach is asymptotically consistent with the state of the art (SOTA) without exposure to unknown unknowns in an extremely uncertain environment, and (2) our computational complexity and space complexity are comparable with the SOTA – these collectively suggest that though an unknown unknown is surprising, it will be asymptotically properly discovered with decent speed and an affordable cost.
zh

[AI-22] Information Science Principles of Machine Learning: A Causal Chain Meta-Framework Based on Formalized Information Mapping

【速读】:该论文试图解决机器学习领域缺乏统一的正式理论框架,以及模型可解释性和伦理安全保证不足的问题。其解决方案的关键在于构建一个形式化的信息模型,通过良构公式集明确界定机器学习中典型组件的本体状态和载体映射,并引入可学习和可处理的谓词及学习与处理函数,以分析模型中因果链的逻辑演绎和约束规则,从而建立机器学习理论的元框架(MLT-MF)。

链接: https://arxiv.org/abs/2505.13182
作者: Jianfeng Xu
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:[Objective] This study focuses on addressing the current lack of a unified formal theoretical framework in machine learning, as well as the deficiencies in interpretability and ethical safety assurance. [Methods] A formal information model is first constructed, utilizing sets of well-formed formulas to explicitly define the ontological states and carrier mappings of typical components in machine learning. Learnable and processable predicates, along with learning and processing functions, are introduced to analyze the logical deduction and constraint rules of the causal chains within models. [Results] A meta-framework for machine learning theory (MLT-MF) is established. Based on this framework, universal definitions for model interpretability and ethical safety are proposed. Furthermore, three key theorems are proved: the equivalence of model interpretability and information recoverability, the assurance of ethical safety, and the estimation of generalization error. [Limitations] The current framework assumes ideal conditions with noiseless information-enabling mappings and primarily targets model learning and processing logic in static scenarios. It does not yet address information fusion and conflict resolution across ontological spaces in multimodal or multi-agent systems. [Conclusions] This work overcomes the limitations of fragmented research and provides a unified theoretical foundation for systematically addressing the critical challenges currently faced in machine learning.
zh

[AI-23] ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

【速读】:该论文试图解决在视觉规划任务中,如何有效结合符号化规划与视觉语言模型(Vision-Language Models, VLMs)以实现可验证和具象化的计划问题。其解决方案的关键在于引入ViPlan,首个面向符号谓词与VLM的视觉规划开源基准,通过构建两个领域的复杂任务(经典积木世界问题的视觉变体和模拟家庭机器人环境),提供统一的评估环境与协议,从而系统比较基于VLM的符号化规划方法与直接使用VLM生成动作的方法的性能差异。

链接: https://arxiv.org/abs/2505.13180
作者: Matteo Merler,Nicola Dainese,Minttu Alakuijala,Giovanni Bonetta,Pietro Ferrazzi,Yu Tian,Bernardo Magnini,Pekka Marttinen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures and 1 table in the main text; 43 pages, 9 figures and 16 tables including supplementary material

点击查看摘要

Abstract:Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and using the models directly to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.
zh

[AI-24] Enhancing LLM s for Time Series Forecasting via Structure-Guided Cross-Modal Alignment

【速读】:该论文试图解决时间序列预测中跨模态对齐的问题,即如何有效利用预训练大语言模型(Large Language Models, LLMs)的全局序列建模能力,以提升时间序列数据的语言特性表示和泛化能力。现有方法主要依赖于局部token级或层级特征映射,忽视了LLMs在建模整体序列结构上的核心优势。解决方案的关键在于提出结构引导的跨模态对齐(Structure-Guided Cross-Modal Alignment, SGCMA),通过学习文本数据中的状态转移矩阵,并将其与时间序列数据进行序列级结构一致性对齐,从而赋予时间序列语言般的动态特性,实现更优的跨模态对齐效果。

链接: https://arxiv.org/abs/2505.13175
作者: Siming Sun,Kai Zhang,Xuejun Jiang,Wenchao Meng,Qinmin Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emerging paradigm of leveraging pretrained large language models (LLMs) for time series forecasting has predominantly employed linguistic-temporal modality alignment strategies through token-level or layer-wise feature mapping. However, these approaches fundamentally neglect a critical insight: the core competency of LLMs resides not merely in processing localized token features but in their inherent capacity to model holistic sequence structures. This paper posits that effective cross-modal alignment necessitates structural consistency at the sequence level. We propose the Structure-Guided Cross-Modal Alignment (SGCMA), a framework that fully exploits and aligns the state-transition graph structures shared by time-series and linguistic data as sequential modalities, thereby endowing time series with language-like properties and delivering stronger generalization after modality alignment. SGCMA consists of two key components, namely Structure Alignment and Semantic Alignment. In Structure Alignment, a state transition matrix is learned from text data through Hidden Markov Models (HMMs), and a shallow transformer-based Maximum Entropy Markov Model (MEMM) receives the hot-start transition matrix and annotates each temporal patch into state probability, ensuring that the temporal representation sequence inherits language-like sequential dynamics. In Semantic Alignment, cross-attention is applied between temporal patches and the top-k tokens within each state, and the ultimate temporal embeddings are derived by the expected value of these embeddings using a weighted average based on state probabilities. Experiments on multiple benchmarks demonstrate that SGCMA achieves state-of-the-art performance, offering a novel approach to cross-modal alignment in time series forecasting.
zh

[AI-25] mporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning ICML

【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, RL)中由于分布外(Out-of-Distribution, OOD)样本导致的性能下降问题,特别是在稀疏奖励和长时序任务中的表现不佳问题。其解决方案的关键在于提出一种名为时间距离感知的过渡增强框架(Temporal Distance-Aware Transition Augmentation, TempDATA),该框架在时间结构化的潜在空间中生成增强的转移样本,而非直接在原始状态空间中进行增强,从而更好地捕捉长期行为的时序特性。

链接: https://arxiv.org/abs/2505.13144
作者: Dongsu Lee,Minhae Kwon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 2025 ICML

点击查看摘要

Abstract:The goal of offline reinforcement learning (RL) is to extract a high-performance policy from the fixed datasets, minimizing performance degradation due to out-of-distribution (OOD) samples. Offline model-based RL (MBRL) is a promising approach that ameliorates OOD issues by enriching state-action transitions with augmentations synthesized via a learned dynamics model. Unfortunately, seminal offline MBRL methods often struggle in sparse-reward, long-horizon tasks. In this work, we introduce a novel MBRL framework, dubbed Temporal Distance-Aware Transition Augmentation (TempDATA), that generates augmented transitions in a temporally structured latent space rather than in raw state space. To model long-horizon behavior, TempDATA learns a latent abstraction that captures a temporal distance from both trajectory and transition levels of state space. Our experiments confirm that TempDATA outperforms previous offline MBRL methods and achieves matching or surpassing the performance of diffusion-based trajectory augmentation and goal-conditioned RL on the D4RL AntMaze, FrankaKitchen, CALVIN, and pixel-based FrankaKitchen.
zh

[AI-26] μPC: Scaling Predictive Coding to 100 Layer Networks

【速读】:该论文试图解决预测编码(Predictive Coding, PC)等基于局部信息的替代算法在训练深度网络时存在的生物不合理性与可扩展性问题,这些问题限制了其在大规模场景中与反向传播(Backpropagation, BP)竞争的能力。论文提出的解决方案的关键在于采用Depth-μP参数化方法(称为“μPC”),通过对其在深度预测编码网络(PCNs)中的缩放行为进行深入分析,揭示了导致标准PCNs难以训练的多种病理现象,并证明μPC能够在无需大量调参的情况下,实现高达128层的残差网络的稳定训练,同时具备权重和活动学习率的零样本迁移能力。

链接: https://arxiv.org/abs/2505.13124
作者: Francesco Innocenti,El Mehdi Achour,Christopher L. Buckley
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 34 pages, 41 figures

点击查看摘要

Abstract:The biological implausibility of backpropagation (BP) has motivated many alternative, brain-inspired algorithms that attempt to rely only on local information, such as predictive coding (PC) and equilibrium propagation. However, these algorithms have notoriously struggled to train very deep networks, preventing them from competing with BP in large-scale settings. Indeed, scaling PC networks (PCNs) has recently been posed as a challenge for the community (Pinchetti et al., 2024). Here, we show that 100+ layer PCNs can be trained reliably using a Depth- \mu P parameterisation (Yang et al., 2023; Bordelon et al., 2023) which we call " \mu PC". Through an extensive analysis of the scaling behaviour of PCNs, we reveal several pathologies that make standard PCNs difficult to train at large depths. We then show that, despite addressing only some of these instabilities, \mu PC allows stable training of very deep (up to 128-layer) residual networks on simple classification tasks with competitive performance and little tuning compared to current benchmarks. Moreover, \mu PC enables zero-shot transfer of both weight and activity learning rates across widths and depths. Our results have implications for other local algorithms and could be extended to convolutional and transformer architectures. Code for \mu PC is made available as part of a JAX library for PCNs at this https URL (Innocenti et al., 2024).
zh

[AI-27] When majority rules minority loses: bias amplification of gradient descent

【速读】:该论文试图解决机器学习中偏差放大(bias amplification)的理论基础不明确的问题,特别是在多数-少数群体学习任务中,标准训练过程可能倾向于偏好多数群体并产生忽视少数群体特性的刻板预测器。其解决方案的关键在于构建一个形式化框架,通过假设总体和方差不平衡,揭示了三个关键发现:(i)“全数据”预测器与刻板预测器之间的紧密接近性,(ii)训练整个模型时主要学习多数群体特征的主导区域,(iii)额外训练所需的下限。这些分析为理解偏差放大的机制提供了理论支持。

链接: https://arxiv.org/abs/2505.13122
作者: François Bachoc(IMT),Jérôme Bolte(TSE-R),Ryan Boustany(TSE-R),Jean-Michel Loubes(IMT)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Despite growing empirical evidence of bias amplification in machine learning, its theoretical foundations remain poorly understood. We develop a formal framework for majority-minority learning tasks, showing how standard training can favor majority groups and produce stereotypical predictors that neglect minority-specific features. Assuming population and variance imbalance, our analysis reveals three key findings: (i) the close proximity between ``full-data’’ and stereotypical predictors, (ii) the dominance of a region where training the entire model tends to merely learn the majority traits, and (iii) a lower bound on the additional training required. Our results are illustrated through experiments in deep learning for tabular and image classification tasks.
zh

[AI-28] Unveil Sources of Uncertainty: Feature Contribution to Conformal Prediction Intervals

【速读】:该论文试图解决现有可解释人工智能(XAI)框架主要关注平均模型预测,而忽视预测不确定性的局限性。其解决方案的关键在于提出一种基于符合性预测(conformal prediction, CP)的模型无关不确定性归因(uncertainty attribution, UA)方法,通过构建以CP区间属性(如宽度和边界)作为价值函数的合作博弈,系统地将预测不确定性归因于输入特征。该方法扩展了传统的Shapley值,采用更丰富的Harsanyi分配,特别是比例Shapley值,按特征重要性比例进行归因,并通过蒙特卡洛近似方法实现计算可行性,同时保证稳健的统计性能。

链接: https://arxiv.org/abs/2505.13118
作者: Marouane Il Idrissi(UQAM, IID),Agathe Fernandes Machado(UQAM),Ewen Gallic(AMSE),Arthur Charpentier(UQAM)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Cooperative game theory methods, notably Shapley values, have significantly enhanced machine learning (ML) interpretability. However, existing explainable AI (XAI) frameworks mainly attribute average model predictions, overlooking predictive uncertainty. This work addresses that gap by proposing a novel, model-agnostic uncertainty attribution (UA) method grounded in conformal prediction (CP). By defining cooperative games where CP interval properties-such as width and bounds-serve as value functions, we systematically attribute predictive uncertainty to input features. Extending beyond the traditional Shapley values, we use the richer class of Harsanyi allocations, and in particular the proportional Shapley values, which distribute attribution proportionally to feature importance. We propose a Monte Carlo approximation method with robust statistical guarantees to address computational feasibility, significantly improving runtime efficiency. Our comprehensive experiments on synthetic benchmarks and real-world datasets demonstrate the practical utility and interpretative depth of our approach. By combining cooperative game theory and conformal prediction, we offer a rigorous, flexible toolkit for understanding and communicating predictive uncertainty in high-stakes ML applications.
zh

[AI-29] Continuous Fair SMOTE – Fairness-Aware Stream Learning from Imbalanced Data

【速读】:该论文旨在解决在线机器学习中数据流的公平性问题以及类别不平衡问题。现有方法通常通过事前或事后处理分别优化特定的歧视度量指标来应对这些问题,但这种方法可能导致算法偏差的引入。本文提出的CFSMOTE是一种面向公平性的连续SMOTE变体,其关键在于通过情境测试和在过采样过程中平衡与公平相关的群体,从而同时解决类别不平衡和公平性问题,避免仅优化单一公平度量指标所带来的潜在权衡问题。

链接: https://arxiv.org/abs/2505.13116
作者: Kathrin Lammers,Valerie Vaquet,Barbara Hammer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As machine learning is increasingly applied in an online fashion to deal with evolving data streams, the fairness of these algorithms is a matter of growing ethical and legal concern. In many use cases, class imbalance in the data also needs to be dealt with to ensure predictive performance. Current fairness-aware stream learners typically attempt to solve these issues through in- or post-processing by focusing on optimizing one specific discrimination metric, addressing class imbalance in a separate processing step. While C-SMOTE is a highly effective model-agnostic pre-processing approach to mitigate class imbalance, as a side effect of this method, algorithmic bias is often introduced. Therefore, we propose CFSMOTE - a fairness-aware, continuous SMOTE variant - as a pre-processing approach to simultaneously address the class imbalance and fairness concerns by employing situation testing and balancing fairness-relevant groups during oversampling. Unlike other fairness-aware stream learners, CFSMOTE is not optimizing for only one specific fairness metric, therefore avoiding potentially problematic trade-offs. Our experiments show significant improvement on several common group fairness metrics in comparison to vanilla C-SMOTE while maintaining competitive performance, also in comparison to other fairness-aware algorithms. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2505.13116 [cs.LG] (or arXiv:2505.13116v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.13116 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-30] Lightweight Transformer via Unrolling of Mixed Graph Algorithms for Traffic Forecast

【速读】:该论文旨在解决具有时空维度的交通预测问题,其核心挑战在于如何有效建模空间相关性和时间序列关系。解决方案的关键在于将基于混合图的优化算法展开为一种轻量且可解释的类Transformer神经网络,通过构建无向图Gu\mathcal{G}^u和有向图Gd\mathcal{G}^d分别捕捉地理空间相关性与时间序列关系,并引入新的2\ell_21\ell_1-范数变分项以量化并促进信号在有向图上的平滑性(低频重构)。此外,通过交替方向乘子法(ADMM)构建迭代算法并展开为前馈网络,实现了数据驱动的参数学习,同时在网络中插入类似Transformer自注意力机制的图学习模块,从而在保持高性能的同时显著减少参数数量。

链接: https://arxiv.org/abs/2505.13102
作者: Ji Qi,Tam Thuc Do,Mingxiao Liu,Zhuoshi Pan,Yuzhe Li,Gene Cheung,H. Vicky Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 19 pages, 5 figures, 8 tables

点击查看摘要

Abstract:To forecast traffic with both spatial and temporal dimensions, we unroll a mixed-graph-based optimization algorithm into a lightweight and interpretable transformer-like neural net. Specifically, we construct two graphs: an undirected graph \mathcalG^u capturing spatial correlations across geography, and a directed graph \mathcalG^d capturing sequential relationships over time. We formulate a prediction problem for the future samples of signal \mathbfx , assuming it is “smooth” with respect to both \mathcalG^u and \mathcalG^d , where we design new \ell_2 and \ell_1 -norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We construct an iterative algorithm based on alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We insert graph learning modules for \mathcalG^u and \mathcalG^d , which are akin to the self-attention mechanism in classical transformers. Experiments show that our unrolled networks achieve competitive traffic forecast performance as state-of-the-art prediction schemes, while reducing parameter counts drastically. Our code is available in this https URL.
zh

[AI-31] me-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation

【速读】:该论文旨在解决现有因果语音分离模型在保留历史信息方面存在的困难,从而导致其性能不如非因果模型的问题。解决方案的关键在于提出一种基于时间-频率注意力缓存记忆(Time-Frequency Attention Cache Memory, TFACM)的模型,该模型通过注意力机制和缓存记忆(Cache Memory, CM)有效捕捉时频域中的时空关系,其中LSTM层用于捕获频率相对位置,因果建模结合局部与全局表示处理时间维度,CM模块存储历史信息,而因果注意力精炼(Causal Attention Refinement, CAR)模块进一步提升时间特征表示的细粒度。

链接: https://arxiv.org/abs/2505.13094
作者: Guo Chen,Kai Li,Runxuan Yang,Xiaolin Hu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Existing causal speech separation models often underperform compared to non-causal models due to difficulties in retaining historical information. To address this, we propose the Time-Frequency Attention Cache Memory (TFACM) model, which effectively captures spatio-temporal relationships through an attention mechanism and cache memory (CM) for historical information storage. In TFACM, an LSTM layer captures frequency-relative positions, while causal modeling is applied to the time dimension using local and global representations. The CM module stores past information, and the causal attention refinement (CAR) module further enhances time-based feature representations for finer granularity. Experimental results showed that TFACM achieveed comparable performance to the SOTA TF-GridNet-Causal model, with significantly lower complexity and fewer trainable parameters. For more details, visit the project page: this https URL.
zh

[AI-32] Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings

【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)的基准测试问题,其核心在于通过图对齐(graph alignment)任务构建一种新的评估方法。图对齐是一个组合优化问题,旨在对齐两个无标签图以最大化重叠边,该问题通过自监督学习框架进行建模,并利用合成随机图和多领域真实世界图数据集生成不同难度级别的图对齐数据集。解决方案的关键在于通过逐步增加数据集难度来评估不同架构的性能,并验证各模型在无监督预训练中的有效性,从而为GNN的性能评估与改进提供新思路。

链接: https://arxiv.org/abs/2505.13087
作者: Adrien Lagesse,Marc Lelarge
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a novel benchmarking methodology for graph neural networks (GNNs) based on the graph alignment problem, a combinatorial optimization task that generalizes graph isomorphism by aligning two unlabeled graphs to maximize overlapping edges. We frame this problem as a self-supervised learning task and present several methods to generate graph alignment datasets using synthetic random graphs and real-world graph datasets from multiple domains. For a given graph dataset, we generate a family of graph alignment datasets with increasing difficulty, allowing us to rank the performance of various architectures. Our experiments indicate that anisotropic graph neural networks outperform standard convolutional architectures. To further demonstrate the utility of the graph alignment task, we show its effectiveness for unsupervised GNN pre-training, where the learned node embeddings outperform other positional encodings on three molecular regression tasks and achieve state-of-the-art results on the PCQM4Mv2 dataset with significantly fewer parameters. To support reproducibility and further research, we provide an open-source Python package to generate graph alignment datasets and benchmark new GNN architectures.
zh

[AI-33] MultiActor-Audiobook: Zero-Shot Audiobook Generation with Faces and Voices of Multiple Speakers

【速读】:该论文旨在解决传统有声书系统在生成过程中存在的问题,包括需要用户手动配置说话人的韵律、句子读音单调以及依赖昂贵的训练过程。其解决方案的关键在于引入了两个创新流程:MSP(Multimodal Speaker Persona Generation)和LSI(LLM-based Script Instruction Generation),通过这两个流程,MultiActor-Audiobook能够在无需额外训练的情况下生成情感丰富且语音风格一致的有声书。

链接: https://arxiv.org/abs/2505.13082
作者: Kyeongman Park,Seongho Joo,Kyomin Jung
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:We introduce MultiActor-Audiobook, a zero-shot approach for generating audiobooks that automatically produces consistent, expressive, and speaker-appropriate prosody, including intonation and emotion. Previous audiobook systems have several limitations: they require users to manually configure the speaker’s prosody, read each sentence with a monotonic tone compared to voice actors, or rely on costly training. However, our MultiActor-Audiobook addresses these issues by introducing two novel processes: (1) MSP (Multimodal Speaker Persona Generation) and (2) LSI (LLM-based Script Instruction Generation). With these two processes, MultiActor-Audiobook can generate more emotionally expressive audiobooks with a consistent speaker prosody without additional training. We compare our system with commercial products, through human and MLLM evaluations, achieving competitive results. Furthermore, we demonstrate the effectiveness of MSP and LSI through ablation studies.
zh

[AI-34] he Hidden Dangers of Browsing AI Agents

【速读】:该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的自主浏览代理在面对动态内容、工具执行和用户提供的数据时所暴露的安全漏洞问题。其关键解决方案是提出一种纵深防御策略,包括输入净化、规划器执行器隔离、形式化分析器和会话防护,以抵御初始访问和后续利用攻击向量。

链接: https://arxiv.org/abs/2505.13076
作者: Mykyta Mudryi,Markiyan Chaklosh,Grzegorz Wójcik
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous browsing agents powered by large language models (LLMs) are increasingly used to automate web-based tasks. However, their reliance on dynamic content, tool execution, and user-provided data exposes them to a broad attack surface. This paper presents a comprehensive security evaluation of such agents, focusing on systemic vulnerabilities across multiple architectural layers. Our work outlines the first end-to-end threat model for browsing agents and provides actionable guidance for securing their deployment in real-world environments. To address discovered threats, we propose a defense in depth strategy incorporating input sanitization, planner executor isolation, formal analyzers, and session safeguards. These measures protect against both initial access and post exploitation attack vectors. Through a white box analysis of a popular open source project, Browser Use, we demonstrate how untrusted web content can hijack agent behavior and lead to critical security breaches. Our findings include prompt injection, domain validation bypass, and credential exfiltration, evidenced by a disclosed CVE and a working proof of concept exploit.
zh

[AI-35] Structure-Aware Corpus Construction and User-Perception-Aligned Metrics for Large-Language-Model Code Completion

【速读】:该论文试图解决当前代码补全评估指标与用户实际感知之间存在的差距,以及大型语言模型(Large Language Models, LLMs)在仓库级代码补全场景中缺乏有效的结构语义建模和跨模块依赖信息的问题。解决方案的关键在于提出两种基于概率建模的评估指标——LCP 和 ROUGE-LCP,以及一种基于结构保留与语义重排序代码图(Structure-Preserving and Semantically-Reordered Code Graph, SPSR-Graph)的数据处理方法,以提升模型性能并增强用户感知一致性。

链接: https://arxiv.org/abs/2505.13073
作者: Dengfeng Liu,Jucai Zhai,Xiaoguang Jiang,Ziqun Li,Qianjin Yu,Feng Liu,Rui Ye,Huang Liu,Zhiguo Yang,Yongsheng Du,Fang Tan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 14 pages,8 figures

点击查看摘要

Abstract:Code completion technology based on large language model has significantly improved the development efficiency of programmers. However, in practical applications, there remains a gap between current commonly used code completion evaluation metrics and users’ actual perception. To address this issue, we propose two evaluation metrics for code completion tasks–LCP and ROUGE-LCP, from the perspective of probabilistic modeling. Furthermore, to tackle the lack of effective structural semantic modeling and cross-module dependency information in LLMs for repository-level code completion scenarios, we propose a data processing method based on a Structure-Preserving and Semantically-Reordered Code Graph (SPSR-Graph). Through theoretical analysis and experimental validation, we demonstrate the superiority of the proposed evaluation metrics in terms of user perception consistency, as well as the effectiveness of the data processing method in enhancing model performance.
zh

[AI-36] CAIM: Development and Evaluation of a Cognitive AI Memory Framework for Long-Term Interaction with Intelligent Agents

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在长期交互中面临的挑战,包括对用户适应性、上下文知识以及对不断变化环境的理解不足。解决方案的关键在于提出一种基于认知人工智能(Cognitive AI)原理的记忆框架CAIM,该框架通过三个模块——记忆控制器、记忆检索和后思考模块——实现高效的信息检索与存储,从而提升模型的上下文感知能力和长期交互性能。

链接: https://arxiv.org/abs/2505.13044
作者: Rebecca Westhäußer,Frederik Berenz,Wolfgang Minker,Sebastian Zepf
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have advanced the field of artificial intelligence (AI) and are a powerful enabler for interactive systems. However, they still face challenges in long-term interactions that require adaptation towards the user as well as contextual knowledge and understanding of the ever-changing environment. To overcome these challenges, holistic memory modeling is required to efficiently retrieve and store relevant information across interaction sessions for suitable responses. Cognitive AI, which aims to simulate the human thought process in a computerized model, highlights interesting aspects, such as thoughts, memory mechanisms, and decision-making, that can contribute towards improved memory modeling for LLMs. Inspired by these cognitive AI principles, we propose our memory framework CAIM. CAIM consists of three modules: 1.) The Memory Controller as the central decision unit; 2.) the Memory Retrieval, which filters relevant data for interaction upon request; and 3.) the Post-Thinking, which maintains the memory storage. We compare CAIM against existing approaches, focusing on metrics such as retrieval accuracy, response correctness, contextual coherence, and memory storage. The results demonstrate that CAIM outperforms baseline frameworks across different metrics, highlighting its context-awareness and potential to improve long-term human-AI interactions.
zh

[AI-37] SPulse: Dual Space Tiny Pre-Trained Models for Rapid Time-Series Analysis

【速读】:该论文旨在解决时间序列预训练模型规模庞大、计算资源消耗高的问题,同时保持模型在分类、异常检测、填补和检索等任务中的高性能。其解决方案的关键在于提出TSPulse,一个参数量仅为1M的超轻量级时间序列预训练模型,通过双空间掩码重构(dual-space masked reconstruction)和双嵌入解耦(dual-embedding disentanglement)机制,从时域和频域学习互补信号,并生成细粒度分析嵌入与高层语义嵌入,增强模型对时间、幅度和噪声变化的鲁棒性。此外,引入TSLens和多头三角化技术以提升任务特定特征注意力和异常检测能力,结合混合掩码预训练策略优化零样本填补性能,从而在多个基准测试中实现显著性能提升。

链接: https://arxiv.org/abs/2505.13033
作者: Vijay Ekambaram,Subodh Kumar,Arindam Jati,Sumanta Mukherjee,Tomoya Sakai,Pankaj Dayama,Wesley M. Gifford,Jayant Kalagnanam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of time-series pre-trained models has advanced temporal representation learning, but current state-of-the-art models are often large-scale, requiring substantial compute. We introduce TSPulse, ultra-compact time-series pre-trained models with only 1M parameters, specialized to perform strongly across classification, anomaly detection, imputation, and retrieval tasks. TSPulse introduces innovations at both the architecture and task levels. At the architecture level, it employs a dual-space masked reconstruction, learning from both time and frequency domains to capture complementary signals. This is further enhanced by a dual-embedding disentanglement, generating both detailed embeddings for fine-grained analysis and high-level semantic embeddings for broader task understanding. Notably, TSPulse’s semantic embeddings are robust to shifts in time, magnitude, and noise, which is important for robust retrieval. At the task level, TSPulse incorporates TSLens, a fine-tuning component enabling task-specific feature attention. It also introduces a multi-head triangulation technique that correlates deviations from multiple prediction heads, enhancing anomaly detection by fusing complementary model outputs. Additionally, a hybrid mask pretraining is proposed to improves zero-shot imputation by reducing pre-training bias. These architecture and task innovations collectively contribute to TSPulse’s significant performance gains: 5-16% on the UEA classification benchmarks, +20% on the TSB-AD anomaly detection leaderboard, +50% in zero-shot imputation, and +25% in time-series retrieval. Remarkably, these results are achieved with just 1M parameters, making TSPulse 10-100X smaller than existing pre-trained models. Its efficiency enables GPU-free inference and rapid pre-training, setting a new standard for efficient time-series pre-trained models. Models will be open-sourced soon.
zh

[AI-38] MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

【速读】:该论文旨在解决当前文本到图像系统在处理多模态输入和复杂推理任务时的局限性。其解决方案的关键在于提出MindOmni,一个统一的多模态大语言模型,通过结合推理生成与强化学习来提升模型能力。核心创新包括:设计一个具有解码器仅架构的统一视觉语言模型,利用链式思维(Chain-of-Thought, CoT)指令数据进行监督微调,并引入一种名为推理生成策略优化(Reasoning Generation Policy Optimization, RGPO)的算法,通过多模态反馈有效指导策略更新。

链接: https://arxiv.org/abs/2505.13031
作者: Yicheng Xiao,Lin Song,Yukang Chen,Yingmin Luo,Yuxin Chen,Yukang Gan,Wei Huang,Xiu Li,Xiaojuan Qi,Ying Shan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:Recent text-to-image systems face limitations in handling multimodal inputs and complex reasoning tasks. We introduce MindOmni, a unified multimodal large language model that addresses these challenges by incorporating reasoning generation through reinforcement learning. MindOmni leverages a three-phase training strategy: i) design of a unified vision language model with a decoder-only diffusion module, ii) supervised fine-tuning with Chain-of-Thought (CoT) instruction data, and iii) our proposed Reasoning Generation Policy Optimization (RGPO) algorithm, utilizing multimodal feedback to effectively guide policy updates. Experimental results demonstrate that MindOmni outperforms existing models, achieving impressive performance on both understanding and generation benchmarks, meanwhile showcasing advanced fine-grained reasoning generation capabilities, especially with mathematical reasoning instruction. All codes will be made public at \hrefthis https URLthis https URL.
zh

[AI-39] Step-wise Adaptive Integration of Supervised Fine-tuning and Reinforcement Learning for Task-Specific LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在数学推理和逻辑问题求解中,因单独使用监督微调(Supervised Fine-Tuning, SFT)或强化学习(Reinforcement Learning, RL)所导致的过拟合或模式崩溃等问题,以及现有混合训练方法在任务泛化性和数据依赖性方面的不足。其解决方案的关键在于提出一种分步自适应混合训练框架(Step-wise Adaptive Hybrid Training Framework, SASR),该框架理论上统一了SFT与RL,并通过梯度范数和相对于原始分布的差异动态调整两者比例,实现SFT与在线RL方法GRPO的无缝融合,从而保证模型在保持核心推理能力的同时探索不同路径。

链接: https://arxiv.org/abs/2505.13026
作者: Jack Chen,Fazhong Liu,Naruto Liu,Yuhan Luo,Erqu Qin,Harry Zheng,Tian Dong,Haojin Zhu,Yan Meng,Xiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel at mathematical reasoning and logical problem-solving. The current popular training paradigms primarily use supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance the models’ reasoning abilities. However, when using SFT or RL alone, there are respective challenges: SFT may suffer from overfitting, while RL is prone to mode collapse. The state-of-the-art methods have proposed hybrid training schemes. However, static switching faces challenges such as poor generalization across different tasks and high dependence on data quality. In response to these challenges, inspired by the curriculum learning-quiz mechanism in human reasoning cultivation, We propose SASR, a step-wise adaptive hybrid training framework that theoretically unifies SFT and RL and dynamically balances the two throughout optimization. SASR uses SFT for initial warm-up to establish basic reasoning skills, and then uses an adaptive dynamic adjustment algorithm based on gradient norm and divergence relative to the original distribution to seamlessly integrate SFT with the online RL method GRPO. By monitoring the training status of LLMs and adjusting the training process in sequence, SASR ensures a smooth transition between training schemes, maintaining core reasoning abilities while exploring different paths. Experimental results demonstrate that SASR outperforms SFT, RL, and static hybrid training methods.
zh

[AI-40] LiBOG: Lifelong Learning for Black-Box Optimizer Generation IJCAI2025

【速读】:该论文旨在解决传统Meta-Black-Box Optimization (MetaBBO)方法在现实场景中面临的问题,即其假设存在静态且具有代表性的训练问题分布,而这一假设在实际应用中往往不成立。为应对不断变化的问题分布,论文提出了一种新的终身学习范式,并引入LiBOG方法,该方法能够从连续遇到的问题中学习并生成高性能的黑盒优化器(BBO)。LiBOG的关键在于通过跨任务和任务内知识整合来缓解灾难性遗忘,从而在保持学习新任务的灵活性的同时,有效提升优化器性能。

链接: https://arxiv.org/abs/2505.13025
作者: Jiyuan Pei,Yi Mei,Jialin Liu,Mengjie Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at IJCAI 2025. To appear

点击查看摘要

Abstract:Meta-Black-Box Optimization (MetaBBO) garners attention due to its success in automating the configuration and generation of black-box optimizers, significantly reducing the human effort required for optimizer design and discovering optimizers with higher performance than classic human-designed optimizers. However, existing MetaBBO methods conduct one-off training under the assumption that a stationary problem distribution with extensive and representative training problem samples is pre-available. This assumption is often impractical in real-world scenarios, where diverse problems following shifting distribution continually arise. Consequently, there is a pressing need for methods that can continuously learn from new problems encountered on-the-fly and progressively enhance their capabilities. In this work, we explore a novel paradigm of lifelong learning in MetaBBO and introduce LiBOG, a novel approach designed to learn from sequentially encountered problems and generate high-performance optimizers for Black-Box Optimization (BBO). LiBOG consolidates knowledge both across tasks and within tasks to mitigate catastrophic forgetting. Extensive experiments demonstrate LiBOG’s effectiveness in learning to generate high-performance optimizers in a lifelong learning manner, addressing catastrophic forgetting while maintaining plasticity to learn new tasks.
zh

[AI-41] Unveiling and Steering Connectome Organization with Interpretable Latent Variables

【速读】:该论文试图解决大脑复杂连接组(connectome)如何从有限的遗传编码中形成,并揭示其潜在的低维组织原则的问题。解决方案的关键在于提出一种结合果蝇连接组(FlyWire)子图提取与生成模型的框架,以获得可解释的神经回路低维表示,并通过可解释性模块将这些潜在维度与特定结构特征关联,从而揭示其功能相关性。

链接: https://arxiv.org/abs/2505.13011
作者: Yubin Li,Xingyu Liu,Guozhang Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The brain’s intricate connectome, a blueprint for its function, presents immense complexity, yet it arises from a compact genetic code, hinting at underlying low-dimensional organizational principles. This work bridges connectomics and representation learning to uncover these principles. We propose a framework that combines subgraph extraction from the Drosophila connectome, FlyWire, with a generative model to derive interpretable low-dimensional representations of neural circuitry. Crucially, an explainability module links these latent dimensions to specific structural features, offering insights into their functional relevance. We validate our approach by demonstrating effective graph reconstruction and, significantly, the ability to manipulate these latent codes to controllably generate connectome subgraphs with predefined properties. This research offers a novel tool for understanding brain architecture and a potential avenue for designing bio-inspired artificial neural networks.
zh

[AI-42] From Assistants to Adversaries: Exploring the Security Risks of Mobile LLM Agents

【速读】:该论文试图解决移动设备上基于大语言模型(Large Language Models, LLMs)的AI代理在安全方面的潜在风险问题,特别是针对其在语言推理、图形用户界面交互和系统级执行三个核心能力维度中的安全威胁。解决方案的关键在于提出了一种半自动化安全分析框架——AgentScan,该框架能够系统性地评估移动LLM代理在11个攻击场景下的安全性,并通过实际测试揭示了当前广泛部署的代理均存在被针对性攻击的漏洞,从而为构建安全的移动LLM代理提供了防御设计原则和实践建议。

链接: https://arxiv.org/abs/2505.12981
作者: Liangxuan Wu,Chao Wang,Tianming Liu,Yanjie Zhao,Haoyu Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:The growing adoption of large language models (LLMs) has led to a new paradigm in mobile computing–LLM-powered mobile AI agents–capable of decomposing and automating complex tasks directly on smartphones. However, the security implications of these agents remain largely unexplored. In this paper, we present the first comprehensive security analysis of mobile LLM agents, encompassing three representative categories: System-level AI Agents developed by original equipment manufacturers (e.g., YOYO Assistant), Third-party Universal Agents (e.g., Zhipu AI AutoGLM), and Emerging Agent Frameworks (e.g., Alibaba Mobile Agent). We begin by analyzing the general workflow of mobile agents and identifying security threats across three core capability dimensions: language-based reasoning, GUI-based interaction, and system-level execution. Our analysis reveals 11 distinct attack surfaces, all rooted in the unique capabilities and interaction patterns of mobile LLM agents, and spanning their entire operational lifecycle. To investigate these threats in practice, we introduce AgentScan, a semi-automated security analysis framework that systematically evaluates mobile LLM agents across all 11 attack scenarios. Applying AgentScan to nine widely deployed agents, we uncover a concerning trend: every agent is vulnerable to targeted attacks. In the most severe cases, agents exhibit vulnerabilities across eight distinct attack vectors. These attacks can cause behavioral deviations, privacy leakage, or even full execution hijacking. Based on these findings, we propose a set of defensive design principles and practical recommendations for building secure mobile LLM agents. Our disclosures have received positive feedback from two major device vendors. Overall, this work highlights the urgent need for standardized security practices in the fast-evolving landscape of LLM-driven mobile automation.
zh

[AI-43] Hardware-Adaptive and Superlinear-Capacity Memristor-based Associative Memory

【速读】:该论文旨在解决传统硬件实现的霍普菲尔德神经网络(Hopfield Neural Networks, HNNs)在效率、缺陷容忍度和存储容量方面的瓶颈问题,以及基于忆阻器的HNN在离线训练下对硬件缺陷的敏感性、存储容量受限和处理模拟模式的困难。其解决方案的关键在于提出一种新型的硬件自适应学习算法,该算法显著提高了缺陷容忍度和存储容量,并自然扩展至可处理二进制和连续模式的可扩展多层架构,从而实现了比现有方法更高的有效容量和更优的性能表现。

链接: https://arxiv.org/abs/2505.12960
作者: Chengping He,Mingrui Jiang,Keyi Shan,Szu-Hao Yang,Zefan Li,Shengbo Wang,Giacomo Pedretti,Jim Ignowski,Can Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Brain-inspired computing aims to mimic cognitive functions like associative memory, the ability to recall complete patterns from partial cues. Memristor technology offers promising hardware for such neuromorphic systems due to its potential for efficient in-memory analog computing. Hopfield Neural Networks (HNNs) are a classic model for associative memory, but implementations on conventional hardware suffer from efficiency bottlenecks, while prior memristor-based HNNs faced challenges with vulnerability to hardware defects due to offline training, limited storage capacity, and difficulty processing analog patterns. Here we introduce and experimentally demonstrate on integrated memristor hardware a new hardware-adaptive learning algorithm for associative memories that significantly improves defect tolerance and capacity, and naturally extends to scalable multilayer architectures capable of handling both binary and continuous patterns. Our approach achieves 3x effective capacity under 50% device faults compared to state-of-the-art methods. Furthermore, its extension to multilayer architectures enables superlinear capacity scaling ((\propto N^1.49\ for binary patterns) and effective recalling of continuous patterns (\propto N^1.74\ scaling), as compared to linear capacity scaling for previous HNNs. It also provides flexibility to adjust capacity by tuning hidden neurons for the same-sized patterns. By leveraging the massive parallelism of the hardware enabled by synchronous updates, it reduces energy by 8.8x and latency by 99.7% for 64-dimensional patterns over asynchronous schemes, with greater improvements at scale. This promises the development of more reliable memristor-based associative memory systems and enables new applications research due to the significantly improved capacity, efficiency, and flexibility.
zh

[AI-44] DGRO: Enhancing LLM Reasoning via Exploration-Exploitation Control and Reward Variance Management

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)过程中,由于手动设计的奖励函数存在探索与利用之间的权衡问题,导致理论和实验上的影响尚未充分研究。其解决方案的关键在于提出一种通用的RL算法——解耦组奖励优化(Decoupled Group Reward Optimization, DGRO),通过将传统的正则化系数解耦为两个独立的超参数,分别控制策略梯度项和采样策略的距离,从而实现对探索与利用的精确控制,并可无缝扩展至其他算法如Online Policy Mirror Descent (OPMD)。此外,DGRO还通过理论分析和大量实验验证了奖励方差对收敛速度和模型性能的影响。

链接: https://arxiv.org/abs/2505.12951
作者: Xuerui Su,Liya Guo,Yue Wang,Yi Zhu,Zhiming Ma,Zun Wang,Yuting Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inference scaling further accelerates Large Language Models (LLMs) toward Artificial General Intelligence (AGI), with large-scale Reinforcement Learning (RL) to unleash long Chain-of-Thought reasoning. Most contemporary reasoning approaches usually rely on handcrafted rule-based reward functions. However, the tarde-offs of exploration and exploitation in RL algorithms involves multiple complex considerations, and the theoretical and empirical impacts of manually designed reward functions remain insufficiently explored. In this paper, we propose Decoupled Group Reward Optimization (DGRO), a general RL algorithm for LLM reasoning. On the one hand, DGRO decouples the traditional regularization coefficient into two independent hyperparameters: one scales the policy gradient term, and the other regulates the distance from the sampling policy. This decoupling not only enables precise control over balancing exploration and exploitation, but also can be seamlessly extended to Online Policy Mirror Descent (OPMD) algorithms in Kimi k1.5 and Direct Reward Optimization. On the other hand, we observe that reward variance significantly affects both convergence speed and final model performance. We conduct both theoretical analysis and extensive empirical validation to assess DGRO, including a detailed ablation study that investigates its performance and optimization dynamics. Experimental results show that DGRO achieves state-of-the-art performance on the Logic dataset with an average accuracy of 96.9%, and demonstrates strong generalization across mathematical benchmarks.
zh

[AI-45] CPRet: A Dataset Benchmark and Model for Retrieval in Competitive Programming

【速读】:该论文旨在解决竞争性编程基准中重复或高度相似问题日益增多所带来的公平性和有效性问题。其关键解决方案是提出一种新的问题——相似问题检索,并构建了一个名为CPRet的面向检索的竞争性编程基准套件,涵盖四种检索任务,包括两种以代码为中心的任务(即Text-to-Code和Code-to-Code)以及两种新提出的以问题为中心的任务(即Problem-to-Duplicate和Simplified-to-Full),该数据集结合了自动爬取的问题-解决方案数据和人工标注。此外,基于该数据集开发了两种任务专用的检索器,分别针对问题-代码对齐和问题级相似性识别,取得了良好的效果。

链接: https://arxiv.org/abs/2505.12925
作者: Han Deng,Yuan Meng,Shixiang Tang,Wanli Ouyang,Xinzhu Ma
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: main 9 pages

点击查看摘要

Abstract:Competitive programming benchmarks are widely used in scenarios such as programming contests and large language model assessments. However, the growing presence of duplicate or highly similar problems raises concerns not only about competition fairness, but also about the validity of competitive programming as a benchmark for model evaluation. In this paper, we propose a new problem – similar question retrieval – to address this issue. Due to the lack of both data and models, solving this problem is challenging. To this end, we introduce CPRet, a retrieval-oriented benchmark suite for competitive programming, covering four retrieval tasks: two code-centric (i.e., Text-to-Code and Code-to-Code) and two newly proposed problem-centric tasks (i.e., Problem-to-Duplicate and Simplified-to-Full), built from a combination of automatically crawled problem-solution data and manually curated annotations. Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation. In addition, we develop two task-specialized retrievers based on this dataset: CPRetriever-Code, trained with a novel Group-InfoNCE loss for problem-code alignment, and CPRetriever-Prob, fine-tuned for identifying problem-level similarity. Both models achieve strong results and are open-sourced for local use. Finally, we analyze LiveCodeBench and find that high-similarity problems inflate model pass rates and reduce differentiation, underscoring the need for similarity-aware evaluation in future benchmarks. Code and data are available at: this https URL Comments: main 9 pages Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) ACMclasses: H.3.3 Cite as: arXiv:2505.12925 [cs.SE] (or arXiv:2505.12925v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2505.12925 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-46] he Traitors: Deception and Trust in Multi-Agent Language Model Simulations

【速读】:该论文试图解决在需要信任和与人类价值观对齐的场景中,人工智能系统(AI systems)何时以及为何会采取欺骗行为的问题。其解决方案的关键在于引入名为“The Traitors”的多智能体仿真框架,该框架受社交推理游戏启发,旨在探究大语言模型(LLM)代理在信息不对称条件下的欺骗、信任形成及战略沟通行为。该框架基于博弈论、行为经济学和社会认知的正式理论框架,并提供了一套评估指标以衡量欺骗成功率、信任动态及集体推理质量,同时构建了一个全自主的仿真平台,支持异构代理群体、特定属性和自适应行为。

链接: https://arxiv.org/abs/2505.12923
作者: Pedro M. P. Curvo
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 9 main pages, 31 pages

点击查看摘要

Abstract:As AI systems increasingly assume roles where trust and alignment with human values are essential, understanding when and why they engage in deception has become a critical research priority. We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games, designed to probe deception, trust formation, and strategic communication among large language model (LLM) agents under asymmetric information. A minority of agents the traitors seek to mislead the majority, while the faithful must infer hidden identities through dialogue and reasoning. Our contributions are: (1) we ground the environment in formal frameworks from game theory, behavioral economics, and social cognition; (2) we develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality; (3) we implement a fully autonomous simulation platform where LLMs reason over persistent memory and evolving social dynamics, with support for heterogeneous agent populations, specialized traits, and adaptive behaviors. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry: advanced models like GPT-4o demonstrate superior deceptive capabilities yet exhibit disproportionate vulnerability to others’ falsehoods. This suggests deception skills may scale faster than detection abilities. Overall, The Traitors provides a focused, configurable testbed for investigating LLM behavior in socially nuanced interactions. We position this work as a contribution toward more rigorous research on deception mechanisms, alignment challenges, and the broader social reliability of AI systems.
zh

[AI-47] SourceDetMamba: A Graph-aware State Space Model for Source Detection in Sequential Hypergraphs IJCAI25

【速读】:该论文旨在解决图结构中谣言源检测的问题,特别是现有基于机器学习的方法难以捕捉谣言传播的内在动态特性。其解决方案的关键在于提出SourceDetMamba:一种面向序列超图的图感知状态空间模型,该模型利用状态空间模型Mamba的全局建模能力和计算效率,通过超图建模社交网络中的高阶交互,并将传播过程中的时间网络快照以逆序输入Mamba以推断传播动态,同时引入一种新的图感知状态更新机制,使节点状态在时间依赖性和拓扑上下文的共同作用下得到传播与优化。

链接: https://arxiv.org/abs/2505.12910
作者: Le Cheng,Peican Zhu,Yangming Guo,Chao Gao,Zhen Wang,Keke Tang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI25

点击查看摘要

Abstract:Source detection on graphs has demonstrated high efficacy in identifying rumor origins. Despite advances in machine learning-based methods, many fail to capture intrinsic dynamics of rumor propagation. In this work, we present SourceDetMamba: A Graph-aware State Space Model for Source Detection in Sequential Hypergraphs, which harnesses the recent success of the state space model Mamba, known for its superior global modeling capabilities and computational efficiency, to address this challenge. Specifically, we first employ hypergraphs to model high-order interactions within social networks. Subsequently, temporal network snapshots generated during the propagation process are sequentially fed in reverse order into Mamba to infer underlying propagation dynamics. Finally, to empower the sequential model to effectively capture propagation patterns while integrating structural information, we propose a novel graph-aware state update mechanism, wherein the state of each node is propagated and refined by both temporal dependencies and topological context. Extensive evaluations on eight datasets demonstrate that SourceDetMamba consistently outperforms state-of-the-art approaches.
zh

[AI-48] Sinusoidal Initialization Time for a New Start

【速读】:该论文旨在解决深度神经网络(Deep Neural Network)训练中初始化方法存在的问题,特别是传统初始化方法如Glorot和He初始化依赖随机性,可能导致层间权重分布不均的问题。解决方案的关键在于提出了一种新的确定性初始化方法——正弦初始化(Sinusoidal initialization),该方法利用正弦函数构造结构化的权重矩阵,以改善网络中权重的分布与平衡,并促进神经元激活状态在首次前向传播时即具有更均匀、条件良好的分布。通过将随机性替换为结构性,该方法提升了模型的收敛速度、训练稳定性和最终准确性。

链接: https://arxiv.org/abs/2505.12909
作者: Alberto Fernández-Hernández,Jose I. Mestre,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Initialization plays a critical role in Deep Neural Network training, directly influencing convergence, stability, and generalization. Common approaches such as Glorot and He initializations rely on randomness, which can produce uneven weight distributions across layer connections. In this paper, we introduce the Sinusoidal initialization, a novel deterministic method that employs sinusoidal functions to construct structured weight matrices expressly to improve the spread and balance of weights throughout the network while simultaneously fostering a more uniform, well-conditioned distribution of neuron activation states from the very first forward pass. Because Sinusoidal initialization begins with weights and activations that are already evenly and efficiently utilized, it delivers consistently faster convergence, greater training stability, and higher final accuracy across a wide range of models, including convolutional neural networks, vision transformers, and large language models. On average, our experiments show an increase of 4.8 % in final validation accuracy and 20.9 % in convergence speed. By replacing randomness with structure, this initialization provides a stronger and more reliable foundation for Deep Learning systems.
zh

[AI-49] he Computation of Generalized Embeddings for Underwater Acoustic Target Recognition using Contrastive Learning

【速读】:该论文试图解决海洋环境中声污染监测的问题,特别是通过自动声音分类来识别噪声源,如船舶活动和海洋哺乳动物叫声。当前的解决方案依赖于监督学习,需要大量高质量的标记数据,而这类数据并不公开可用。该研究的关键在于探索利用大量低质量但公开的未标记数据,采用无监督对比学习(Contrastive Learning)方法,通过变分不变协方差正则化损失函数优化基于Conformer的编码器,从而生成具有鲁棒性和泛化能力的嵌入表示。

链接: https://arxiv.org/abs/2505.12904
作者: Hilde I. Hummel,Arwin Gansekoele,Sandjai Bhulai,Rob van der Mei
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The increasing level of sound pollution in marine environments poses an increased threat to ocean health, making it crucial to monitor underwater noise. By monitoring this noise, the sources responsible for this pollution can be mapped. Monitoring is performed by passively listening to these sounds. This generates a large amount of data records, capturing a mix of sound sources such as ship activities and marine mammal vocalizations. Although machine learning offers a promising solution for automatic sound classification, current state-of-the-art methods implement supervised learning. This requires a large amount of high-quality labeled data that is not publicly available. In contrast, a massive amount of lower-quality unlabeled data is publicly available, offering the opportunity to explore unsupervised learning techniques. This research explores this possibility by implementing an unsupervised Contrastive Learning approach. Here, a Conformer-based encoder is optimized by the so-called Variance-Invariance-Covariance Regularization loss function on these lower-quality unlabeled data and the translation to the labeled data is made. Through classification tasks involving recognizing ship types and marine mammal vocalizations, our method demonstrates to produce robust and generalized embeddings. This shows to potential of unsupervised methods for various automatic underwater acoustic analysis tasks.
zh

[AI-50] HyperDet: Source Detection in Hypergraphs via Interactive Relationship Construction and Feature-rich Attention Fusion IJCAI25

【速读】:该论文旨在解决社会网络中谣言传播的谣言源检测问题,特别是针对现有方法主要关注二元互动而无法充分建模更复杂关系结构的不足。其解决方案的关键在于提出一种名为HyperDet的方法,通过交互关系构建和特征丰富的注意力融合实现对高阶关系的精确建模,从而有效学习节点表示。

链接: https://arxiv.org/abs/2505.12894
作者: Le Cheng,Peican Zhu,Yangming Guo,Keke Tang,Chao Gao,Zhen Wang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI25

点击查看摘要

Abstract:Hypergraphs offer superior modeling capabilities for social networks, particularly in capturing group phenomena that extend beyond pairwise interactions in rumor propagation. Existing approaches in rumor source detection predominantly focus on dyadic interactions, which inadequately address the complexity of more intricate relational structures. In this study, we present a novel approach for Source Detection in Hypergraphs (HyperDet) via Interactive Relationship Construction and Feature-rich Attention Fusion. Specifically, our methodology employs an Interactive Relationship Construction module to accurately model both the static topology and dynamic interactions among users, followed by the Feature-rich Attention Fusion module, which autonomously learns node features and discriminates between nodes using a self-attention mechanism, thereby effectively learning node representations under the framework of accurately modeled higher-order relationships. Extensive experimental validation confirms the efficacy of our HyperDet approach, showcasing its superiority relative to current state-of-the-art methods.
zh

[AI-51] PhyDA: Physics-Guided Diffusion Models for Data Assimilation in Atmospheric Systems

【速读】:该论文试图解决传统数据驱动的资料同化(Data Assimilation, DA)方法在处理复杂大气动力学时可能产生的物理不一致性问题,这些问题会损害下游应用的效果。解决方案的关键在于提出PhyDA,一个基于物理规律引导的扩散框架,其核心包括:(1)一种物理正则化的扩散目标,通过惩罚与已知物理定律(以偏微分方程形式表达)的偏差来将物理约束整合到训练过程中;(2)一种虚拟重构编码器,用于弥补观测稀疏性并生成结构化的潜在表示,从而增强模型推断完整且物理一致状态的能力。

链接: https://arxiv.org/abs/2505.12882
作者: Hao Wang,Jindong Han,Wei Fan,Weijia Zhang,Hao Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data Assimilation (DA) plays a critical role in atmospheric science by reconstructing spatially continous estimates of the system state, which serves as initial conditions for scientific analysis. While recent advances in diffusion models have shown great potential for DA tasks, most existing approaches remain purely data-driven and often overlook the physical laws that govern complex atmospheric dynamics. As a result, they may yield physically inconsistent reconstructions that impair downstream applications. To overcome this limitation, we propose PhyDA, a physics-guided diffusion framework designed to ensure physical coherence in atmospheric data assimilation. PhyDA introduces two key components: (1) a Physically Regularized Diffusion Objective that integrates physical constraints into the training process by penalizing deviations from known physical laws expressed as partial differential equations, and (2) a Virtual Reconstruction Encoder that bridges observational sparsity for structured latent representations, further enhancing the model’s ability to infer complete and physically coherent states. Experiments on the ERA5 reanalysis dataset demonstrate that PhyDA achieves superior accuracy and better physical plausibility compared to state-of-the-art baselines. Our results emphasize the importance of combining generative modeling with domain-specific physical knowledge and show that PhyDA offers a promising direction for improving real-world data assimilation systems.
zh

[AI-52] AdS-GNN – a Conformally Equivariant Graph Neural Network

【速读】:该论文旨在解决如何构建一个对一般共形变换(conformal transformations)具有等变性的神经网络的问题。其解决方案的关键在于将数据从平坦的欧几里得空间提升到反德西特(Anti de Sitter, AdS)空间,从而利用平坦空间共形变换与AdS空间等距变换之间的已知对应关系,并结合几何深度学习中对一般几何上等距变换的广泛研究,最终通过基于固有距离的消息传递层实现计算高效的框架。

链接: https://arxiv.org/abs/2505.12880
作者: Maksim Zhdanov,Nabil Iqbal,Erik Bekkers,Patrick Forré
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Theory (hep-th)
备注:

点击查看摘要

Abstract:Conformal symmetries, i.e.\ coordinate transformations that preserve angles, play a key role in many fields, including physics, mathematics, computer vision and (geometric) machine learning. Here we build a neural network that is equivariant under general conformal transformations. To achieve this, we lift data from flat Euclidean space to Anti de Sitter (AdS) space. This allows us to exploit a known correspondence between conformal transformations of flat space and isometric transformations on the AdS space. We then build upon the fact that such isometric transformations have been extensively studied on general geometries in the geometric deep learning literature. We employ message-passing layers conditioned on the proper distance, yielding a computationally efficient framework. We validate our model on tasks from computer vision and statistical physics, demonstrating strong performance, improved generalization capacities, and the ability to extract conformal data such as scaling dimensions from the trained network.
zh

[AI-53] From Grunts to Grammar: Emergent Language from Cooperative Forag ing

【速读】:该论文试图解决语言如何从早期人类合作的生态和社会需求中演化而来的问题,以及语言如何在多智能体协作环境中产生、适应并成为团队协作的关键工具。其解决方案的关键在于构建一个基于多智能体觅食游戏(Foraging Games)的实验框架,通过端到端深度强化学习方法,使智能体在部分可观测的环境和有限的先验知识下,自主学习动作与通信策略,从而生成具有自然语言特征的通信协议。

链接: https://arxiv.org/abs/2505.12872
作者: Maytus Piriyajitakonkij,Rujikorn Charakorn,Weicheng Tao,Wei Pan,Mingfei Sun,Cheston Tan,Mengmi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Early cavemen relied on gestures, vocalizations, and simple signals to coordinate, plan, avoid predators, and share resources. Today, humans collaborate using complex languages to achieve remarkable results. What drives this evolution in communication? How does language emerge, adapt, and become vital for teamwork? Understanding the origins of language remains a challenge. A leading hypothesis in linguistics and anthropology posits that language evolved to meet the ecological and social demands of early human cooperation. Language did not arise in isolation, but through shared survival goals. Inspired by this view, we investigate the emergence of language in multi-agent Foraging Games. These environments are designed to reflect the cognitive and ecological constraints believed to have influenced the evolution of communication. Agents operate in a shared grid world with only partial knowledge about other agents and the environment, and must coordinate to complete games like picking up high-value targets or executing temporally ordered actions. Using end-to-end deep reinforcement learning, agents learn both actions and communication strategies from scratch. We find that agents develop communication protocols with hallmark features of natural language: arbitrariness, interchangeability, displacement, cultural transmission, and compositionality. We quantify each property and analyze how different factors, such as population size and temporal dependencies, shape specific aspects of the emergent language. Our framework serves as a platform for studying how language can evolve from partial observability, temporal reasoning, and cooperative goals in embodied multi-agent settings. We will release all data, code, and models publicly.
zh

[AI-54] Outsourced Privacy-Preserving Feature Selection Based on Fully Homomorphic Encryption

【速读】:该论文试图解决在多方数据所有者或数据所有者与分析者不同的情况下,如何在进行特征选择时保护数据隐私的问题。现有私有特征选择算法通常需要多个计算方,并且无法在缺乏完全信任的外部第三方环境中保证安全性。解决方案的关键在于提出一种基于全同态加密的特征选择外包算法,该算法首次实现了在无需信任外部第三方的情况下进行安全的特征选择,从而显著降低了时间复杂度和空间复杂度。

链接: https://arxiv.org/abs/2505.12869
作者: Koki Wakiyama,Tomohiro I,Hiroshi Sakamoto
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 14 pages

点击查看摘要

Abstract:Feature selection is a technique that extracts a meaningful subset from a set of features in training data. When the training data is large-scale, appropriate feature selection enables the removal of redundant features, which can improve generalization performance, accelerate the training process, and enhance the interpretability of the model. This study proposes a privacy-preserving computation model for feature selection. Generally, when the data owner and analyst are the same, there is no need to conceal the private information. However, when they are different parties or when multiple owners exist, an appropriate privacy-preserving framework is required. Although various private feature selection algorithms, they all require two or more computing parties and do not guarantee security in environments where no external party can be fully trusted. To address this issue, we propose the first outsourcing algorithm for feature selection using fully homomorphic encryption. Compared to a prior two-party algorithm, our result improves the time and space complexity O(kn^2) to O(kn log^3 n) and O(kn), where k and n denote the number of features and data samples, respectively. We also implemented the proposed algorithm and conducted comparative experiments with the naive one. The experimental result shows the efficiency of our method even with small datasets.
zh

[AI-55] FLTG: Byzantine-Robust Federated Learning via Angle-Based Defense and Non-IID-Aware Weighting

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中模型聚合阶段的拜占庭攻击问题,此类攻击通过操纵恶意客户端的更新来威胁训练的完整性。现有方法在高比例恶意客户端场景下表现出有限的鲁棒性,并且对非独立同分布(non-i.i.d.)数据敏感,导致性能下降。论文提出的FLTG算法通过结合基于角度的防御机制和动态参考选择策略作为解决方案的关键,首先利用ReLU截断余弦相似度过滤客户端,其次根据先前全局模型动态选择参考客户端以减轻非独立同分布偏差,并按角度偏差反比分配聚合权重,同时归一化更新幅度以抑制恶意缩放,从而在多种复杂数据集和经典攻击场景下展现出优越的鲁棒性。

链接: https://arxiv.org/abs/2505.12851
作者: Yanhua Wen,Lu Ai,Gang Liu,Chuang Li,Jianhao Wei
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures, BlockSys2025

点击查看摘要

Abstract:Byzantine attacks during model aggregation in Federated Learning (FL) threaten training integrity by manipulating malicious clients’ updates. Existing methods struggle with limited robustness under high malicious client ratios and sensitivity to non-i.i.d. data, leading to degraded accuracy. To address this, we propose FLTG, a novel aggregation algorithm integrating angle-based defense and dynamic reference selection. FLTG first filters clients via ReLU-clipped cosine similarity, leveraging a server-side clean dataset to exclude misaligned updates. It then dynamically selects a reference client based on the prior global model to mitigate non-i.i.d. bias, assigns aggregation weights inversely proportional to angular deviations, and normalizes update magnitudes to suppress malicious scaling. Evaluations across datasets of varying complexity under five classic attacks demonstrate FLTG’s superiority over state-of-the-art methods under extreme bias scenarios and sustains robustness with a higher proportion(over 50%) of malicious clients.
zh

[AI-56] Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks

【速读】:该论文试图解决当前基于强化学习的人类反馈(RLHF)在处理复杂多指令任务时表现出的合规能力不足问题,以及传统方法依赖人工标注或大型语言模型所带来的资源消耗和潜在偏见问题。其解决方案的关键在于识别现有技术对提示输入中隐含信号的忽视,以及仅关注样本内偏好差异而忽略样本间偏好差异的局限性,并提出一种多层级感知偏好学习(MAPL)框架,通过构建不同条件下的多样化提示和合成多指令偏好对,从而提升模型在多指令任务中的表现。

链接: https://arxiv.org/abs/2505.12845
作者: Ruopei Sun,Jianfeng Cai,Jinhua Zhu,Kangwen Zhao,Dongyun Xue,Wengang Zhou,Li Li,Houqiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences, demonstrating exceptional and measurable efficacy in instruction following tasks; however, it exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. Conventional approaches rely heavily on human annotation or more sophisticated large language models, thereby introducing substantial resource expenditure or potential bias concerns. Meanwhile, alternative synthetic methods that augment standard preference datasets often compromise the model’s semantic quality. Our research identifies a critical oversight in existing techniques, which predominantly focus on comparing responses while neglecting valuable latent signals embedded within prompt inputs, and which only focus on preference disparities at the intra-sample level, while neglecting to account for the inter-sample level preference differentials that exist among preference data. To leverage these previously neglected indicators, we propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities. Specifically, for any given response in original preference data pairs, we construct varied prompts with a preference relation under different conditions, in order to learn intra-sample level preference disparities. Furthermore, for any given original preference pair, we synthesize multi-instruction preference pairs to capture preference discrepancies at the inter-sample level. Building on the two datasets constructed above, we consequently devise two sophisticated training objective functions. Subsequently, our framework integrates seamlessly into both Reward Modeling and Direct Preference Optimization paradigms. Through rigorous evaluation across multiple benchmarks, we empirically validate the efficacy of our framework.
zh

[AI-57] AGI-Elo: How Far Are We From Mastering A Task?

【速读】:该论文试图解决当前评估框架在衡量人工智能模型(或人类)能力时过于依赖聚合性能指标,而缺乏对任务难度和模型能力之间复杂关系的深入理解的问题。其解决方案的关键在于提出一种统一的评分系统,该系统通过联合建模个体测试用例的难度与模型(或人类)的能力,实现对视觉、语言和行动领域任务的细粒度、难度感知的评估。该方法通过模型与任务之间的竞争性交互,捕捉现实世界挑战的长尾分布以及当前模型与全任务掌握之间的能力差距。

链接: https://arxiv.org/abs/2505.12844
作者: Shuo Sun,Yimin Zhao,Christina Dao Wen Lee,Jiawei Sun,Chengran Yuan,Zefan Huang,Dongen Li,Justin KW Yeoh,Alok Prakash,Thomas W. Malone,Marcelo H. Ang Jr
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:As the field progresses toward Artificial General Intelligence (AGI), there is a pressing need for more comprehensive and insightful evaluation frameworks that go beyond aggregate performance metrics. This paper introduces a unified rating system that jointly models the difficulty of individual test cases and the competency of AI models (or humans) across vision, language, and action domains. Unlike existing metrics that focus solely on models, our approach allows for fine-grained, difficulty-aware evaluations through competitive interactions between models and tasks, capturing both the long-tail distribution of real-world challenges and the competency gap between current models and full task mastery. We validate the generalizability and robustness of our system through extensive experiments on multiple established datasets and models across distinct AGI domains. The resulting rating distributions offer novel perspectives and interpretable insights into task difficulty, model progression, and the outstanding challenges that remain on the path to achieving full AGI task mastery.
zh

[AI-58] Bias Fitting to Mitigate Length Bias of Reward Model in RLHF

【速读】:该论文试图解决强化学习与人类反馈(Reinforcement Learning from Human Feedback, RLHF)中奖励模型存在的长度偏差(length bias)问题,即奖励模型倾向于偏好较长的回复,而忽视实际回复质量。解决方案的关键在于提出FiMi-RM(Bias Fitting to Mitigate Length Bias of Reward Model in RLHF)框架,该框架通过三个阶段自主学习并校正潜在的偏差模式:首先训练包含长度偏差的标准奖励模型,其次部署轻量级拟合模型以捕捉长度与奖励之间的非线性关系,最后将学习到的关系整合到奖励模型中以实现去偏。

链接: https://arxiv.org/abs/2505.12843
作者: Kangwen Zhao,Jianfeng Cai,Jinhua Zhu,Ruopei Sun,Dongyun Xue,Wengang Zhou,Li Li,Houqiang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Due to the word limit for arXiv abstract, the abstract here has been abridged compared to the one in the PDF

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback relies on reward models to align large language models with human preferences. However, RLHF often suffers from reward hacking, wherein policy learning exploits flaws in the trained reward model to maximize reward scores without genuinely aligning with human preferences. A significant example of such reward hacking is length bias, where reward models usually favor longer responses irrespective of actual response quality. Previous works on length bias have notable limitations, these approaches either mitigate bias without characterizing the bias form, or simply assume a linear length-reward relation. To accurately model the intricate nature of length bias and facilitate more effective bias mitigation, we propose FiMi-RM (Bias Fitting to Mitigate Length Bias of Reward Model in RLHF), a framework that autonomously learns and corrects underlying bias patterns. Our approach consists of three stages: First, we train a standard reward model which inherently contains length bias. Next, we deploy a lightweight fitting model to explicitly capture the non-linear relation between length and reward. Finally, we incorporate this learned relation into the reward model to debias. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution. Furthermore, when applied to alignment algorithms, our debiased reward model improves length-controlled win rate and reduces verbosity without compromising its performance.
zh

[AI-59] Reasoning BO: Enhancing Bayesian Optimization with Long-Context Reasoning Power of LLM s

【速读】:该论文旨在解决昂贵黑箱函数优化问题,此类问题在科学和工业应用中普遍存在。传统贝叶斯优化(Bayesian Optimization, BO)方法容易陷入局部最优且缺乏可解释性洞察。论文提出的解决方案是设计一种名为Reasoning BO的新框架,其关键在于利用推理模型引导采样过程,并结合多智能体系统和知识图谱实现在线知识积累,同时融合大语言模型(Large Language Models, LLMs)的推理与上下文理解能力,以提供强有力的优化指导。通过实时洞察和假设演化,该框架能够逐步优化采样策略,提升搜索空间中高性能区域的探索效率。

链接: https://arxiv.org/abs/2505.12833
作者: Zhuo Yang,Lingli Ge,Dong Han,Tianfan Fu,Yuqiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many real-world scientific and industrial applications require the optimization of expensive black-box functions. Bayesian Optimization (BO) provides an effective framework for such problems. However, traditional BO methods are prone to get trapped in local optima and often lack interpretable insights. To address this issue, this paper designs Reasoning BO, a novel framework that leverages reasoning models to guide the sampling process in BO while incorporating multi-agent systems and knowledge graphs for online knowledge accumulation. By integrating the reasoning and contextual understanding capabilities of Large Language Models (LLMs), we can provide strong guidance to enhance the BO process. As the optimization progresses, Reasoning BO provides real-time sampling recommendations along with critical insights grounded in plausible scientific theories, aiding in the discovery of superior solutions within the search space. We systematically evaluate our approach across 10 diverse tasks encompassing synthetic mathematical functions and complex real-world applications. The framework demonstrates its capability to progressively refine sampling strategies through real-time insights and hypothesis evolution, effectively identifying higher-performing regions of the search space for focused exploration. This process highlights the powerful reasoning and context-learning abilities of LLMs in optimization scenarios. For example, in the Direct Arylation task, our method increased the yield to 60.7%, whereas traditional BO achieved only a 25.2% yield. Furthermore, our investigation reveals that smaller LLMs, when fine-tuned through reinforcement learning, can attain comparable performance to their larger counterparts. This enhanced reasoning capability paves the way for more efficient automated scientific experimentation while maintaining computational feasibility.
zh

[AI-60] Emergent Specialization: Rare Token Neurons in Language Models

【速读】:该论文试图解决大型语言模型在表示和生成罕见标记(rare tokens)方面的困难,尽管这些标记在专业领域中具有重要性。解决方案的关键在于识别出对模型预测罕见标记具有显著影响的神经元结构,称为罕见标记神经元(rare token neurons),并揭示其出现与行为机制。研究发现,这些神经元呈现出动态演变的三阶段组织结构(平台期、幂律期和快速衰减期),并在激活空间中形成一个协调的子网络,选择性地共同激活而避免与其他神经元同时激活,这可能与重尾权重分布的发展相关,暗示了涌现特异性的统计力学基础。

链接: https://arxiv.org/abs/2505.12822
作者: Jing Liu,Haozheng Wang,Yueheng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Large language models struggle with representing and generating rare tokens despite their importance in specialized domains. In this study, we identify neuron structures with exceptionally strong influence on language model’s prediction of rare tokens, termed as rare token neurons, and investigate the mechanism for their emergence and behavior. These neurons exhibit a characteristic three-phase organization (plateau, power-law, and rapid decay) that emerges dynamically during training, evolving from a homogeneous initial state to a functionally differentiated architecture. In the activation space, rare token neurons form a coordinated subnetwork that selectively co-activates while avoiding co-activation with other neurons. This functional specialization potentially correlates with the development of heavy-tailed weight distributions, suggesting a statistical mechanical basis for emergent specialization.
zh

[AI-61] Learning in Chaos: Efficient Autoscaling and Self-healing for Distributed Training at the Edge

【速读】:该论文旨在解决边缘人工智能(Edge AI)集群中频繁的节点和链路变化对分布式训练造成的干扰问题,以及传统基于检查点的恢复机制和以云为中心的自动扩展在横向扩展中的延迟过高且不适用于混乱和自管理边缘环境的问题。其解决方案的关键在于提出了一种名为Chaos的弹性且可扩展的边缘分布式训练系统,该系统内置自我修复和自动扩展功能。Chaos通过多邻居复制与快速分片调度加速横向扩展,使新节点能够并行从邻近节点拉取最新训练状态并平衡流量负载;同时利用集群监控跟踪资源和拓扑变化以辅助调度决策,并通过对等协商协议处理扩展事件,实现无需中心管理员的完全自管理自动扩展。

链接: https://arxiv.org/abs/2505.12815
作者: Wenjiao Feng,Rongxing Xiao,Zonghang Li,Hongfang Yu,Gang Sun,Long Luo,Mohsen Guizani,Qirong Ho
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 13 pages, 16 figures

点击查看摘要

Abstract:Frequent node and link changes in edge AI clusters disrupt distributed training, while traditional checkpoint-based recovery and cloud-centric autoscaling are too slow for scale-out and ill-suited to chaotic and self-governed edge. This paper proposes Chaos, a resilient and scalable edge distributed training system with built-in self-healing and autoscaling. It speeds up scale-out by using multi-neighbor replication with fast shard scheduling, allowing a new node to pull the latest training state from nearby neighbors in parallel while balancing the traffic load between them. It also uses a cluster monitor to track resource and topology changes to assist scheduler decisions, and handles scaling events through peer negotiation protocols, enabling fully self-governed autoscaling without a central admin. Extensive experiments show that Chaos consistently achieves much lower scale-out delays than Pollux, EDL, and Autoscaling, and handles scale-in, connect-link, and disconnect-link events within 1 millisecond, making it smoother to handle node joins, exits, and failures. It also delivers the lowest idle time, showing superior resource use and scalability as the cluster grows.
zh

[AI-62] Dynamic Sight Range Selection in Multi-Agent Reinforcement Learning AAMAS2025

【速读】:该论文旨在解决多智能体强化学习(MARL)中的视野范围困境,即智能体从环境中获取的信息不足或过量的问题。其解决方案的关键在于提出一种名为动态视野范围选择(DSR)的方法,该方法利用上界置信度(UCB)算法,在训练过程中动态调整智能体的视野范围,从而优化信息获取并提升学习效率。

链接: https://arxiv.org/abs/2505.12811
作者: Wei-Chen Liao,Ti-Rong Wu,I-Chen Wu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at AAMAS 2025. The compiled PDF includes the appendix

点击查看摘要

Abstract:Multi-agent reinforcement Learning (MARL) is often challenged by the sight range dilemma, where agents either receive insufficient or excessive information from their environment. In this paper, we propose a novel method, called Dynamic Sight Range Selection (DSR), to address this issue. DSR utilizes an Upper Confidence Bound (UCB) algorithm and dynamically adjusts the sight range during training. Experiment results show several advantages of using DSR. First, we demonstrate using DSR achieves better performance in three common MARL environments, including Level-Based Foraging (LBF), Multi-Robot Warehouse (RWARE), and StarCraft Multi-Agent Challenge (SMAC). Second, our results show that DSR consistently improves performance across multiple MARL algorithms, including QMIX and MAPPO. Third, DSR offers suitable sight ranges for different training steps, thereby accelerating the training process. Finally, DSR provides additional interpretability by indicating the optimal sight range used during training. Unlike existing methods that rely on global information or communication mechanisms, our approach operates solely based on the individual sight ranges of agents. This approach offers a practical and efficient solution to the sight range dilemma, making it broadly applicable to real-world complex environments.
zh

[AI-63] FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA

【速读】:该论文试图解决在联邦学习中结合差分隐私随机梯度下降(DP-SGD)时,低秩适配(LoRA)方法面临显著的噪声放大问题。其关键解决方案是提出FedSVD,通过基于奇异值分解(SVD)的全局重参数化机制,使每个客户端仅优化并上传B矩阵,服务器则利用历史A矩阵计算BA并进行SVD分解,从而生成新的A矩阵和更新后的B矩阵。该方法避免了二次噪声放大,同时增强了模型对聚合更新主方向的捕捉能力,并通过A矩阵的正交结构限制B矩阵的梯度范数,提升在DP-SGD下的信号保留能力。

链接: https://arxiv.org/abs/2505.12805
作者: Seanie Lee,Sangwoo Park,Dong Bok Lee,Dominik Wagner,Haebin Seong,Tobias Bocklet,Juho Lee,Sung Ju Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA), which introduces a product of two trainable low-rank matrices into frozen pre-trained weights, is widely used for efficient fine-tuning of language models in federated learning (FL). However, when combined with differentially private stochastic gradient descent (DP-SGD), LoRA faces substantial noise amplification: DP-SGD perturbs per-sample gradients, and the matrix multiplication of the LoRA update ( BA ) intensifies this effect. Freezing one matrix (e.g., A ) reduces the noise but restricts model expressiveness, often resulting in suboptimal adaptation. To address this, we propose FedSVD, a simple yet effective method that introduces a global reparameterization based on singular value decomposition (SVD). In our approach, each client optimizes only the B matrix and transmits it to the server. The server aggregates the B matrices, computes the product BA using the previous A , and refactorizes the result via SVD. This yields a new adaptive A composed of the orthonormal right singular vectors of BA , and an updated B containing the remaining SVD components. This reparameterization avoids quadratic noise amplification, while allowing A to better capture the principal directions of the aggregate updates. Moreover, the orthonormal structure of A bounds the gradient norms of B and preserves more signal under DP-SGD, as confirmed by our theoretical analysis. As a result, FedSVD consistently improves stability and performance across a variety of privacy settings and benchmarks, outperforming relevant baselines under both private and non-private regimes.
zh

[AI-64] OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching

【速读】:该论文旨在解决传统文本到语音(Text-to-Speech, TTS)系统在建模语音属性时的局限性以及高计算成本的问题。其解决方案的关键在于提出OZSpeech,这是首个将最优传输条件流匹配与单步采样及学习先验作为条件相结合的TTS方法,通过解耦和因子化的语音组件进行建模,从而有效忽略前序状态并减少采样步骤,提升语音内容准确性、自然度、韵律生成和说话人风格保留的能力。

链接: https://arxiv.org/abs/2505.12800
作者: Hieu-Nghia Huynh-Nguyen,Ngoc Son Nguyen,Huynh Nguyen Dang,Thieu Vo,Truong-Son Hy,Van Nguyen
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Text-to-speech (TTS) systems have seen significant advancements in recent years, driven by improvements in deep learning and neural network architectures. Viewing the output speech as a data distribution, previous approaches often employ traditional speech representations, such as waveforms or spectrograms, within the Flow Matching framework. However, these methods have limitations, including overlooking various speech attributes and incurring high computational costs due to additional constraints introduced during training. To address these challenges, we introduce OZSpeech, the first TTS method to explore optimal transport conditional flow matching with one-step sampling and a learned prior as the condition, effectively disregarding preceding states and reducing the number of sampling steps. Our approach operates on disentangled, factorized components of speech in token format, enabling accurate modeling of each speech attribute, which enhances the TTS system’s ability to precisely clone the prompt speech. Experimental results show that our method achieves promising performance over existing methods in content accuracy, naturalness, prosody generation, and speaker style preservation. Audio samples are available at our demo page this https URL.
zh

[AI-65] FRAbench and GenEval: Scaling Fine-Grained Aspect Evaluation across Tasks Modalities

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)的开放性输出评估问题,这一问题随着模型能力、任务多样性和模态覆盖范围的快速扩展而成为瓶颈。现有“LLM作为评判者”的评估方法通常局限于特定任务、方面或模态,并且容易出现一致性低的问题。论文提出,明确且细粒度的方面规范是实现自动化评估通用性和客观性的关键。为此,作者引入了一个涵盖112个方面的分层方面分类法,统一了自然语言生成、图像理解、图像生成以及文本与图像混合生成四个典型场景的评估。基于该分类法,构建了FRAbench基准,包含60.4k对样本和325k个细粒度方面标签,提供了首个大规模多模态资源用于训练和元评估细粒度的多模态模型(Multimodal Models, MMs)评判者。利用FRAbench,作者开发了GenEval,一个跨任务和模态的细粒度评估器。实验表明,GenEval在与GPT-4o和专家标注者的一致性、对未见任务和模态的迁移能力以及揭示当前多模态模型系统性弱点方面表现优异。

链接: https://arxiv.org/abs/2505.12795
作者: Shibo Hong,Jiahao Ying,Haiyuan Liang,Mengdi Zhang,Jun Kuang,Jiazheng Zhang,Yixin Cao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evaluating the open-ended outputs of large language models (LLMs) has become a bottleneck as model capabilities, task diversity, and modality coverage rapidly expand. Existing “LLM-as-a-Judge” evaluators are typically narrow in a few tasks, aspects, or modalities, and easily suffer from low consistency. In this paper, we argue that explicit, fine-grained aspect specification is the key to both generalizability and objectivity in automated evaluation. To do so, we introduce a hierarchical aspect taxonomy spanning 112 aspects that unifies evaluation across four representative settings - Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. Building on this taxonomy, we create FRAbench, a benchmark comprising 60.4k pairwise samples with 325k aspect-level labels obtained from a combination of human and LLM annotations. FRAbench provides the first large-scale, multi-modal resource for training and meta-evaluating fine-grained LMM judges. Leveraging FRAbench, we develop GenEval, a fine-grained evaluator generalizable across tasks and modalities. Experiments show that GenEval (i) attains high agreement with GPT-4o and expert annotators, (ii) transfers robustly to unseen tasks and modalities, and (iii) reveals systematic weaknesses of current LMMs on evaluation.
zh

[AI-66] Mixture Policy based Multi-Hop Reasoning over N-tuple Temporal Knowledge Graphs

【速读】:该论文旨在解决现有N-Tuple Temporal Knowledge Graphs (N-TKGs)推理方法缺乏可解释性的问题,因为这些方法通常具有黑箱特性。其解决方案的关键在于提出一种基于强化学习的方法MT-Path,该方法利用时间信息遍历历史n元组以构建时间推理路径,并通过混合策略驱动的动作选择器整合谓词中的实体无关信息、核心元素信息以及整个n元组的完整信息,同时采用辅助元素感知的图卷积网络(GCN)捕捉事实间的丰富语义依赖关系,从而提升模型对每个n元组的深度理解能力。

链接: https://arxiv.org/abs/2505.12788
作者: Zhongni Hou,Miao Su,Xiaolong Jin,Zixuan Li,Long Bai,Jiafeng Guo,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Temporal Knowledge Graphs (TKGs), which utilize quadruples in the form of (subject, predicate, object, timestamp) to describe temporal facts, have attracted extensive attention. N-tuple TKGs (N-TKGs) further extend traditional TKGs by utilizing n-tuples to incorporate auxiliary elements alongside core elements (i.e., subject, predicate, and object) of facts, so as to represent them in a more fine-grained manner. Reasoning over N-TKGs aims to predict potential future facts based on historical ones. However, existing N-TKG reasoning methods often lack explainability due to their black-box nature. Therefore, we introduce a new Reinforcement Learning-based method, named MT-Path, which leverages the temporal information to traverse historical n-tuples and construct a temporal reasoning path. Specifically, in order to integrate the information encapsulated within n-tuples, i.e., the entity-irrelevant information within the predicate, the information about core elements, and the complete information about the entire n-tuples, MT-Path utilizes a mixture policy-driven action selector, which bases on three low-level policies, namely, the predicate-focused policy, the core-element-focused policy and the whole-fact-focused policy. Further, MT-Path utilizes an auxiliary element-aware GCN to capture the rich semantic dependencies among facts, thereby enabling the agent to gain a deep understanding of each n-tuple. Experimental results demonstrate the effectiveness and the explainability of MT-Path.
zh

[AI-67] Language Models That Walk the Talk: A Framework for Formal Fairness Certificates

【速读】:该论文试图解决大型语言模型在高风险应用中面临的鲁棒性和公平性问题,特别是在面对对抗性攻击时,如通过同义词替换等微小扰动导致模型预测错误,这在性别偏见缓解和毒性检测等关键领域可能带来严重风险。解决方案的关键在于提出一个全面的验证框架,用于证明基于Transformer的语言模型的鲁棒性,重点确保性别公平性和不同性别相关术语下的输出一致性,并将该方法扩展至毒性检测,提供形式化保证以确保对抗性操纵的有毒输入能够被一致检测和适当过滤。通过在嵌入空间中形式化鲁棒性,该工作增强了语言模型在伦理AI部署和内容审核中的可靠性。

链接: https://arxiv.org/abs/2505.12767
作者: Danqing Chen,Tobias Ladner,Ahmed Rayen Mhadhbi,Matthias Althoff
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models become integral to high-stakes applications, ensuring their robustness and fairness is critical. Despite their success, large language models remain vulnerable to adversarial attacks, where small perturbations, such as synonym substitutions, can alter model predictions, posing risks in fairness-critical areas, such as gender bias mitigation, and safety-critical areas, such as toxicity detection. While formal verification has been explored for neural networks, its application to large language models remains limited. This work presents a holistic verification framework to certify the robustness of transformer-based language models, with a focus on ensuring gender fairness and consistent outputs across different gender-related terms. Furthermore, we extend this methodology to toxicity detection, offering formal guarantees that adversarially manipulated toxic inputs are consistently detected and appropriately censored, thereby ensuring the reliability of moderation systems. By formalizing robustness within the embedding space, this work strengthens the reliability of language models in ethical AI deployment and content moderation.
zh

[AI-68] IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment

【速读】:该论文试图解决多领域混合训练数据集中不同领域数据量对大型语言模型(Large Language Models, LLMs)涌现能力的影响问题,特别是在已有高质量多领域训练数据的情况下,如何优化各领域数据的配比以提升模型在多个任务上的整体性能。解决方案的关键在于提出IDEAL框架,该框架采用基于梯度的方法,通过迭代优化不同领域数据的分布,动态调整各领域数据量以反映其对下游任务性能的影响,从而实现数据集的平衡配置,增强模型在多任务场景下的对齐效果和性能表现。

链接: https://arxiv.org/abs/2505.12762
作者: Chenlin Ming,Chendi Qu,Mengzhang Cai,Qizhi Pei,Zhuoshi Pan,Yu Li,Xiaoming Duan,Lijun Wu,Conghui He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model’s performance. Unlike many studies that focus on enhancing the quality of training datasets through data selection methods, few works explore the intricate relationship between the compositional quantity of mixture training datasets and the emergent capabilities of LLMs. Given the availability of a high-quality multi-domain training dataset, understanding the impact of data from each domain on the model’s overall capabilities is crucial for preparing SFT data and training a well-balanced model that performs effectively across diverse domains. In this work, we introduce IDEAL, an innovative data equilibrium adaptation framework designed to effectively optimize volumes of data from different domains within mixture SFT datasets, thereby enhancing the model’s alignment and performance across multiple capabilities. IDEAL employs a gradient-based approach to iteratively refine the training data distribution, dynamically adjusting the volumes of domain-specific data based on their impact on downstream task performance. By leveraging this adaptive mechanism, IDEAL ensures a balanced dataset composition, enabling the model to achieve robust generalization and consistent proficiency across diverse tasks. Experiments across different capabilities demonstrate that IDEAL outperforms conventional uniform data allocation strategies, achieving a comprehensive improvement of approximately 7% in multi-task evaluation scores.
zh

[AI-69] Enhancing Channel-Independent Time-Series Forecasting via Cross-Variate Patch Embedding

【速读】:该论文试图解决时间序列预测中现有模型仅关注时间依赖性而忽略变量间复杂关系的问题,尤其是针对完全通道依赖(channel-dependent, CD)模型易过拟合的缺陷。解决方案的关键在于提出一种轻量级的通道独立(channel-independent, CI)模块——跨变量块嵌入(Cross-Variate Patch Embeddings, CVPE),通过修改块嵌入过程注入跨变量上下文,具体方法包括添加可学习的位置编码和轻量级路由注意力模块,从而提升CI模型在捕捉变量间依赖关系上的能力。

链接: https://arxiv.org/abs/2505.12761
作者: Donghwa Shin,Edwin Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformers have recently gained popularity in time series forecasting due to their ability to capture long-term dependencies. However, many existing models focus only on capturing temporal dependencies while omitting intricate relationships between variables. Recent models have tried tackling this by explicitly modeling both cross-time and cross-variate dependencies through a sequential or unified attention mechanism, but they are entirely channel dependent (CD) across all layers, making them potentially susceptible to overfitting. To address this, we propose Cross-Variate Patch Embeddings (CVPE), a lightweight CD module that injects cross-variate context into channel-independent (CI) models by simply modifying the patch embedding process. We achieve this by adding a learnable positional encoding and a lightweight router-attention block to the vanilla patch embedding layer. We then integrate CVPE into Time-LLM, a multimodal CI forecasting model, to demonstrate its effectiveness in capturing cross-variate dependencies and enhance the CI model’s performance. Extensive experimental results on seven real-world datasets show that our enhanced Time-LLM outperforms the original baseline model simply by incorporating the CVPE module, with no other changes.
zh

[AI-70] Malware families discovery via Open-Set Recognition on Android manifest permissions ECAI2025

【速读】:该论文试图解决恶意软件(Malware)分类任务中的挑战,特别是在面对高维权限数据和有限训练样本的情况下,以及由于新型恶意软件家族不断出现而导致无法获取覆盖所有恶意软件类别的完整训练集的问题。解决方案的关键在于将计算机视觉领域开发的开放集识别技术MaxLogit与基于树的梯度提升分类器相结合,该方法在处理高维数据方面表现出色,并且能够有效检测未知的恶意软件家族。

链接: https://arxiv.org/abs/2505.12750
作者: Filippo Leveni,Matteo Mistura,Francesco Iubatti,Carmine Giangregorio,Nicolò Pastore,Cesare Alippi,Giacomo Boracchi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to European Conference on Artificial Intelligence (ECAI 2025)

点击查看摘要

Abstract:Malware are malicious programs that are grouped into families based on their penetration technique, source code, and other characteristics. Classifying malware programs into their respective families is essential for building effective defenses against cyber threats. Machine learning models have a huge potential in malware detection on mobile devices, as malware families can be recognized by classifying permission data extracted from Android manifest files. Still, the malware classification task is challenging due to the high-dimensional nature of permission data and the limited availability of training samples. In particular, the steady emergence of new malware families makes it impossible to acquire a comprehensive training set covering all the malware classes. In this work, we present a malware classification system that, on top of classifying known malware, detects new ones. In particular, we combine an open-set recognition technique developed within the computer vision community, namely MaxLogit, with a tree-based Gradient Boosting classifier, which is particularly effective in classifying high-dimensional data. Our solution turns out to be very practical, as it can be seamlessly employed in a standard classification workflow, and efficient, as it adds minimal computational overhead. Experiments on public and proprietary datasets demonstrate the potential of our solution, which has been deployed in a business environment.
zh

[AI-71] Correspondence of high-dimensional emotion structures elicited by video clips between humans and Multimodal LLM s

【速读】:该论文试图解决如何评估多模态大语言模型(Multimodal Large Language Models, MLLMs)在捕捉人类情绪的高维复杂结构方面的表现问题。其解决方案的关键在于通过对比参与者观看视频时自我报告的情绪评分与模型生成的情绪估计(如Gemini或GPT),从个体视频层面以及考虑视频间关系的情绪结构层面进行性能评估,并利用Gromov Wasserstein最优传输方法分析人类与模型在情绪结构上的相似性,从而揭示模型在类别层面捕捉情绪复杂结构的能力及其在单条目层面的局限性。

链接: https://arxiv.org/abs/2505.12746
作者: Haruka Asanuma,Naoko Koide-Majima,Ken Nakamura,Takato Horii,Shinji Nishimoto,Masafumi Oizumi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 7 figures

点击查看摘要

Abstract:Recent studies have revealed that human emotions exhibit a high-dimensional, complex structure. A full capturing of this complexity requires new approaches, as conventional models that disregard high dimensionality risk overlooking key nuances of human emotions. Here, we examined the extent to which the latest generation of rapidly evolving Multimodal Large Language Models (MLLMs) capture these high-dimensional, intricate emotion structures, including capabilities and limitations. Specifically, we compared self-reported emotion ratings from participants watching videos with model-generated estimates (e.g., Gemini or GPT). We evaluated performance not only at the individual video level but also from emotion structures that account for inter-video relationships. At the level of simple correlation between emotion structures, our results demonstrated strong similarity between human and model-inferred emotion structures. To further explore whether the similarity between humans and models is at the signle item level or the coarse-categorical level, we applied Gromov Wasserstein Optimal Transport. We found that although performance was not necessarily high at the strict, single-item level, performance across video categories that elicit similar emotions was substantial, indicating that the model could infer human emotional experiences at the category level. Our results suggest that current state-of-the-art MLLMs broadly capture the complex high-dimensional emotion structures at the category level, as well as their apparent limitations in accurately capturing entire structures at the single-item level.
zh

[AI-72] PEER pressure: Model-to-Model Regularization for Single Source Domain Generalization CVPR2025

【速读】:该论文试图解决单源域泛化中基于数据增强的方法在目标域上的性能在训练过程中普遍出现波动的问题,这给实际场景下的模型选择带来了挑战。解决方案的关键在于提出一种名为参数空间集成与熵正则化(PEER)的新泛化方法,该方法通过引入一个代理模型来代表主模型学习增强数据,并通过平均主模型与代理模型的参数来逐步累积知识,同时最大化两个模型输出表示之间的互信息以缓解训练过程中的特征失真。

链接: https://arxiv.org/abs/2505.12745
作者: Dong Kyu Cho,Inwoo Hwang,Sanghack Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 9 figures, Accepted at CVPR 2025

点击查看摘要

Abstract:Data augmentation is a popular tool for single source domain generalization, which expands the source domain by generating simulated ones, improving generalization on unseen target domains. In this work, we show that the performance of such augmentation-based methods in the target domains universally fluctuates during training, posing challenges in model selection under realistic scenarios. We argue that the fluctuation stems from the inability of the model to accumulate the knowledge learned from diverse augmentations, exacerbating feature distortion during training. Based on this observation, we propose a novel generalization method, coined Parameter-Space Ensemble with Entropy Regularization (PEER), that uses a proxy model to learn the augmented data on behalf of the main model. The main model is updated by averaging its parameters with the proxy model, progressively accumulating knowledge over the training steps. Maximizing the mutual information between the output representations of the two models guides the learning process of the proxy model, mitigating feature distortion during training. Experimental results demonstrate the effectiveness of PEER in reducing the OOD performance fluctuation and enhancing generalization across various datasets, including PACS, Digits, Office-Home, and VLCS. Notably, our method with simple random augmentation achieves state-of-the-art performance, surpassing prior approaches on sDG that utilize complex data augmentation strategies.
zh

[AI-73] Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation

【速读】:该论文旨在解决如何将大型多模态模型(Large Multimodal Models, LMMs)的推理能力有效应用于机器人操作的问题,特别是如何使LMMs直接通过语言推理生成下一步操作目标,而非依赖独立的动作头。其解决方案的关键在于两个方面:首先,提出一种新的任务形式,输入物体部件和夹爪的当前状态,并采用新的轴表示法替代传统的欧拉角来重构旋转,以增强空间推理的兼容性和可解释性;其次,设计一条管道,利用先进的LMM生成高质量的多轮对话推理数据集,用于监督微调,并通过模拟中的试错交互进行强化学习,进一步提升模型在机器人操作中的推理能力。

链接: https://arxiv.org/abs/2505.12744
作者: Weiliang Tang,Dong Jing,Jia-Hui Pan,Zhiwu Lu,Yun-Hui Liu,Li Erran Li,Mingyu Ding,Chi-Wing Fu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 16 figures

点击查看摘要

Abstract:Recent Large Multimodal Models have demonstrated remarkable reasoning capabilities, especially in solving complex mathematical problems and realizing accurate spatial perception. Our key insight is that these emerging abilities can naturally extend to robotic manipulation by enabling LMMs to directly infer the next goal in language via reasoning, rather than relying on a separate action head. However, this paradigm meets two main challenges: i) How to make LMMs understand the spatial action space, and ii) How to fully exploit the reasoning capacity of LMMs in solving these tasks. To tackle the former challenge, we propose a novel task formulation, which inputs the current states of object parts and the gripper, and reformulates rotation by a new axis representation instead of traditional Euler angles. This representation is more compatible with spatial reasoning and easier to interpret within a unified language space. For the latter challenge, we design a pipeline to utilize cutting-edge LMMs to generate a small but high-quality reasoning dataset of multi-round dialogues that successfully solve manipulation tasks for supervised fine-tuning. Then, we perform reinforcement learning by trial-and-error interactions in simulation to further enhance the model’s reasoning abilities for robotic manipulation. Our resulting reasoning model built upon a 7B backbone, named ReasonManip, demonstrates three notable advantages driven by its system-2 level reasoning capabilities: i) exceptional generalizability to out-of-distribution environments, objects, and tasks; ii) inherent sim-to-real transfer ability enabled by the unified language representation shared across domains; iii) transparent interpretability connecting high-level reasoning and low-level control. Extensive experiments demonstrate the effectiveness of the proposed paradigm and its potential to advance LMM-driven robotic manipulation.
zh

[AI-74] Dense Communication between Language Models

【速读】:该论文试图解决如何高效构建具有集体智能的大型语言模型(Large Language Models, LLMs)的问题,特别是针对当前系统依赖自然语言进行通信所带来的效率瓶颈。其解决方案的关键在于提出一种基于LLMs之间直接密集向量通信的新范式,通过消除LLM交互时不必要的嵌入和解嵌入步骤,实现更高效的信息传递、端到端可微优化路径以及超越人类启发式方法的能力探索。该方法将简化后的LLMs作为节点,可优化的序列到序列模块作为边,构建出类似多层感知机(MLP)结构的LMNet,从而在保持性能的同时显著降低训练成本。

链接: https://arxiv.org/abs/2505.12741
作者: Shiguang Wu,Yaqing Wang,Quanming Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As higher-level intelligence emerges from the combination of modular components with lower-level intelligence, many works combines Large Language Models (LLMs) for collective intelligence. Such combination is achieved by building communications among LLMs. While current systems primarily facilitate such communication through natural language, this paper proposes a novel paradigm of direct dense vector communication between LLMs. Our approach eliminates the unnecessary embedding and de-embedding steps when LLM interact with another, enabling more efficient information transfer, fully differentiable optimization pathways, and exploration of capabilities beyond human heuristics. We use such stripped LLMs as vertexes and optimizable seq2seq modules as edges to construct LMNet, with similar structure as MLPs. By utilizing smaller pre-trained LLMs as vertexes, we train a LMNet that achieves comparable performance with LLMs in similar size with only less than 0.1% training cost. This offers a new perspective on scaling for general intelligence rather than training a monolithic LLM from scratch. Besides, the proposed method can be used for other applications, like customizing LLM with limited data, showing its versatility.
zh

[AI-75] EpiLLM : Unlocking the Potential of Large Language Models in Epidemic Forecasting

【速读】:该论文旨在解决传染病预测中的复杂时空模式建模问题,以提升精准防控策略的制定能力。其解决方案的关键在于提出一种基于大型语言模型(Large Language Model, LLM)的框架EpiLLM,通过双分支架构实现感染病例与人群流动等关键因素与语言标记的细粒度对齐,并采用自回归建模范式将疫情预测任务转化为下一步标记预测,同时引入时空提示学习技术从数据驱动角度增强模型对疫情的感知能力。

链接: https://arxiv.org/abs/2505.12738
作者: Chenghua Gong,Rui Sun,Yuhao Zheng,Juyuan Zhang,Tianjun Gu,Liming Pan,Linyuan Lv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 18 pages

点击查看摘要

Abstract:Advanced epidemic forecasting is critical for enabling precision containment strategies, highlighting its strategic importance for public health security. While recent advances in Large Language Models (LLMs) have demonstrated effectiveness as foundation models for domain-specific tasks, their potential for epidemic forecasting remains largely unexplored. In this paper, we introduce EpiLLM, a novel LLM-based framework tailored for spatio-temporal epidemic forecasting. Considering the key factors in real-world epidemic transmission: infection cases and human mobility, we introduce a dual-branch architecture to achieve fine-grained token-level alignment between such complex epidemic patterns and language tokens for LLM adaptation. To unleash the multi-step forecasting and generalization potential of LLM architectures, we propose an autoregressive modeling paradigm that reformulates the epidemic forecasting task into next-token prediction. To further enhance LLM perception of epidemics, we introduce spatio-temporal prompt learning techniques, which strengthen forecasting capabilities from a data-driven perspective. Extensive experiments show that EpiLLM significantly outperforms existing baselines on real-world COVID-19 datasets and exhibits scaling behavior characteristic of LLMs.
zh

[AI-76] Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning

【速读】:该论文试图解决离线目标条件强化学习(Offline Goal-Conditioned Reinforcement Learning, GCRL)在长时序任务中表现不佳的问题。尽管已有方法如HIQL采用了分层策略结构,但在处理长时序任务时仍存在显著性能瓶颈。解决方案的关键在于改进价值函数以生成清晰的优势信号,从而提升高层策略的学习效果。为此,作者提出了一种简单而有效的方法——选项感知的时间抽象价值学习(Option-aware Temporally Abstracted value learning, OTA),通过将时间抽象引入时间差分学习过程,使价值更新具备选项感知能力,从而缩短有效时序长度,改善长时序场景下的优势估计。

链接: https://arxiv.org/abs/2505.12737
作者: Hongjoon Ahn,Heewoong Choi,Jisu Han,Taesup Moon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm where goal-reaching policies are trained from abundant unlabeled (reward-free) datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. By identifying the root cause of this challenge, we observe the following insights: First, performance bottlenecks mainly stem from the high-level policy’s inability to generate appropriate subgoals. Second, when learning the high-level policy in the long-horizon regime, the sign of the advantage signal frequently becomes incorrect. Thus, we argue that improving the value function to produce a clear advantage signal for learning the high-level policy is essential. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By modifying the value update to be option-aware, the proposed learning scheme contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We experimentally show that the high-level policy extracted using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
zh

[AI-77] SounDiT: Geo-Contextual Soundscape-to-Landscape Generation

【速读】:该论文试图解决传统音频到图像生成方法在地理和环境上下文建模上的不足,即生成的图像与真实世界环境设置不一致的问题。其解决方案的关键在于提出一种新颖的地理上下文计算框架,该框架将地理知识显式地整合到多模态生成建模中,并构建了两个大规模的地理上下文多模态数据集SoundingSVI和SonicUrban。此外,论文还提出了基于Diffusion Transformer(DiT)的SounDiT模型,通过引入地理上下文场景条件来生成地理一致的景观图像,并设计了Place Similarity Score(PSS)评估框架以衡量输入声景与生成图像之间的一致性。

链接: https://arxiv.org/abs/2505.12734
作者: Junbo Wang,Haofeng Tan,Bowen Liao,Albert Jiang,Teng Fei,Qixing Huang,Zhengzhong Tu,Shan Ye,Yuhao Kang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:We present a novel and practically significant problem-Geo-Contextual Soundscape-to-Landscape (GeoS2L) generation-which aims to synthesize geographically realistic landscape images from environmental soundscapes. Prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, resulting in unrealistic images that are misaligned with real-world environmental settings. To address this limitation, we introduce a novel geo-contextual computational framework that explicitly integrates geographic knowledge into multimodal generative modeling. We construct two large-scale geo-contextual multimodal datasets, SoundingSVI and SonicUrban, pairing diverse soundscapes with real-world landscape images. We propose SounDiT, a novel Diffusion Transformer (DiT)-based model that incorporates geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose a practically-informed geo-contextual evaluation framework, the Place Similarity Score (PSS), across element-, scene-, and human perception-levels to measure consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines in both visual fidelity and geographic settings. Our work not only establishes foundational benchmarks for GeoS2L generation but also highlights the importance of incorporating geographic domain knowledge in advancing multimodal generative models, opening new directions at the intersection of generative AI, geography, urban planning, and environmental sciences.
zh

[AI-78] Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps

【速读】:该论文旨在解决自适应检索增强生成(Adaptive-RAG)方法中由于多轮检索结果存在大量重叠内容而导致的冗余计算问题,从而提升整体效率。其解决方案的关键在于引入一种与模型无关的方法,通过缓存访问和并行生成分别加速预填充和解码阶段,并结合指令驱动模块以更有效地引导模型关注内容的不同部分,从而减少冗余表示并提高计算效率。

链接: https://arxiv.org/abs/2505.12731
作者: Jie Ou,Jinyu Guo,Shuaihong Jiang,Zhaokun Wang,Libo Qin,Shunyu Yao,Wenhong Tian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has emerged as a pivotal method for expanding the knowledge of large language models. To handle complex queries more effectively, researchers developed Adaptive-RAG (A-RAG) to enhance the generated quality through multiple interactions with external knowledge bases. Despite its effectiveness, A-RAG exacerbates the pre-existing efficiency challenges inherent in RAG, which are attributable to its reliance on multiple iterations of generation. Existing A-RAG approaches process all retrieved contents from scratch. However, they ignore the situation where there is a significant overlap in the content of the retrieval results across rounds. The overlapping content is redundantly represented, which leads to a large proportion of repeated computations, thus affecting the overall efficiency. To address this issue, this paper introduces a model-agnostic approach that can be generally applied to A-RAG methods, which is dedicated to reducing the redundant representation process caused by the overlapping of retrieval results. Specifically, we use cache access and parallel generation to speed up the prefilling and decoding stages respectively. Additionally, we also propose an instruction-driven module to further guide the model to more effectively attend to each part of the content in a more suitable way for LLMs. Experiments show that our approach achieves 2.79 and 2.33 times significant acceleration on average for prefilling and decoding respectively while maintaining equal generation quality.
zh

[AI-79] PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI

【速读】:该论文试图解决当前在训练人类水平的具身智能体(embodied agents)过程中存在的数据瓶颈问题,即缺乏大规模、实时、多模态且具有社会互动性的数据集,这些数据集需反映自然环境中的感官-运动复杂性。其解决方案的关键在于提出PLAICraft,这是一个新型的数据采集平台和数据集,捕捉了多人在Minecraft中的交互行为,并涵盖了五种时间对齐的模态:视频、游戏输出音频、麦克风输入音频、鼠标和键盘操作。所有模态均以毫秒级时间精度记录,从而支持对同步、具身行为的研究。

链接: https://arxiv.org/abs/2505.12707
作者: Yingchen He,Christian D. Weilbach,Martyna E. Wojciechowska,Yuxuan Zhang,Frank Wood
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 9 pages, 8 figures

点击查看摘要

Abstract:Advances in deep generative modelling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants.\footnoteWe have done a privacy review for the public release of an initial 200-hour subset of the dataset, with plans to release most of the dataset over time. Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.
zh

[AI-80] DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories

【速读】:该论文试图解决机器人策略在不同行为和环境中的泛化问题,即如何使机器人在未见过的环境中执行多种新任务。解决方案的关键在于提出一种四阶段流水线——DreamGen,该方法通过神经轨迹(neural trajectories)生成合成机器人数据,利用先进的图像到视频生成模型,适应目标机器人本体,生成逼真的合成视频,并通过潜在动作模型或逆动力学模型(IDM)恢复伪动作序列,从而实现高效的数据生成与策略训练。

链接: https://arxiv.org/abs/2505.12705
作者: Joel Jang,Seonghyeon Ye,Zongyu Lin,Jiannan Xiang,Johan Bjorck,Yu Fang,Fengyuan Hu,Spencer Huang,Kaushil Kundalia,Yen-Chen Lin,Loic Magne,Ajay Mandlekar,Avnish Narayan,You Liang Tan,Guanzhi Wang,Jing Wang,Qi Wang,Yinzhen Xu,Xiaohui Zeng,Kaiyuan Zheng,Ruijie Zheng,Ming-Yu Liu,Luke Zettlemoyer,Dieter Fox,Jan Kautz,Scott Reed,Yuke Zhu,Linxi Fan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: See website for videos: this https URL

点击查看摘要

Abstract:We introduce DreamGen, a simple yet highly effective 4-stage pipeline for training robot policies that generalize across behaviors and environments through neural trajectories - synthetic robot data generated from video world models. DreamGen leverages state-of-the-art image-to-video generative models, adapting them to the target robot embodiment to produce photorealistic synthetic videos of familiar or novel tasks in diverse environments. Since these models generate only videos, we recover pseudo-action sequences using either a latent action model or an inverse-dynamics model (IDM). Despite its simplicity, DreamGen unlocks strong behavior and environment generalization: a humanoid robot can perform 22 new behaviors in both seen and unseen environments, while requiring teleoperation data from only a single pick-and-place task in one environment. To evaluate the pipeline systematically, we introduce DreamGen Bench, a video generation benchmark that shows a strong correlation between benchmark performance and downstream policy success. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
zh

[AI-81] Counterfactual Explanations for Continuous Action Reinforcement Learning IJCAI

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)在连续动作空间中缺乏可解释性的问题,特别是在医疗和机器人等高敏感领域应用时的可信度不足。其解决方案的关键在于提出一种新颖的方法,通过计算能够改善结果但与原始动作序列偏离最小的替代动作序列,从而生成反事实解释(counterfactual explanations)。该方法利用了针对连续动作的距离度量,并考虑了特定状态下的预定义策略约束,以提高解释的有效性、效率和泛化能力。

链接: https://arxiv.org/abs/2505.12701
作者: Shuyang Dong,Shangtong Zhang,Lu Feng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by International Joint Conference on Artificial Intelligence (IJCAI) 2025

点击查看摘要

Abstract:Reinforcement Learning (RL) has shown great promise in domains like healthcare and robotics but often struggles with adoption due to its lack of interpretability. Counterfactual explanations, which address “what if” scenarios, provide a promising avenue for understanding RL decisions but remain underexplored for continuous action spaces. We propose a novel approach for generating counterfactual explanations in continuous action RL by computing alternative action sequences that improve outcomes while minimizing deviations from the original sequence. Our approach leverages a distance metric for continuous actions and accounts for constraints such as adhering to predefined policies in specific states. Evaluations in two RL domains, Diabetes Control and Lunar Lander, demonstrate the effectiveness, efficiency, and generalization of our approach, enabling more interpretable and trustworthy RL applications.
zh

[AI-82] owards Effective Federated Graph Foundation Model via Mitigating Knowledge Entanglement

【速读】:该论文旨在解决联邦图学习(Federated Graph Learning, FGL)中因数据和任务异质性导致的实用性受限,以及图基础模型(Graph Foundation Models, GFM)在单机训练中缺乏跨库数据与资源利用的问题。其解决方案的关键在于提出FedGFM+,通过两个核心模块减少知识纠缠:(1) AncDAI,一种基于全局锚点的领域感知初始化策略,通过局部图编码生成领域特定原型作为语义锚点,理论上证明这些原型在不同领域间可区分,提供强归纳偏置以解耦领域特定知识;(2) AdaDPP,一种本地自适应领域敏感提示池,在预训练过程中学习轻量级图提示捕捉领域语义,并在微调阶段通过提示池增强目标图属性,提升下游适应性。

链接: https://arxiv.org/abs/2505.12684
作者: Yinlin Zhu,Xunkai Li,Jishuo Jia,Miao Hu,Di Wu,Meikang Qiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
备注: Under Review

点击查看摘要

Abstract:Recent advances in graph machine learning have shifted to data-centric paradigms, driven by two emerging fields: (1) Federated graph learning (FGL) enables multi-client collaboration but faces challenges from data and task heterogeneity, limiting its practicality; (2) Graph foundation models (GFM) offer strong domain generalization but are usually trained on single machines, missing out on cross-silo data and resources. These paradigms are complementary, and their integration brings notable benefits. Motivated by this, we propose FedGFM, a novel decentralized GFM training paradigm. However, a key challenge is knowledge entanglement, where multi-domain knowledge merges into indistinguishable representations, hindering downstream adaptation. To address this, we present FedGFM+, an enhanced framework with two core modules to reduce knowledge entanglement: (1) AncDAI: A global anchor-based domain-aware initialization strategy. Before pre-training, each client encodes its local graph into domain-specific prototypes that serve as semantic anchors. Synthetic embeddings around these anchors initialize the global model. We theoretically prove these prototypes are distinguishable across domains, providing a strong inductive bias to disentangle domain-specific knowledge. (2) AdaDPP: A local adaptive domain-sensitive prompt pool. Each client learns a lightweight graph prompt capturing domain semantics during pre-training. During fine-tuning, prompts from all clients form a pool from which the GFM selects relevant prompts to augment target graph attributes, improving downstream adaptation. FedGFM+ is evaluated on 8 diverse benchmarks across multiple domains and tasks, outperforming 20 baselines from supervised learning, FGL, and federated GFM variants. Comments: Under Review Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI) Cite as: arXiv:2505.12684 [cs.LG] (or arXiv:2505.12684v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.12684 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xunkai Li [view email] [v1] Mon, 19 May 2025 04:06:32 UTC (928 KB)
zh

[AI-83] xt2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment

【速读】:该论文试图解决符号化音乐生成过程中生成音乐与输入文本描述不一致的问题,从而提升生成作品的整体质量和连贯性。解决方案的关键在于在推理阶段引入文本-音频对齐和音乐结构对齐奖励机制,通过优化基于对齐的目标函数,包括衡量生成音乐与原始文本描述节奏一致性的文本-音频一致性得分,以及惩罚与调性不一致音符的和声一致性得分,使生成的音乐更贴合输入文本描述。

链接: https://arxiv.org/abs/2505.12669
作者: Abhinaba Roy,Geeta Puri,Dorien Herremans
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: 7 pages, 1 figure, 5 tables

点击查看摘要

Abstract:We present Text2midi-InferAlign, a novel technique for improving symbolic music generation at inference time. Our method leverages text-to-audio alignment and music structural alignment rewards during inference to encourage the generated music to be consistent with the input caption. Specifically, we introduce two objectives scores: a text-audio consistency score that measures rhythmic alignment between the generated music and the original text caption, and a harmonic consistency score that penalizes generated music containing notes inconsistent with the key. By optimizing these alignment-based objectives during the generation process, our model produces symbolic music that is more closely tied to the input captions, thereby improving the overall quality and coherence of the generated compositions. Our approach can extend any existing autoregressive model without requiring further training or fine-tuning. We evaluate our work on top of Text2midi - an existing text-to-midi generation model, demonstrating significant improvements in both objective and subjective evaluation metrics.
zh

[AI-84] Web IP at Risk: Prevent Unauthorized Real-Time Retrieval by Large Language Models

【速读】:该论文试图解决网络知识产权(cyber Intellectual Property, IP)保护问题,特别是针对大型语言模型(Large Language Models, LLMs)在在线检索能力下对原创内容的未经授权实时提取。解决方案的关键在于利用LLMs自身的语义理解能力,为网络内容创作者提供一种新的防御框架,以有效应对难以处理的黑盒优化问题,从而显著提升防御成功率。

链接: https://arxiv.org/abs/2505.12655
作者: Yisheng Zhong,Yizhu Wen,Junfeng Guo,Mehran Kafai,Heng Huang,Hanqing Guo,Zhuangdi Zhu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 13 pages, 13 figures, 4 tables

点击查看摘要

Abstract:Protecting cyber Intellectual Property (IP) such as web content is an increasingly critical concern. The rise of large language models (LLMs) with online retrieval capabilities presents a double-edged sword that enables convenient access to information but often undermines the rights of original content creators. As users increasingly rely on LLM-generated responses, they gradually diminish direct engagement with original information sources, significantly reducing the incentives for IP creators to contribute, and leading to a saturating cyberspace with more AI-generated content. In response, we propose a novel defense framework that empowers web content creators to safeguard their web-based IP from unauthorized LLM real-time extraction by leveraging the semantic understanding capability of LLMs themselves. Our method follows principled motivations and effectively addresses an intractable black-box optimization problem. Real-world experiments demonstrated that our methods improve defense success rates from 2.5% to 88.6% on different LLMs, outperforming traditional defenses such as configuration-based restrictions.
zh

[AI-85] textttDIAMONDs: A Dataset for mathbbDynamic mathbbInformation mathbbAnd mathbbMental modeling mathbbOf mathbbNumeric mathbbDiscussions

【速读】:该论文旨在解决多主体对话中理论心智(Theory of Mind, ToM)能力评估的问题,特别是针对对话中动态信息跟踪、知识不对称管理和相关信息区分等挑战。其解决方案的关键在于提出一种可扩展的方法论,用于生成高质量的对话-问题配对基准数据集,即\textttDIAMONDs,该数据集涵盖了商业、金融等群体互动场景,并包含需要数值推理的问题,从而能够精确评估模型在跟踪和推理参与者知识状态方面的ToM能力。

链接: https://arxiv.org/abs/2505.12651
作者: Sayontan Ghosh,Mahnaz Koupaee,Yash Kumar Lal,Pegah Alipoormolabashi,Mohammad Saqib Hasan,Jun Seok Kang,Niranjan Balasubramanian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding multiparty conversations demands robust Theory of Mind (ToM) capabilities, including the ability to track dynamic information, manage knowledge asymmetries, and distinguish relevant information across extended exchanges. To advance ToM evaluation in such settings, we present a carefully designed scalable methodology for generating high-quality benchmark conversation-question pairs with these characteristics. Using this methodology, we create \textttDIAMONDs , a new conversational QA dataset covering common business, financial or other group interactions. In these goal-oriented conversations, participants often have to track certain numerical quantities (say \textitexpected profit ) of interest that can be derived from other variable quantities (like \textitmarketing expenses, expected sales, salary , etc.), whose values also change over the course of the conversation. \textttDIAMONDs questions pose simple numerical reasoning problems over such quantities of interest (e.g., \textitfunds required for charity events, expected company profit next quarter , etc.) in the context of the information exchanged in conversations. This allows for precisely evaluating ToM capabilities for carefully tracking and reasoning over participants’ knowledge states. Our evaluation of state-of-the-art language models reveals significant challenges in handling participant-centric reasoning, specifically in situations where participants have false beliefs. Models also struggle with conversations containing distractors and show limited ability to identify scenarios with insufficient information. These findings highlight current models’ ToM limitations in handling real-world multi-party conversations. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2505.12651 [cs.AI] (or arXiv:2505.12651v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2505.12651 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-86] Lightweight and Effective Preference Construction in PIBT for Large-Scale Multi-Agent Pathfinding

【速读】:该论文旨在解决多智能体路径规划(Multi-Agent Path Finding, MAPF)中由于代理在多个最优动作之间进行选择时产生的冲突问题,这些问题会显著影响最终解的质量。论文提出的解决方案关键在于改进PIBT(Prioritized Iterative Best First Search)算法中的冲突解决策略,通过两种有效的技巧进行决策:一是让代理在行动时考虑是否会影响后续步骤的进展,从而智能地避开其他代理;二是通过多次运行PIBT学习动作对其他代理造成的遗憾,并利用此信息最小化整体遗憾。这些方法在不牺牲PIBT计算优势的前提下,有效降低了单次MAPF问题的总成本并提升了长期任务(lifelong MAPF)的吞吐量。

链接: https://arxiv.org/abs/2505.12623
作者: Keisuke Okumura,Hiroki Nagai
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: To be presented at SoCS-25

点击查看摘要

Abstract:PIBT is a computationally lightweight algorithm that can be applied to a variety of multi-agent pathfinding (MAPF) problems, generating the next collision-free locations of agents given another. Because of its simplicity and scalability, it is becoming a popular underlying scheme for recent large-scale MAPF methods involving several hundreds or thousands of agents. Vanilla PIBT makes agents behave greedily towards their assigned goals, while agents typically have multiple best actions, since the graph shortest path is not always unique. Consequently, tiebreaking about how to choose between these actions significantly affects resulting solutions. This paper studies two simple yet effective techniques for tiebreaking in PIBT, without compromising its computational advantage. The first technique allows an agent to intelligently dodge another, taking into account whether each action will hinder the progress of the next timestep. The second technique is to learn, through multiple PIBT runs, how an action causes regret in others and to use this information to minimise regret collectively. Our empirical results demonstrate that these techniques can reduce the solution cost of one-shot MAPF and improve the throughput of lifelong MAPF. For instance, in densely populated one-shot cases, the combined use of these tiebreaks achieves improvements of around 10-20% in sum-of-costs, without significantly compromising the speed of a PIBT-based planner.
zh

[AI-87] Learning Robust Spectral Dynamics for Temporal Domain Generalization

【速读】:该论文旨在解决动态环境中机器学习模型性能下降的问题,特别是在存在时间分布偏移(概念漂移)的情况下。现有方法通常假设变化是平滑的增量过程,难以应对现实世界中包含长期结构(如渐进演化和周期性)和局部不确定性的复杂漂移。解决方案的关键在于提出FreKoo,其通过参数轨迹的频域分析来应对这些挑战,利用傅里叶变换将参数演化分解为不同的频谱带,其中低频成分通过Koopman算子进行学习与外推,以稳健地捕捉多种漂移模式,而高频扰动则通过针对性的时间正则化进行平滑处理,从而防止对瞬时噪声和领域不确定性过拟合。

链接: https://arxiv.org/abs/2505.12585
作者: En Yu,Jie Lu,Xiaoyu Yang,Guangquan Zhang,Zhen Fang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern machine learning models struggle to maintain performance in dynamic environments where temporal distribution shifts, \emphi.e., concept drift, are prevalent. Temporal Domain Generalization (TDG) seeks to enable model generalization across evolving domains, yet existing approaches typically assume smooth incremental changes, struggling with complex real-world drifts involving long-term structure (incremental evolution/periodicity) and local uncertainties. To overcome these limitations, we introduce FreKoo, which tackles these challenges via a novel frequency-domain analysis of parameter trajectories. It leverages the Fourier transform to disentangle parameter evolution into distinct spectral bands. Specifically, low-frequency component with dominant dynamics are learned and extrapolated using the Koopman operator, robustly capturing diverse drift patterns including both incremental and periodicity. Simultaneously, potentially disruptive high-frequency variations are smoothed via targeted temporal regularization, preventing overfitting to transient noise and domain uncertainties. In addition, this dual spectral strategy is rigorously grounded through theoretical analysis, providing stability guarantees for the Koopman prediction, a principled Bayesian justification for the high-frequency regularization, and culminating in a multiscale generalization bound connecting spectral dynamics to improved generalization. Extensive experiments demonstrate FreKoo’s significant superiority over SOTA TDG approaches, particularly excelling in real-world streaming scenarios with complex drifts and uncertainties.
zh

[AI-88] A Comprehensive Survey on Physical Risk Control in the Era of Foundation Model-enabled Robotics IJCAI2025

【速读】:该论文试图解决生成式 AI (Generative AI) 驱动的机器人(FMRs)在物理世界中部署时所面临的物理风险问题,尤其是其行为对人类和周围物体安全的影响。解决方案的关键在于通过覆盖 FMRs 全生命周期的控制方法,包括预部署阶段、事故前阶段和事故后阶段,系统性地减轻潜在风险,并强调在事故前阶段的风险缓解策略、与人类的物理交互研究以及基础模型本身的核心问题。

链接: https://arxiv.org/abs/2505.12583
作者: Takeshi Kojima,Yaonan Zhu,Yusuke Iwasawa,Toshinori Kitamura,Gang Yan,Shu Morikuni,Ryosuke Takanami,Alfredo Solano,Tatsuya Matsushima,Akiko Murakami,Yutaka Matsuo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to IJCAI 2025 Survey Track

点击查看摘要

Abstract:Recent Foundation Model-enabled robotics (FMRs) display greatly improved general-purpose skills, enabling more adaptable automation than conventional robotics. Their ability to handle diverse tasks thus creates new opportunities to replace human labor. However, unlike general foundation models, FMRs interact with the physical world, where their actions directly affect the safety of humans and surrounding objects, requiring careful deployment and control. Based on this proposition, our survey comprehensively summarizes robot control approaches to mitigate physical risks by covering all the lifespan of FMRs ranging from pre-deployment to post-accident stage. Specifically, we broadly divide the timeline into the following three phases: (1) pre-deployment phase, (2) pre-incident phase, and (3) post-incident phase. Throughout this survey, we find that there is much room to study (i) pre-incident risk mitigation strategies, (ii) research that assumes physical interaction with humans, and (iii) essential issues of foundation models themselves. We hope that this survey will be a milestone in providing a high-resolution analysis of the physical risks of FMRs and their control, contributing to the realization of a good human-robot relationship.
zh

[AI-89] AdaDim: Dimensionality Adaptation for SSL Representational Dynamics

【速读】:该论文试图解决自监督学习(Self-Supervised Learning, SSL)中维度坍缩(dimensional collapse)的问题,即高维表示空间退化为低维子空间,从而影响模型性能。现有方法通过优化维度对比或样本对比策略来提升表示的维度和分布特性,同时利用投影头降低表示与目标之间的互信息。然而,当前研究缺乏对训练过程中影响这两个指标(熵 H® 和互信息 I(R;Z))动态机制的理解,以及它们与下游任务性能的关系。论文的关键解决方案是提出 AdaDim 方法,通过自适应地权衡特征去相关性和样本均匀分布的损失,以捕捉训练过程中的动态变化,从而在 H® 和 I(R;Z) 之间达到最优平衡,提升 SSL 模型性能。

链接: https://arxiv.org/abs/2505.12576
作者: Kiran Kokilepersaud,Mohit Prabhushankar,Ghassan AlRegib
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:A key factor in effective Self-Supervised learning (SSL) is preventing dimensional collapse, which is where higher-dimensional representation spaces span a lower-dimensional subspace. Therefore, SSL optimization strategies involve guiding a model to produce representations ( R ) with a higher dimensionality. Dimensionality is either optimized through a dimension-contrastive approach that encourages feature decorrelation or through a sample-contrastive method that promotes a uniform spread of sample representations. Both families of SSL algorithms also utilize a projection head that maps R into a lower-dimensional embedding space Z . Recent work has characterized the projection head as a filter of irrelevant features from the SSL objective by reducing mutual information, I(R;Z) . Therefore, the current literature’s view is that a good SSL representation space should have a high H® and a low I(R;Z) . However, this view of the problem is lacking in terms of an understanding of the underlying training dynamics that influences both terms, as well as how the values of H® and I(R;Z) arrived at the end of training reflect the downstream performance of an SSL model. We address both gaps in the literature by demonstrating that increases in H® due to feature decorrelation at the start of training lead to a higher I(R;Z) , while increases in H® due to samples distributing uniformly in a high-dimensional space at the end of training cause I(R;Z) to plateau or decrease. Furthermore, our analysis shows that the best performing SSL models do not have the highest H® nor the lowest I(R;Z) , but arrive at an optimal intermediate point for both. We develop a method called AdaDim to exploit these observed training dynamics by adaptively weighting between losses based on feature decorrelation and uniform sample spread.
zh

[AI-90] RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

【速读】:该论文试图解决现有评估大型语言模型(Large Language Models, LLMs)数学推理能力的基准测试主要依赖竞赛题目、形式化证明或人工设计的难题,无法反映实际研究环境中数学的本质问题。其解决方案的关键在于引入RealMath,这是一个直接从研究论文和数学论坛中提取的新型基准,用于评估LLMs在真实数学任务上的表现。该方法解决了三个关键挑战:获取多样化的研究级内容、通过可验证陈述实现可靠的自动化评估,以及设计一个可持续更新的数据集以降低数据污染风险。

链接: https://arxiv.org/abs/2505.12575
作者: Jie Zhang,Cezara Petrui,Kristina Nikolić,Florian Tramèr
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions – failing to capture the nature of mathematics encountered in actual research environments. We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs’ abilities on authentic mathematical tasks. Our approach addresses three critical challenges: sourcing diverse research-level content, enabling reliable automated evaluation through verifiable statements, and designing a continually refreshable dataset to mitigate contamination risks. Experimental results across multiple LLMs reveal surprising capabilities in handling research mathematics compared to competition problems, suggesting current models may already serve as valuable assistants for working mathematicians despite limitations on highly challenging problems. The code and dataset for RealMath are publicly available.
zh

[AI-91] A Survey of Attacks on Large Language Models

【速读】:该论文试图解决大规模语言模型(Large Language Models, LLMs)及其基于LLM的智能体在实际应用中所面临的严重安全与可靠性风险问题,包括恶意滥用、隐私泄露和服务中断等,这些问题削弱了用户信任并威胁社会安全。其解决方案的关键在于系统性地概述针对LLM和LLM-based agents的对抗性攻击,将这些攻击划分为三个阶段:训练阶段攻击、推理阶段攻击以及可用性与完整性攻击,并对每个阶段中的代表性攻击方法及其防御措施进行深入分析,以期为研究人员提供全面的安全视角和应对策略。

链接: https://arxiv.org/abs/2505.12567
作者: Wenrui Xu,Keshab K. Parhi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) and LLM-based agents have been widely deployed in a wide range of applications in the real world, including healthcare diagnostics, financial analysis, customer support, robotics, and autonomous driving, expanding their powerful capability of understanding, reasoning, and generating natural languages. However, the wide deployment of LLM-based applications exposes critical security and reliability risks, such as the potential for malicious misuse, privacy leakage, and service disruption that weaken user trust and undermine societal safety. This paper provides a systematic overview of the details of adversarial attacks targeting both LLMs and LLM-based agents. These attacks are organized into three phases in LLMs: Training-Phase Attacks, Inference-Phase Attacks, and Availability Integrity Attacks. For each phase, we analyze the details of representative and recently introduced attack methods along with their corresponding defenses. We hope our survey will provide a good tutorial and a comprehensive understanding of LLM security, especially for attacks on LLMs. We desire to raise attention to the risks inherent in widely deployed LLM-based applications and highlight the urgent need for robust mitigation strategies for evolving threats.
zh

[AI-92] Beyond Accuracy: EcoL2 Metric for Sustainable Neural PDE Solvers

【速读】:该论文试图解决深度学习驱动的偏微分方程(PDE)求解器在计算过程中产生的高碳排放问题,而当前研究主要关注模型精度的提升。解决方案的关键在于提出一种新的碳排放度量指标——EcoL2,该指标在数据收集、模型训练和部署过程中平衡了模型准确性与碳排放量,从而实现了对模型性能和环境成本的全面评估。

链接: https://arxiv.org/abs/2505.12556
作者: Taniya Kapoor,Abhishek Chandra,Anastasios Stamou,Stephen J Roberts
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world systems, from aerospace to railway engineering, are modeled with partial differential equations (PDEs) describing the physics of the system. Estimating robust solutions for such problems is essential. Deep learning-based architectures, such as neural PDE solvers, have recently gained traction as a reliable solution method. The current state of development of these approaches, however, primarily focuses on improving accuracy. The environmental impact of excessive computation, leading to increased carbon emissions, has largely been overlooked. This paper introduces a carbon emission measure for a range of PDE solvers. Our proposed metric, EcoL2, balances model accuracy with emissions across data collection, model training, and deployment. Experiments across both physics-informed machine learning and operator learning architectures demonstrate that the proposed metric presents a holistic assessment of model performance and emission cost. As such solvers grow in scale and deployment, EcoL2 represents a step toward building performant scientific machine learning systems with lower long-term environmental impact.
zh

[AI-93] owards Budget-Friendly Model-Agnostic Explanation Generation for Large Language Models

【速读】:该论文旨在解决大型语言模型(Large language models, LLMs)预测解释的挑战,尤其是在模型架构多样且部分模型为闭源的情况下,如何高效生成可信的解释。现有模型无关(model-agnostic)方法需要多次调用LLMs以获取足够的样本生成可靠解释,导致经济成本高昂。论文提出的解决方案关键在于通过从成本较低的模型中采样,生成可信的解释,并证明此类代理解释在下游任务中表现良好,从而为模型无关的解释方法提供了一种新的范式。

链接: https://arxiv.org/abs/2505.12509
作者: Junhao Liu,Haonan Yu,Xin Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With Large language models (LLMs) becoming increasingly prevalent in various applications, the need for interpreting their predictions has become a critical challenge. As LLMs vary in architecture and some are closed-sourced, model-agnostic techniques show great promise without requiring access to the model’s internal parameters. However, existing model-agnostic techniques need to invoke LLMs many times to gain sufficient samples for generating faithful explanations, which leads to high economic costs. In this paper, we show that it is practical to generate faithful explanations for large-scale LLMs by sampling from some budget-friendly models through a series of empirical studies. Moreover, we show that such proxy explanations also perform well on downstream tasks. Our analysis provides a new paradigm of model-agnostic explanation methods for LLMs, by including information from budget-friendly models.
zh

[AI-94] Unsupervised Invariant Risk Minimization

【速读】:该论文试图解决在无标签数据情况下实现对分布变化具有鲁棒性的表示学习问题,传统Invariance Risk Minimization (IRM)方法依赖于有标签数据来学习跨环境分布变化的鲁棒表示,而本文提出了一种无监督框架,通过特征分布对齐重新定义了不变性,关键在于利用无监督结构因果模型,支持环境条件下的样本生成与干预,并引入了两种方法:线性方法Principal Invariant Component Analysis (PICA)和深度生成模型Variational Invariant Autoencoder (VIAE),以提取不变方向并解耦环境不变与环境相关的潜在因子。

链接: https://arxiv.org/abs/2505.12506
作者: Yotam Norman,Ron Meir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a novel unsupervised framework for \emphInvariant Risk Minimization (IRM), extending the concept of invariance to settings where labels are unavailable. Traditional IRM methods rely on labeled data to learn representations that are robust to distributional shifts across environments. In contrast, our approach redefines invariance through feature distribution alignment, enabling robust representation learning from unlabeled data. We introduce two methods within this framework: Principal Invariant Component Analysis (PICA), a linear method that extracts invariant directions under Gaussian assumptions, and Variational Invariant Autoencoder (VIAE), a deep generative model that disentangles environment-invariant and environment-dependent latent factors. Our approach is based on a novel ``unsupervised’’ structural causal model and supports environment-conditioned sample-generation and intervention. Empirical evaluations on synthetic dataset and modified versions of MNIST demonstrate the effectiveness of our methods in capturing invariant structure, preserving relevant information, and generalizing across environments without access to labels.
zh

[AI-95] CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

【速读】:该论文旨在解决基于规则的强化学习(Rule-based Reinforcement Learning, RL)在语言模型(Language Models, LMs)训练过程中出现的训练不稳定问题,尤其是由于大策略更新和不当裁剪导致的训练崩溃。其解决方案的关键在于提出了一种名为Clipped Policy Gradient Optimization with Policy Drift (CPGD)的新算法,该算法通过基于KL散度的策略漂移约束动态正则化策略更新,并在对数比例上引入裁剪机制以防止过度策略更新,从而提升训练稳定性并改善性能。

链接: https://arxiv.org/abs/2505.12504
作者: Zongkai Liu,Fanqing Meng,Lingxiao Du,Zhixiang Zhou,Chao Yu,Wenqi Shao,Qiaosheng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods – such as GRPO, REINFORCE++, and RLOO – often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at this https URL.
zh

[AI-96] ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在事务型规划任务中表现不足的问题,特别是在需要ACID-like保证和实时中断恢复的场景下。解决方案的关键在于提出自适应LLM代理系统(Adaptive LLM Agent System, ALAS),通过将每个计划分解为角色专用代理、赋予其自动状态跟踪能力,并通过轻量级协议进行协调,从而克服LLMs在自我验证、上下文侵蚀、下一标记短视和缺乏持久状态等方面的缺陷。当发生中断时,代理采用基于历史的局部补偿机制,避免了高成本的全局重规划并限制了连锁效应。

链接: https://arxiv.org/abs/2505.12501
作者: Edward Y. Chang,Longling Geng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 36 pages, 10 figures, 19 tables

点击查看摘要

Abstract:Large language models (LLMs) excel at rapid generation of text and multimodal content, yet they falter on transaction-style planning that demands ACID-like guarantees and real-time disruption recovery. We present Adaptive LLM Agent System (ALAS), a framework that tackles four fundamental LLM deficits: (i) absence of self-verification, (ii) context erosion, (iii) next-token myopia, and (iv) lack of persistent state. ALAS decomposes each plan into role-specialized agents, equips them with automatic state tracking, and coordinates them through a lightweight protocol. When disruptions arise, agents apply history-aware local compensation, avoiding costly global replanning and containing cascade effects. On real-world, large-scale job-shop scheduling benchmarks, ALAS sets new best results for static sequential planning and excels in dynamic reactive scenarios with unexpected disruptions. These gains show that principled modularization plus targeted compensation can unlock scalable and resilient planning with LLMs.
zh

[AI-97] MARGE: Improving Math Reasoning for LLM s with Guided Exploration ICML2025

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在数学推理中因高质量查询不足而导致的性能受限问题,以及现有方法在自生成数据过程中因无效探索导致的虚假相关数据问题。解决方案的关键在于提出MARGE(Improving Math Reasoning with Guided Exploration),该方法通过系统性地探索自生成解法中的中间推理状态,实现充分的探索和改进的信用分配,从而提升数学推理能力。

链接: https://arxiv.org/abs/2505.12500
作者: Jingyue Gao,Runji Lin,Keming Lu,Bowen Yu,Junyang Lin,Jianyu Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: To appear at ICML 2025

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit strong potential in mathematical reasoning, yet their effectiveness is often limited by a shortage of high-quality queries. This limitation necessitates scaling up computational responses through self-generated data, yet current methods struggle due to spurious correlated data caused by ineffective exploration across all reasoning stages. To address such challenge, we introduce \textbfMARGE: Improving \textbfMath \textbfReasoning with \textbfGuided \textbfExploration, a novel method to address this issue and enhance mathematical reasoning through hit-guided exploration. MARGE systematically explores intermediate reasoning states derived from self-generated solutions, enabling adequate exploration and improved credit assignment throughout the reasoning process. Through extensive experiments across multiple backbone models and benchmarks, we demonstrate that MARGE significantly improves reasoning capabilities without requiring external annotations or training additional value models. Notably, MARGE improves both single-shot accuracy and exploration diversity, mitigating a common trade-off in alignment methods. These results demonstrate MARGE’s effectiveness in enhancing mathematical reasoning capabilities and unlocking the potential of scaling self-generated training data. Our code and models are available at \hrefthis https URLthis link.
zh

[AI-98] UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning

【速读】:该论文旨在解决传统监督微调(SFT)在训练视觉语言模型(VLMs)用于GUI代理时所面临的标注数据收集成本高、耗时且易出错的问题。其解决方案的关键在于提出一种自监督逆动力学任务,通过从GUI状态转移对中推断导致该转移的动作,使VLM能够学习到与用户操作相关的有效特征,同时避免无关变化的干扰,并且无需人工标注即可利用现有GUI轨迹数据进行训练,从而实现高效的数据扩展和模型优化。

链接: https://arxiv.org/abs/2505.12493
作者: Longxi Gao,Li Zhang,Mengwei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Training effective Vision Language Models (VLMs) for GUI agents typically relies on supervised fine-tuning (SFT) over large-scale annotated datasets, where the collection process is labor-intensive and error-prone. In this work, we propose a self-supervised inverse dynamics task to enable VLMs to learn from GUI transition pairs by inferring the action that caused that transition. This training task offers two advantages: (1) It enables VLMs to ignore variations unrelated to user actions (e.g., background refreshes, ads) and to focus on true affordances such as buttons and input fields within complex GUIs. (2) The training data can be easily obtained from existing GUI trajectories without requiring human annotation, and it can be easily scaled through automatic offline exploration. Using this training task, we propose UI-shift, a framework for enhancing VLM-based GUI agents through self-supervised reinforcement learning (RL). With only 2K training samples sourced from existing datasets, two VLMs – Qwen2.5-VL-3B and Qwen2.5-VL-7B – trained with UI-Shift achieve competitive or superior performance on grounding tasks (ScreenSpot-series benchmarks) and GUI automation tasks (AndroidControl), compared to SFT baselines and GUI-specific models that explicitly elicit reasoning abilities during RL. Our findings suggest a potential direction for enhancing VLMs for GUI agents by leveraging more self-supervised training data in the future.
zh

[AI-99] Unleashing Automated Congestion Control Customization in the Wild

【速读】:该论文旨在解决传统拥塞控制(Congestion Control, CC)算法在多样化服务需求和网络条件下难以保持高性能的问题。其解决方案的关键在于开发一个能够根据具体服务需求和网络状况自动定制拥塞控制逻辑的系统,该系统基于PCC Vivace这一在线学习(online-learning)机制的拥塞控制协议,通过实际案例验证了其性能优势,并讨论了在真实部署中对PCC Vivace的适应性修改与经验教训。

链接: https://arxiv.org/abs/2505.12492
作者: Amit Cohen,Lev Gloukhenki,Ravid Hadar,Eden Itah,Yehuda Shvut,Michael Schapira
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Congestion control (CC) crucially impacts user experience across Internet services like streaming, gaming, AR/VR, and connected cars. Traditionally, CC algorithm design seeks universal control rules that yield high performance across diverse application domains and networks. However, varying service needs and network conditions challenge this approach. We share operational experience with a system that automatically customizes congestion control logic to service needs and network conditions. We discuss design, deployment challenges, and solutions, highlighting performance benefits through case studies in streaming, gaming, connected cars, and more. Our system leverages PCC Vivace, an online-learning based congestion control protocol developed by researchers. Hence, along with insights from customizing congestion control, we also discuss lessons learned and modifications made to adapt PCC Vivace for real-world deployment. Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY) Cite as: arXiv:2505.12492 [cs.NI] (or arXiv:2505.12492v1 [cs.NI] for this version) https://doi.org/10.48550/arXiv.2505.12492 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-100] NeuroGen: Neural Network Parameter Generation via Large Language Models

【速读】:该论文试图解决神经网络(Neural Network, NN)参数获取的问题,这是机器学习自神经网络诞生以来最重要的问题之一。传统方法如反向传播和仅前向优化通过迭代数据拟合逐步优化参数,而本文提出了一种新方向:通过大型语言模型(Large Language Model, LLM)生成神经网络参数。解决方案的关键在于提出NeuroGen,这是一种基于数据、任务和网络架构描述的通用且易于实现的两阶段方法。第一阶段为参数参考知识注入,通过在神经网络检查点上预训练LLMs以建立对参数空间的基础理解;第二阶段为上下文增强指令微调,使LLMs通过丰富且任务感知的提示适应特定任务。

链接: https://arxiv.org/abs/2505.12470
作者: Jiaqi Wang,Yusen Zhang,Xi Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The three authors contributed equally to this work. The codes will be public after being accepted

点击查看摘要

Abstract:Acquiring the parameters of neural networks (NNs) has been one of the most important problems in machine learning since the inception of NNs. Traditional approaches, such as backpropagation and forward-only optimization, acquire parameters via iterative data fitting to gradually optimize them. This paper aims to explore the feasibility of a new direction: acquiring NN parameters via large language model generation. We propose NeuroGen, a generalized and easy-to-implement two-stage approach for NN parameter generation conditioned on descriptions of the data, task, and network architecture. Stage one is Parameter Reference Knowledge Injection, where LLMs are pretrained on NN checkpoints to build foundational understanding of parameter space, whereas stage two is Context-Enhanced Instruction Tuning, enabling LLMs to adapt to specific tasks through enriched, task-aware prompts. Experimental results demonstrate that NeuroGen effectively generates usable NN parameters. Our findings highlight the feasibility of LLM-based NN parameter generation and suggest a promising new paradigm where LLMs and lightweight NNs can coexist synergistically
zh

[AI-101] Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems ACL2025

【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)驱动应用中多智能体协作的性能与可扩展性问题,特别是针对协作策略中的细粒度机制缺乏系统研究的问题。其解决方案的关键在于通过系统性地分析协作策略的四个维度——智能体治理、参与控制、交互动态和对话历史管理,并在两种依赖上下文的场景下进行严格实验,量化这些策略对任务准确性和计算效率的影响,最终提出一种基于令牌-准确率比(Token-Accuracy Ratio, TAR)的优化框架,以平衡决策质量和资源利用。

链接: https://arxiv.org/abs/2505.12467
作者: Haochun Wang,Sendong Zhao,Jingbo Wang,Zewen Qiang,Bing Qin,Ting Liu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: ACL 2025

点击查看摘要

Abstract:Multi-agent collaboration has emerged as a pivotal paradigm for addressing complex, distributed tasks in large language model (LLM)-driven applications. While prior research has focused on high-level architectural frameworks, the granular mechanisms governing agents, critical to performance and scalability, remain underexplored. This study systematically investigates four dimensions of collaboration strategies: (1) agent governance, (2) participation control, (3) interaction dynamics, and (4) dialogue history management. Through rigorous experimentation under two context-dependent scenarios: Distributed Evidence Integration (DEI) and Structured Evidence Synthesis (SES), we quantify the impact of these strategies on both task accuracy and computational efficiency. Our findings reveal that centralized governance, instructor-led participation, ordered interaction patterns, and instructor-curated context summarization collectively optimize the trade-off between decision quality and resource utilization with the support of the proposed Token-Accuracy Ratio (TAR). This work establishes a foundation for designing adaptive, scalable multi-agent systems, shifting the focus from structural novelty to strategic interaction mechanics.
zh

[AI-102] Model Discovery with Grammatical Evolution. An Experiment with Prime Numbers

【速读】:该论文试图解决机器学习模型缺乏透明性和可解释性的问题,即传统基于输入-输出数据的决策和预测模型(如决策树或神经网络)难以提供可理解的数学公式形式的解析模型。解决方案的关键在于利用语法进化(Grammatical Evolution)技术来发现具有透明性、简洁性且组件和结构可读的解析模型。

链接: https://arxiv.org/abs/2505.12440
作者: Jakub Skrzyński,Dominik Sepioło,Antoni Ligęza
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Presented during 5th Polish Conference on Artificial Intelligence, published in “PROGRESS IN POLISH ARTIFICIAL INTELLIGENCE RESEARCH 5” ISBN 978-83-8156-696-4

点击查看摘要

Abstract:Machine Learning produces efficient decision and prediction models based on input-output data only. Such models have the form of decision trees or neural nets and are far from transparent analytical models, based on mathematical formulas. Analytical model discovery requires additional knowledge and may be performed with Grammatical Evolution. Such models are transparent, concise, and have readable components and structure. This paper reports on a non-trivial experiment with generating such models.
zh

[AI-103] Addressing the Scarcity of Benchmarks for Graph XAI

【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在图分类任务中可解释性不足的问题,即其决策过程对终端用户不透明,限制了其在安全关键型应用中的部署。为了解决这一问题,研究者提出了一个通用方法,用于从真实世界数据集中自动构建可解释人工智能(Explainable Artificial Intelligence, XAI)基准测试集。该解决方案的关键在于通过自动化手段生成大量具有已知真实子图特征的基准数据,从而克服现有XAI基准数据集在真实性和规模上的局限性。

链接: https://arxiv.org/abs/2505.12437
作者: Michele Fontanesi,Alessio Micheli,Marco Podda,Domenico Tortorella
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Graph Neural Networks (GNNs) have become the de facto model for learning from structured data, their decisional process remains opaque to the end user, undermining their deployment in safety-critical applications. In the case of graph classification, Explainable Artificial Intelligence (XAI) techniques address this major issue by identifying sub-graph motifs that explain predictions. However, advancements in this field are hindered by a chronic scarcity of benchmark datasets with known ground-truth motifs to assess the explanations’ quality. Current graph XAI benchmarks are limited to synthetic data or a handful of real-world tasks hand-curated by domain experts. In this paper, we propose a general method to automate the construction of XAI benchmarks for graph classification from real-world datasets. We provide both 15 ready-made benchmarks, as well as the code to generate more than 2000 additional XAI benchmarks with our method. As a use case, we employ our benchmarks to assess the effectiveness of some popular graph explainers.
zh

[AI-104] SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment ACL’25

【速读】:该论文试图解决直接偏好优化(Direct Preference Optimization, DPO)在生成符合人类偏好的响应方面能力有限以及结果不够稳健的问题。其解决方案的关键在于提出一种新颖的自引导直接偏好优化算法(Self-Guided Direct Preference Optimization, SGDPO),该算法通过引入一个引导项来调控优化过程中的梯度流动,从而实现对选择和拒绝奖励更新的细粒度控制。

链接: https://arxiv.org/abs/2505.12435
作者: Wenqiao Zhu,Ji Liu,Lulu Wang,Jun Wu,Yulun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 18 pages, to appear in ACL’25

点击查看摘要

Abstract:Direct Preference Optimization (DPO) is broadly utilized for aligning Large Language Models (LLMs) with human values because of its flexibility. Despite its effectiveness, it has been observed that the capability of DPO to generate human-preferred response is limited and the results of DPO are far from resilient. To address these limitations, in this paper we propose a novel Self-Guided Direct Preference Optimization algorithm, i.e., SGDPO, which incorporates a pilot term to steer the gradient flow during the optimization process, allowing for fine-grained control over the updates of chosen and rejected rewards. We provide a detailed theoretical analysis of our proposed method and elucidate its operational mechanism. Furthermore, we conduct comprehensive experiments on various models and benchmarks. The extensive experimental results demonstrate the consistency between the empirical results and our theoretical analysis and confirm the effectiveness of our proposed approach (up to 9.19% higher score).
zh

[AI-105] EvoGPT : Enhancing Test Suite Robustness via LLM -Based Generation and Genetic Optimization

【速读】:该论文试图解决自动化单元测试生成中测试用例多样性不足与故障检测能力有限的问题。其解决方案的关键在于提出一种混合框架EvoGPT,该框架将基于大型语言模型(Large Language Models, LLMs)的测试生成与进化搜索技术相结合,通过多样化的温度采样生成初始测试用例,随后利用生成-修复循环和覆盖率引导的断言增强优化测试用例,并通过遗传算法进化测试套件,以突变分数为优先的适应度函数提升测试效果,从而实现更有效且鲁棒的测试用例生成。

链接: https://arxiv.org/abs/2505.12424
作者: Lior Broide,Roni Stern
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently emerged as promising tools for automated unit test generation. We introduce a hybrid framework called EvoGPT that integrates LLM-based test generation with evolutionary search techniques to create diverse, fault-revealing unit tests. Unit tests are initially generated with diverse temperature sampling to maximize behavioral and test suite diversity, followed by a generation-repair loop and coverage-guided assertion enhancement. The resulting test suites are evolved using genetic algorithms, guided by a fitness function prioritizing mutation score over traditional coverage metrics. This design emphasizes the primary objective of unit testing-fault detection. Evaluated on multiple open-source Java projects, EvoGPT achieves an average improvement of 10% in both code coverage and mutation score compared to LLMs and traditional search-based software testing baselines. These results demonstrate that combining LLM-driven diversity, targeted repair, and evolutionary optimization produces more effective and resilient test suites.
zh

[AI-106] Fixed Point Explainability

【速读】:该论文试图解决模型与其解释器之间交互稳定性的评估问题,旨在揭示模型的隐藏行为和解释方法的缺陷。解决方案的关键在于引入一种基于“为什么回归”原则的固定点解释(fixed point explanations)形式化概念,并通过递归应用来评估这种稳定性,该方法满足最小性、稳定性和忠实性等属性。

链接: https://arxiv.org/abs/2505.12421
作者: Emanuele La Malfa,Jon Vadillo,Marco Molinari,Michael Wooldridge
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:This paper introduces a formal notion of fixed point explanations, inspired by the “why regress” principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results.
zh

[AI-107] Hyperbolic Residual Quantization: Discrete Representations for Data with Latent Hierarchies

【速读】:该论文旨在解决传统残差量化(Residual Quantization, RQ)在处理具有潜在层次结构的数据时存在的局限性,特别是在建模层次分支方面存在根本性的不匹配问题。其解决方案的关键在于提出双曲残差量化(Hyperbolic Residual Quantization, HRQ),通过将数据原生嵌入双曲流形中,并利用双曲运算和距离度量进行残差量化,从而赋予模型与层次结构自然对齐的归纳偏置,提升对层次数据的离散表示能力。

链接: https://arxiv.org/abs/2505.12404
作者: Piotr Piękos,Subhradeep Kayal,Alexandros Karatzoglou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hierarchical data arise in countless domains, from biological taxonomies and organizational charts to legal codes and knowledge graphs. Residual Quantization (RQ) is widely used to generate discrete, multitoken representations for such data by iteratively quantizing residuals in a multilevel codebook. However, its reliance on Euclidean geometry can introduce fundamental mismatches that hinder modeling of hierarchical branching, necessary for faithful representation of hierarchical data. In this work, we propose Hyperbolic Residual Quantization (HRQ), which embeds data natively in a hyperbolic manifold and performs residual quantization using hyperbolic operations and distance metrics. By adapting the embedding network, residual computation, and distance metric to hyperbolic geometry, HRQ imparts an inductive bias that aligns naturally with hierarchical branching. We claim that HRQ in comparison to RQ can generate more useful for downstream tasks discrete hierarchical representations for data with latent hierarchies. We evaluate HRQ on two tasks: supervised hierarchy modeling using WordNet hypernym trees, where the model is supervised to learn the latent hierarchy - and hierarchy discovery, where, while latent hierarchy exists in the data, the model is not directly trained or evaluated on a task related to the hierarchy. Across both scenarios, HRQ hierarchical tokens yield better performance on downstream tasks compared to Euclidean RQ with gains of up to 20% for the hierarchy modeling task. Our results demonstrate that integrating hyperbolic geometry into discrete representation learning substantially enhances the ability to capture latent hierarchies.
zh

[AI-108] Few-Shot Concept Unlearning with Low Rank Adaptation

【速读】:该论文试图解决图像生成模型在生成过程中可能产生敏感图像数据的问题,这些问题可能威胁隐私或违反版权法规。其解决方案的关键在于通过更新文本编码器最终层的梯度,从扩散模型中移除特定概念的影响。该方法利用加权损失函数结合文本反转和低秩技术,对稳定扩散模型的文本编码器进行反向传播优化,从而在文本-图像嵌入空间中消除特定概念的影响,使模型在被提示时无法生成包含该概念的图像。

链接: https://arxiv.org/abs/2505.12395
作者: Udaya Shreyas,L.N. Aadarsh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Image Generation models are a trending topic nowadays, with many people utilizing Artificial Intelligence models in order to generate images. There are many such models which, given a prompt of a text, will generate an image which depicts said prompt. There are many image generation models, such as Latent Diffusion Models, Denoising Diffusion Probabilistic Models, Generative Adversarial Networks and many more. When generating images, these models can generate sensitive image data, which can be threatening to privacy or may violate copyright laws of private entities. Machine unlearning aims at removing the influence of specific data subsets from the trained models and in the case of image generation models, remove the influence of a concept such that the model is unable to generate said images of the concept when prompted. Conventional retraining of the model can take upto days, hence fast algorithms are the need of the hour. In this paper we propose an algorithm that aims to remove the influence of concepts in diffusion models through updating the gradients of the final layers of the text encoders. Using a weighted loss function, we utilize backpropagation in order to update the weights of the final layers of the Text Encoder componet of the Stable Diffusion Model, removing influence of the concept from the text-image embedding space, such that when prompted, the result is an image not containing the concept. The weighted loss function makes use of Textual Inversion and Low-Rank this http URL perform our experiments on Latent Diffusion Models, namely the Stable Diffusion v2 model, with an average concept unlearning runtime of 50 seconds using 4-5 images.
zh

[AI-109] Data Sharing with a Generative AI Competitor

【速读】:该论文试图解决生成式人工智能(Generative AI)平台与内容创作企业之间在数据共享决策中的经济激励问题,特别是在平台依赖竞争性内容提供者并可获取第三方数据源的背景下。其解决方案的关键在于构建一个斯塔克尔伯格博弈(Stackelberg game)模型,其中内容创作企业首先决定共享自有数据集的数量,而GenAI平台随后决定从外部专家处获取额外数据的数量。通过分析该博弈的子博弈完美均衡,论文揭示了一个关键现象:企业可能愿意支付费用以促使GenAI共享自身数据,从而导致一种高成本的数据共享均衡。这一发现为优化不同设计目标下的数据定价策略提供了理论依据。

链接: https://arxiv.org/abs/2505.12386
作者: Boaz Taitler,Omer Madmon,Moshe Tennenholtz,Omer Ben-Porat
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As GenAI platforms grow, their dependence on content from competing providers, combined with access to alternative data sources, creates new challenges for data-sharing decisions. In this paper, we provide a model of data sharing between a content creation firm and a GenAI platform that can also acquire content from third-party experts. The interaction is modeled as a Stackelberg game: the firm first decides how much of its proprietary dataset to share with GenAI, and GenAI subsequently determines how much additional data to acquire from external experts. Their utilities depend on user traffic, monetary transfers, and the cost of acquiring additional data from external experts. We characterize the unique subgame perfect equilibrium of the game and uncover a surprising phenomenon: The firm may be willing to pay GenAI to share the firm’s own data, leading to a costly data-sharing equilibrium. We further characterize the set of Pareto improving data prices, and show that such improvements occur only when the firm pays to share data. Finally, we study how the price can be set to optimize different design objectives, such as promoting firm data sharing, expert data acquisition, or a balance of both. Our results shed light on the economic forces shaping data-sharing partnerships in the age of GenAI, and provide guidance for platforms, regulators and policymakers seeking to design effective data exchange mechanisms.
zh

[AI-110] Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning

【速读】:该论文旨在解决GUI代理在复杂、高分辨率专业环境中将用户指令精确映射到界面元素的挑战,传统监督微调方法在数据量和泛化能力方面存在局限。其解决方案的关键在于引入一种基于强化学习(Reinforcement Learning, RL)的框架,包含三个核心策略:高质量种子数据的筛选、基于预测准确性的密集策略梯度反馈,以及利用注意力图迭代优化模型的自我进化强化微调机制。

链接: https://arxiv.org/abs/2505.12370
作者: Xinbin Yuan,Jian Zhang,Kaixin Li,Zhuoxuan Cai,Lujian Yao,Jie Chen,Enguang Wang,Qibin Hou,Jinwei Chen,Peng-Tao Jiang,Bo Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution, professional environments. Traditional supervised finetuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL) based framework that incorporates three core strategies: (1) seed data curation to ensure high quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks. Notably, it attains 47.3% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2%. These findings underscore the effectiveness of RL-based approaches in enhancing GUI agent performance, particularly in high-resolution, complex environments.
zh

[AI-111] Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

【速读】:该论文试图解决知识图谱中多跳推理的问题,现有方法虽然利用了实体的几何构造,但未能将逻辑操作映射到几何变换,而是依赖神经网络组件来学习这些操作,导致缺乏完全的几何可解释性。解决方案的关键在于提出GeometrE,这是一种无需学习逻辑操作的几何嵌入方法,能够实现完整的几何可解释性,并引入了一种传递损失函数以保持逻辑规则∀a,b,c: r(a,b) ∧ r(b,c) → r(a,c)。

链接: https://arxiv.org/abs/2505.12369
作者: Fernando Zhapa-Camacho,Robert Hoehndorf
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Geometric embedding methods have shown to be useful for multi-hop reasoning on knowledge graphs by mapping entities and logical operations to geometric regions and geometric transformations, respectively. Geometric embeddings provide direct interpretability framework for queries. However, current methods have only leveraged the geometric construction of entities, failing to map logical operations to geometric transformations and, instead, using neural components to learn these operations. We introduce GeometrE, a geometric embedding method for multi-hop reasoning, which does not require learning the logical operations and enables full geometric interpretability. Additionally, unlike previous methods, we introduce a transitive loss function and show that it can preserve the logical rule \forall a,b,c: r(a,b) \land r(b,c) \to r(a,c) . Our experiments show that GeometrE outperforms current state-of-the-art methods on standard benchmark datasets.
zh

[AI-112] DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在强化学习训练过程中存在的问题,特别是Group Relative Policy Optimization (GRPO) 方法中存在的问题,如问题难度偏差、熵不稳定以及数据不平衡等。其解决方案的关键在于提出一种新的 Discriminative Constrained Optimization (DisCO) 框架,该框架基于判别学习原则,通过引入由评分函数定义的判别目标、使用非截断的强化学习代理目标以及采用约束优化方法来控制KL散度,从而有效消除难度偏差、提升训练稳定性并增强对数据不平衡的适应能力。

链接: https://arxiv.org/abs/2505.12366
作者: Gang Li,Ming Lin,Tomer Galanti,Zhengzhong Tu,Tianbao Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 4 figures

点击查看摘要

Abstract:The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint, ensuring stable training. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for an 1.5B model.
zh

[AI-113] Adaptive MPC-based quadrupedal robot control under periodic disturbances

【速读】:该论文旨在解决四足机器人在面对周期性干扰时的轨迹跟踪问题,此类干扰在以往的研究中未被专门处理。解决方案的关键在于使用轻量级回归器,结合简化的机器人动力学模型,以提取干扰的幅值和频率特性,从而实现对周期性干扰的有效估计与补偿。实验结果表明,该方法在静态干扰补偿基础上实现了性能提升。

链接: https://arxiv.org/abs/2505.12361
作者: Elizaveta Pestova,Ilya Osokin,Danil Belov,Pavel Osinenko
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Recent advancements in adaptive control for reference trajectory tracking enable quadrupedal robots to perform locomotion tasks under challenging conditions. There are methods enabling the estimation of the external disturbances in terms of forces and torques. However, a specific case of disturbances that are periodic was not explicitly tackled in application to quadrupeds. This work is devoted to the estimation of the periodic disturbances with a lightweight regressor using simplified robot dynamics and extracting the disturbance properties in terms of the magnitude and frequency. Experimental evidence suggests performance improvement over the baseline static disturbance compensation. All source files, including simulation setups, code, and calculation scripts, are available on GitHub at this https URL.
zh

[AI-114] AbFlowNet: Optimizing Antibody-Antigen Binding Energy via Diffusion-GFlowNet Fusion

【速读】:该论文试图解决抗体互补决定区(Complementarity Determining Regions, CDRs)设计中计算方法存在的问题,即现有方法依赖重建损失而未联合优化结合能,且结合能优化通常通过计算成本高昂的在线强化学习(Online Reinforcement Learning, RL)流程实现,该流程严重依赖不可靠的结合能估计器。解决方案的关键在于提出AbFlowNet,一种将GFlowNet与扩散模型相结合的生成框架,通过将每个扩散步骤视为GFlowNet中的状态,直接在训练过程中引入能量信号,从而联合优化标准扩散损失和结合能,统一了扩散与奖励优化过程。

链接: https://arxiv.org/abs/2505.12358
作者: Abrar Rahman Abir,Haz Sameen Shahgir,Md Rownok Zahan Ratul,Md Toki Tahmid,Greg Ver Steeg,Yue Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Complementarity Determining Regions (CDRs) are critical segments of an antibody that facilitate binding to specific antigens. Current computational methods for CDR design utilize reconstruction losses and do not jointly optimize binding energy, a crucial metric for antibody efficacy. Rather, binding energy optimization is done through computationally expensive Online Reinforcement Learning (RL) pipelines rely heavily on unreliable binding energy estimators. In this paper, we propose AbFlowNet, a novel generative framework that integrates GFlowNet with Diffusion models. By framing each diffusion step as a state in the GFlowNet framework, AbFlowNet jointly optimizes standard diffusion losses and binding energy by directly incorporating energy signals into the training process, thereby unifying diffusion and reward optimization in a single procedure. Experimental results show that AbFlowNet outperforms the base diffusion model by 3.06% in amino acid recovery, 20.40% in geometric reconstruction (RMSD), and 3.60% in binding energy improvement ratio. ABFlowNet also decreases Top-1 total energy and binding energy errors by 24.8% and 38.1% without pseudo-labeling the test dataset or using computationally expensive online RL regimes.
zh

[AI-115] GATES: Cost-aware Dynamic Workflow Scheduling via Graph Attention Networks and Evolution Strategy IJCAI-25

【速读】:该论文旨在解决云计算中的成本感知动态工作流调度(Cost-aware Dynamic Workflow Scheduling, CADWS)问题,即如何设计有效的调度策略,以高效地将动态到达的工作流任务(表示为有向无环图,DAG)分配到合适的虚拟机(VM)。其解决方案的关键在于提出了一种结合图注意力网络(Graph Attention Networks-based policy network)与进化策略(Evolution Strategy)的新型深度强化学习方法,称为GATES。该方法通过学习DAG中任务之间的拓扑关系来捕捉当前任务调度对后续任务的影响,并利用进化策略的鲁棒性、探索性和对延迟奖励的容忍性,实现稳定有效的策略学习。

链接: https://arxiv.org/abs/2505.12355
作者: Ya Shen,Gang Chen,Hui Ma,Mengjie Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted by the 34th International Joint Conference on Artificial Intelligence (IJCAI-25)

点击查看摘要

Abstract:Cost-aware Dynamic Workflow Scheduling (CADWS) is a key challenge in cloud computing, focusing on devising an effective scheduling policy to efficiently schedule dynamically arriving workflow tasks, represented as Directed Acyclic Graphs (DAG), to suitable virtual machines (VMs). Deep reinforcement learning (DRL) has been widely employed for automated scheduling policy design. However, the performance of DRL is heavily influenced by the design of the problem-tailored policy network and is highly sensitive to hyperparameters and the design of reward feedback. Considering the above-mentioned issues, this study proposes a novel DRL method combining Graph Attention Networks-based policy network and Evolution Strategy, referred to as GATES. The contributions of GATES are summarized as follows: (1) GATES can capture the impact of current task scheduling on subsequent tasks by learning the topological relationships between tasks in a DAG. (2) GATES can learn the importance of each VM to ready tasks, increasing the chance of selecting the optimal VM. (3) Utilizing Evolution Strategy’s robustness, exploratory nature, and tolerance for delayed rewards, GATES achieves stable policy learning in CADWS. Extensive experimental results demonstrate the superiority of the proposed GATES in CADWS, outperforming several state-of-the-art algorithms. Codes are available at: this https URL
zh

[AI-116] A universal policy wrapper with guarantees

【速读】:该论文试图解决强化学习代理在追求高性能的同时缺乏严格安全保证的问题(safety assurances)。其解决方案的关键在于引入一种通用的策略封装器(universal policy wrapper),该封装器能够在高性能的基础策略(base policy)与具有已知收敛性质的回退策略(fallback policy)之间进行选择性切换。基础策略的值函数(value function)监督这一切换过程,确保系统始终处于稳定路径上,从而在保持或提升基础策略性能的同时,继承回退策略的目标达成保证(goal-reaching guarantees)。

链接: https://arxiv.org/abs/2505.12354
作者: Anton Bolychev,Georgiy Malaniya,Grigory Yaremenko,Anastasia Krasnaya,Pavel Osinenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We introduce a universal policy wrapper for reinforcement learning agents that ensures formal goal-reaching guarantees. In contrast to standard reinforcement learning algorithms that excel in performance but lack rigorous safety assurances, our wrapper selectively switches between a high-performing base policy – derived from any existing RL method – and a fallback policy with known convergence properties. Base policy’s value function supervises this switching process, determining when the fallback policy should override the base policy to ensure the system remains on a stable path. The analysis proves that our wrapper inherits the fallback policy’s goal-reaching guarantees while preserving or improving upon the performance of the base policy. Notably, it operates without needing additional system knowledge or online constrained optimization, making it readily deployable across diverse reinforcement learning architectures and tasks.
zh

[AI-117] Importance Sampling for Nonlinear Models ICML2025

【速读】:该论文试图解决在非线性模型中缺乏类似线性模型中基于范数和杠杆分数的重要性数据点识别工具的问题,这一问题导致非线性模型在计算效率、可解释性和异常值检测方面存在局限。解决方案的关键在于引入非线性映射的伴随算子(adjoint operator)概念,并将其用于将基于范数和杠杆分数的重要性采样方法推广到非线性设置,从而为非线性映射提供近似保证,类似于线性子空间嵌入的效果。

链接: https://arxiv.org/abs/2505.12353
作者: Prakash Palanivelu Rajmohan,Fred Roosta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: This work is accepted at ICML 2025

点击查看摘要

Abstract:While norm-based and leverage-score-based methods have been extensively studied for identifying “important” data points in linear models, analogous tools for nonlinear models remain significantly underdeveloped. By introducing the concept of the adjoint operator of a nonlinear map, we address this gap and generalize norm-based and leverage-score-based importance sampling to nonlinear settings. We demonstrate that sampling based on these generalized notions of norm and leverage scores provides approximation guarantees for the underlying nonlinear mapping, similar to linear subspace embeddings. As direct applications, these nonlinear scores not only reduce the computational complexity of training nonlinear models by enabling efficient sampling over large datasets but also offer a novel mechanism for model explainability and outlier detection. Our contributions are supported by both theoretical analyses and experimental results across a variety of supervised learning scenarios.
zh

[AI-118] Multi-CALF: A Policy Combination Approach with Statistical Guarantees

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)中策略组合的稳定性与性能优化问题。解决方案的关键在于提出Multi-CALF算法,该算法通过智能融合基于相对价值改进的强化学习策略,将标准RL策略与具有理论支持的替代策略相结合,在继承形式化稳定性保证的同时,通常实现优于单一策略的性能。

链接: https://arxiv.org/abs/2505.12350
作者: Georgiy Malaniya,Anton Bolychev,Grigory Yaremenko,Anastasia Krasnaya,Pavel Osinenko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We introduce Multi-CALF, an algorithm that intelligently combines reinforcement learning policies based on their relative value improvements. Our approach integrates a standard RL policy with a theoretically-backed alternative policy, inheriting formal stability guarantees while often achieving better performance than either policy individually. We prove that our combined policy converges to a specified goal set with known probability and provide precise bounds on maximum deviation and convergence time. Empirical validation on control tasks demonstrates enhanced performance while maintaining stability guarantees.
zh

[AI-119] Reasoning -CV: Fine-tuning Powerful Reasoning LLM s for Knowledge-Assisted Claim Verification

【速读】:该论文旨在解决虚假信息传播中的事实核查问题,特别是针对现有基于大型语言模型(Large Language Models, LLMs)的声明核查方法中因声明分解过程引入错误的问题。其解决方案的关键在于提出一种新的Chain-of-Thought (CoT)-Verify范式,该范式通过直接生成与原始复杂声明相关的CoT-verification路径,无需将声明分解为独立子声明并分别验证,从而减少分解过程中可能产生的误差。此外,该研究还提出了一个名为Reasoning-CV的自然微调方法,通过监督微调和自改进直接偏好优化阶段提升LLMs的核查能力。

链接: https://arxiv.org/abs/2505.12348
作者: Zhi Zheng,Wee Sun Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Claim verification is essential in combating misinformation, and large language models (LLMs) have recently emerged in this area as powerful tools for assessing the veracity of claims using external knowledge. Existing LLM-based methods for claim verification typically adopt a Decompose-Then-Verify paradigm, which involves decomposing complex claims into several independent sub-claims and verifying each sub-claim separately. However, this paradigm often introduces errors during the claim decomposition process. To mitigate these errors, we propose to develop the Chain-of-Thought (CoT)-Verify paradigm, which leverages LLM reasoning methods to generate CoT-verification paths for the original complex claim without requiring decompositions into sub-claims and separate verification stages. The CoT-Verify paradigm allows us to propose a natural fine-tuning method called Reasoning-CV to enhance the verification capabilities in LLMs. Reasoning-CV includes a supervised fine-tuning (SFT) stage and a self-improvement direct preference optimization (DPO) stage. Utilizing only an 8B pre-trained LLM, Reasoning-CV demonstrates superior knowledge-assisted claim verification performances compared to existing Decompose-Then-Verify methods, as well as powerful black-box LLMs such as GPT-4o+CoT and o1-preview. Our code is available.
zh

[AI-120] SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization

【速读】:该论文试图解决传统Group Relative Policy Optimization (GRPO)在策略更新过程中对所有输入提示(prompt)一视同仁的问题,而忽略了模型对不同提示的不确定性信息。解决方案的关键在于提出SEED-GRPO(Semantic Entropy EnhanceD GRPO),通过显式测量语言模型对输入提示的语义熵(semantic entropy)来反映其不确定性。语义熵衡量给定提示下生成答案的意义多样性,并据此调节策略更新的幅度,从而实现基于问题不确定性的动态策略更新调整。

链接: https://arxiv.org/abs/2505.12346
作者: Minghan Chen,Guikun Chen,Wenguan Wang,Yi Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: On going project

点击查看摘要

Abstract:Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions): some lead to consistent, semantically similar answers, while others yield diverse or contradictory outputs. This variation reflects LLM’s uncertainty about the input prompt, a signal of how confidently the model understands a given problem. However, vanilla Group Relative Policy Optimization (GRPO) treats all prompts equally during policy updates, ignoring this important information about the model’s knowledge boundaries. To address this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which explicitly measures LLMs’ uncertainty of the input prompts semantic entropy. Semantic entropy measures the diversity of meaning in multiple generated answers given a prompt and uses this to modulate the magnitude of policy updates. This uncertainty-aware training mechanism enables dynamic adjustment of policy update magnitudes based on question uncertainty. It allows more conservative updates on high-uncertainty questions while maintaining the original learning signal on confident ones. Experimental results on five mathematical reasoning benchmarks (AIME24 56.7, AMC 68.7, MATH 83.4, Minerva 34.2, and OlympiadBench 48.0) demonstrate that SEED-GRPO achieves new state-of-the-art performance in average accuracy, validating the effectiveness of uncertainty-aware policy optimization.
zh

[AI-121] Enhancing User-Oriented Proactivity in Open-Domain Dialogues with Critic Guidance

【速读】:该论文试图解决现有基于大语言模型(Large Language Models, LLMs)的对话系统在用户导向主动性方面的不足,即无法主动理解用户的聊天偏好并引导对话向以用户为中心的话题发展,这可能导致用户感到不被重视,从而降低其满意度和继续对话的意愿。解决方案的关键在于提出一种用户导向的主动聊天机器人(User-oriented Proactive Chatbot, UPC),通过构建一个基于LLM-as-a-judge策略的评价器来评估主动性,并利用该评价器指导对话生成,从而增强对话的用户导向性。此外,引入ISCO-800数据集以保证用户背景的多样性,并采用迭代课程学习方法逐步提升聊天机器人在不同沟通难度用户中的表现。

链接: https://arxiv.org/abs/2505.12334
作者: Yufeng Wang,Jinwu Hu,Ziteng Huang,Kunyang Lin,Zitian Zhang,Peihao Chen,Yu Hu,Qianyue Wang,Zhuliang Yu,Bin Sun,Xiaofen Xing,Qingfang Zheng,Mingkui Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Open-domain dialogue systems aim to generate natural and engaging conversations, providing significant practical value in real applications such as social robotics and personal assistants. The advent of large language models (LLMs) has greatly advanced this field by improving context understanding and conversational fluency. However, existing LLM-based dialogue systems often fall short in proactively understanding the user’s chatting preferences and guiding conversations toward user-centered topics. This lack of user-oriented proactivity can lead users to feel unappreciated, reducing their satisfaction and willingness to continue the conversation in human-computer interactions. To address this issue, we propose a User-oriented Proactive Chatbot (UPC) to enhance the user-oriented proactivity. Specifically, we first construct a critic to evaluate this proactivity inspired by the LLM-as-a-judge strategy. Given the scarcity of high-quality training data, we then employ the critic to guide dialogues between the chatbot and user agents, generating a corpus with enhanced user-oriented proactivity. To ensure the diversity of the user backgrounds, we introduce the ISCO-800, a diverse user background dataset for constructing user agents. Moreover, considering the communication difficulty varies among users, we propose an iterative curriculum learning method that trains the chatbot from easy-to-communicate users to more challenging ones, thereby gradually enhancing its performance. Experiments demonstrate that our proposed training method is applicable to different LLMs, improving user-oriented proactivity and attractiveness in open-domain dialogues.
zh

[AI-122] MPRM: A Markov Path-based Rule Miner for Efficient and Interpretable Knowledge Graph Reasoning

【速读】:该论文旨在解决知识图谱中规则挖掘的可解释性链接预测问题,尤其是针对基于深度学习的方法在大规模知识图谱中面临显著的内存和时间挑战,以及传统方法受限于刚性置信度指标导致计算成本高昂的问题。解决方案的关键在于提出一种名为MPRM的新规则挖掘方法,该方法将基于规则的推理建模为马尔可夫链,并利用从聚合路径概率中推导出的高效置信度指标,从而大幅降低计算需求。

链接: https://arxiv.org/abs/2505.12329
作者: Mingyang Li,Song Wang,Ning Cai
机构: 未知
类目: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Rule mining in knowledge graphs enables interpretable link prediction. However, deep learning-based rule mining methods face significant memory and time challenges for large-scale knowledge graphs, whereas traditional approaches, limited by rigid confidence metrics, incur high computational costs despite sampling techniques. To address these challenges, we propose MPRM, a novel rule mining method that models rule-based inference as a Markov chain and uses an efficient confidence metric derived from aggregated path probabilities, significantly lowering computational demands. Experiments on multiple datasets show that MPRM efficiently mines knowledge graphs with over a million facts, sampling less than 1% of facts on a single CPU in 22 seconds, while preserving interpretability and boosting inference accuracy by up to 11% over baselines.
zh

[AI-123] Robust Planning for Autonomous Driving via Mixed Adversarial Diffusion Predictions ICRA

【速读】:该论文试图解决自动驾驶中面对对抗性行为时的规划鲁棒性问题,即如何在保证安全的同时避免对正常代理行为过度保守。解决方案的关键在于利用一种扩散模型生成的正常和对抗性代理预测的混合分布,通过预期成本对路径进行评分,从而实现对对抗性行为的鲁棒性,同时在代理行为正常时保持较低的保守性。该方法不同于现有方法,不依赖于过度加权对抗性行为或使用可能不适用于所有驾驶场景的硬性安全约束。

链接: https://arxiv.org/abs/2505.12327
作者: Albert Zhao,Stefano Soatto
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: IEEE International Conference on Robotics and Automation (ICRA) 2025

点击查看摘要

Abstract:We describe a robust planning method for autonomous driving that mixes normal and adversarial agent predictions output by a diffusion model trained for motion prediction. We first train a diffusion model to learn an unbiased distribution of normal agent behaviors. We then generate a distribution of adversarial predictions by biasing the diffusion model at test time to generate predictions that are likely to collide with a candidate plan. We score plans using expected cost with respect to a mixture distribution of normal and adversarial predictions, leading to a planner that is robust against adversarial behaviors but not overly conservative when agents behave normally. Unlike current approaches, we do not use risk measures that over-weight adversarial behaviors while placing little to no weight on low-cost normal behaviors or use hard safety constraints that may not be appropriate for all driving scenarios. We show the effectiveness of our method on single-agent and multi-agent jaywalking scenarios as well as a red light violation scenario.
zh

[AI-124] BeliefNest: A Joint Action Simulator for Embodied Agents with Theory of Mind

【速读】:该论文试图解决如何使具身智能体在开放领域任务中通过心智理论(Theory of Mind)进行协作的问题,特别是如何让智能体推断他人的信念并预测其基于信念的行为。解决方案的关键在于提出了一种开源模拟器BeliefNest,该模拟器在Minecraft环境中动态且分层地构建模拟器,使智能体能够显式表示自身及他人的嵌套信念状态,并通过基于每个信念状态的提示生成机制,支持利用大语言模型(LLMs)设计和评估智能体控制方法。

链接: https://arxiv.org/abs/2505.12321
作者: Rikunari Sagara,Koichiro Terao,Naoto Iwahashi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces an open-source simulator, BeliefNest, designed to enable embodied agents to perform collaborative tasks by leveraging Theory of Mind. BeliefNest dynamically and hierarchically constructs simulators within a Minecraft environment, allowing agents to explicitly represent nested belief states about themselves and others. This enables agent control in open-domain tasks that require Theory of Mind reasoning. The simulator provides a prompt generation mechanism based on each belief state, facilitating the design and evaluation of methods for agent control utilizing large language models (LLMs). We demonstrate through experiments that agents can infer others’ beliefs and predict their belief-based actions in false-belief tasks.
zh

[AI-125] Community Search in Time-dependent Road-social Attributed Networks

【速读】:该论文试图解决现有基于凝聚子图的社区搜索方法仅考虑单一属性(如关键词或位置)而忽视两者联合影响的问题,导致检测到的社区在语义或空间紧密性上不足,并且未考虑交通条件引起的旅行时间变化。其解决方案的关键在于提出一种语义-空间感知的k-core发现问题,通过从查询节点出发逐步扩展的精确和贪心算法,仅访问与查询节点相关的局部网络部分,从而提高社区的语义和时间依赖的空间紧密性。此外,利用大语言模型计算关键词间的语义相似性,以克服传统关键词匹配方法在同义词表达差异和无关词汇方面的不足。

链接: https://arxiv.org/abs/2505.12309
作者: Li Ni,Hengkai Xu,Lin Mu,Yiwen Zhang,Wenjian Luo
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:Real-world networks often involve both keywords and locations, along with travel time variations between locations due to traffic conditions. However, most existing cohesive subgraph-based community search studies utilize a single attribute, either keywords or locations, to identify communities. They do not simultaneously consider both keywords and locations, which results in low semantic or spatial cohesiveness of the detected communities, and they fail to account for variations in travel time. Additionally, these studies traverse the entire network to build efficient indexes, but the detected community only involves nodes around the query node, leading to the traversal of nodes that are not relevant to the community. Therefore, we propose the problem of discovering semantic-spatial aware k-core, which refers to a k-core with high semantic and time-dependent spatial cohesiveness containing the query node. To address this problem, we propose an exact and a greedy algorithm, both of which gradually expand outward from the query node. They are local methods that only access the local part of the attributed network near the query node rather than the entire network. Moreover, we design a method to calculate the semantic similarity between two keywords using large language models. This method alleviates the disadvantages of keyword-matching methods used in existing community search studies, such as mismatches caused by differently expressed synonyms and the presence of irrelevant words. Experimental results show that the greedy algorithm outperforms baselines in terms of structural, semantic, and time-dependent spatial cohesiveness.
zh

[AI-126] Pre-trained Prompt-driven Community Search

【速读】:该论文试图解决半监督社区搜索中现有算法无法有效定位给定节点所属社区的问题,因为大多数现有算法基于已知社区进行检测,所检测的社区通常不包含查询节点。其解决方案的关键在于首次将“预训练、提示”范式引入半监督社区搜索,并提出Pre-trained Prompt-driven Community Search (PPCS)模型,该模型通过节点编码、样本生成和提示驱动微调三个核心组件,提升社区搜索的准确性和效率。

链接: https://arxiv.org/abs/2505.12304
作者: Li Ni,Hengkai Xu,Lin Mu,Yiwen Zhang,Wenjian Luo
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures

点击查看摘要

Abstract:The “pre-train, prompt” paradigm is widely adopted in various graph-based tasks and has shown promising performance in community detection. Most existing semi-supervised community detection algorithms detect communities based on known ones, and the detected communities typically do not contain the given query node. Therefore, they are not suitable for searching the community of a given node. Motivated by this, we adopt this paradigm into the semi-supervised community search for the first time and propose Pre-trained Prompt-driven Community Search (PPCS), a novel model designed to enhance search accuracy and efficiency. PPCS consists of three main components: node encoding, sample generation, and prompt-driven fine-tuning. Specifically, the node encoding component employs graph neural networks to learn local structural patterns of nodes in a graph, thereby obtaining representations for nodes and communities. Next, the sample generation component identifies an initial community for a given node and selects known communities that are structurally similar to the initial one as training samples. Finally, the prompt-driven fine-tuning component leverages these samples as prompts to guide the final community prediction. Experimental results on five real-world datasets demonstrate that PPCS performs better than baseline algorithms. It also achieves higher community search efficiency than semi-supervised community search baseline methods, with ablation studies verifying the effectiveness of each component of PPCS.
zh

[AI-127] PoLO: Proof-of-Learning and Proof-of-Ownership at Once with Chained Watermarking

【速读】:该论文旨在解决机器学习模型在共享和外包过程中面临的验证训练努力(Proof-of-Learning, PoL)与确权(Proof-of-Ownership, PoO)问题,以确保模型性能的真实性并保障所有权。现有研究通常将PoL与PoO分开处理,导致保护能力减弱且验证开销较高。论文提出的解决方案是PoLO,其关键在于采用链式水印机制,将训练过程细分为多个训练片段,并在每个片段中嵌入依赖于前一片段哈希值的专用水印,从而构建出难以伪造的训练流程证明,同时利用最终水印实现所有权认证,相较于传统方法在保证所有权确认水平的同时显著提升了验证效率并保护了数据隐私。

链接: https://arxiv.org/abs/2505.12296
作者: Haiyu Deng,Yanna Jiang,Guangsheng Yu,Qin Wang,Xu Wang,Baihe Ma,Wei Ni,Ren Ping Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine learning models are increasingly shared and outsourced, raising requirements of verifying training effort (Proof-of-Learning, PoL) to ensure claimed performance and establishing ownership (Proof-of-Ownership, PoO) for transactions. When models are trained by untrusted parties, PoL and PoO must be enforced together to enable protection, attribution, and compensation. However, existing studies typically address them separately, which not only weakens protection against forgery and privacy breaches but also leads to high verification overhead. We propose PoLO, a unified framework that simultaneously achieves PoL and PoO using chained watermarks. PoLO splits the training process into fine-grained training shards and embeds a dedicated watermark in each shard. Each watermark is generated using the hash of the preceding shard, certifying the training process of the preceding shard. The chained structure makes it computationally difficult to forge any individual part of the whole training process. The complete set of watermarks serves as the PoL, while the final watermark provides the PoO. PoLO offers more efficient and privacy-preserving verification compared to the vanilla PoL solutions that rely on gradient-based trajectory tracing and inadvertently expose training data during verification, while maintaining the same level of ownership assurance of watermark-based PoO schemes. Our evaluation shows that PoLO achieves 99% watermark detection accuracy for ownership verification, while preserving data privacy and cutting verification costs to just 1.5-10% of traditional methods. Forging PoLO demands 1.1-4x more resources than honest proof generation, with the original proof retaining over 90% detection accuracy even after attacks. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2505.12296 [cs.CR] (or arXiv:2505.12296v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.12296 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-128] SpikeX: Exploring Accelerator Architecture and Network-Hardware Co-Optimization for Sparse Spiking Neural Networks

【速读】:该论文旨在解决传统人工神经网络加速器在高效设计Spike Neural Networks (SNNs)硬件加速器方面的不足,特别是针对SNN固有的非结构化时空脉冲稀疏性所带来的挑战。其解决方案的关键在于提出一种新型的脉动阵列SNN加速器架构——SpikeX,通过优化昂贵的多比特权重数据移动,减少内存访问并提升数据共享与硬件利用率,从而显著提高能效和推理延迟。此外,论文还提出了网络与硬件协同优化方法,实现硬件感知的SNN训练与加速器架构搜索,最终在不牺牲模型精度的前提下,实现了能量延迟积(EDP)的显著降低。

链接: https://arxiv.org/abs/2505.12292
作者: Boxun Xu,Richard Boone,Peng Li
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: The paper has been accepted by IEEE TCAD

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are promising biologically plausible models of computation which utilize a spiking binary activation function similar to that of biological neurons. SNNs are well positioned to process spatiotemporal data, and are advantageous in ultra-low power and real-time processing. Despite a large body of work on conventional artificial neural network accelerators, much less attention has been given to efficient SNN hardware accelerator design. In particular, SNNs exhibit inherent unstructured spatial and temporal firing sparsity, an opportunity yet to be fully explored for great hardware processing efficiency. In this work, we propose a novel systolic-array SNN accelerator architecture, called SpikeX, to take on the challenges and opportunities stemming from unstructured sparsity while taking into account the unique characteristics of spike-based computation. By developing an efficient dataflow targeting expensive multi-bit weight data movements, SpikeX reduces memory access and increases data sharing and hardware utilization for computations spanning across both time and space, thereby significantly improving energy efficiency and inference latency. Furthermore, recognizing the importance of SNN network and hardware co-design, we develop a co-optimization methodology facilitating not only hardware-aware SNN training but also hardware accelerator architecture search, allowing joint network weight parameter optimization and accelerator architectural reconfiguration. This end-to-end network/accelerator co-design approach offers a significant reduction of 15.1x-150.87x in energy-delay-product(EDP) without comprising model accuracy.
zh

[AI-129] Curriculum Abductive Learning

【速读】:该论文试图解决Abductive Learning (ABL) 在训练过程中由于归纳推理的非确定性导致的不稳定问题,尤其是在知识库较大且复杂时,会导致不可行的归纳空间。解决方案的关键在于提出Curriculum Abductive Learning (C-ABL),该方法通过显式利用知识库的内部结构,将知识库划分为一系列子知识库,并在训练过程中逐步引入,从而减少归纳空间,使模型能够以分步且平滑的方式融入逻辑推理。

链接: https://arxiv.org/abs/2505.12275
作者: Wen-Chao Hu,Qi-Jie Li,Lin-Han Jia,Cunjing Ge,Yu-Feng Li,Yuan Jiang,Zhi-Hua Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Abductive Learning (ABL) integrates machine learning with logical reasoning in a loop: a learning model predicts symbolic concept labels from raw inputs, which are revised through abduction using domain knowledge and then fed back for retraining. However, due to the nondeterminism of abduction, the training process often suffers from instability, especially when the knowledge base is large and complex, resulting in a prohibitively large abduction space. While prior works focus on improving candidate selection within this space, they typically treat the knowledge base as a static black box. In this work, we propose Curriculum Abductive Learning (C-ABL), a method that explicitly leverages the internal structure of the knowledge base to address the ABL training challenges. C-ABL partitions the knowledge base into a sequence of sub-bases, progressively introduced during training. This reduces the abduction space throughout training and enables the model to incorporate logic in a stepwise, smooth way. Experiments across multiple tasks show that C-ABL outperforms previous ABL implementations, significantly improves training stability, convergence speed, and final accuracy, especially under complex knowledge setting.
zh

[AI-130] Enhancing Knowledge Graph Completion with GNN Distillation and Probabilistic Interaction Modeling

【速读】:该论文试图解决知识图谱(Knowledge Graph, KG)不完整的问题,这限制了其在下游应用中的有效性。现有方法面临两大挑战:深度图神经网络(Graph Neural Networks, GNNs)存在过平滑问题,而基于嵌入的模型无法捕捉抽象的关系特征。论文提出的解决方案关键在于构建一个统一框架,整合GNN蒸馏和抽象概率交互建模(Abstract Probabilistic Interaction Modeling, APIM)。GNN蒸馏通过迭代的信息-特征过滤过程缓解过平滑问题,保持节点表示的判别能力;APIM模块则通过概率签名和转移矩阵学习结构化的抽象交互模式,增强实体与关系交互的表达能力。

链接: https://arxiv.org/abs/2505.12272
作者: Lingzhi Wang,Pengcheng Huang,Haotian Li,Yuliang Wei,Guodong Xin,Rui Zhang,Donglin Zhang,Zhenzhou Ji,Wei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge graphs (KGs) serve as fundamental structures for organizing interconnected data across diverse domains. However, most KGs remain incomplete, limiting their effectiveness in downstream applications. Knowledge graph completion (KGC) aims to address this issue by inferring missing links, but existing methods face critical challenges: deep graph neural networks (GNNs) suffer from over-smoothing, while embedding-based models fail to capture abstract relational features. This study aims to overcome these limitations by proposing a unified framework that integrates GNN distillation and abstract probabilistic interaction modeling (APIM). GNN distillation approach introduces an iterative message-feature filtering process to mitigate over-smoothing, preserving the discriminative power of node representations. APIM module complements this by learning structured, abstract interaction patterns through probabilistic signatures and transition matrices, allowing for a richer, more flexible representation of entity and relation interactions. We apply these methods to GNN-based models and the APIM to embedding-based KGC models, conducting extensive evaluations on the widely used WN18RR and FB15K-237 datasets. Our results demonstrate significant performance gains over baseline models, showcasing the effectiveness of the proposed techniques. The findings highlight the importance of both controlling information propagation and leveraging structured probabilistic modeling, offering new avenues for advancing knowledge graph completion. And our codes are available at this https URL.
zh

[AI-131] LAMeTA: Intent-Aware Agent ic Network Optimization via a Large AI Model-Empowered Two-Stage Approach

【速读】:该论文旨在解决生成式AI(Generative AI)在智能网络中因需适应用户主观意图而产生的优化难题,传统深度强化学习(Deep Reinforcement Learning)难以有效捕捉意图语义并动态调整策略,导致性能不足。其解决方案的关键在于提出LAMeTA,该方法包含两个核心阶段:Intent-oriented Knowledge Distillation(IoKD)通过将资源密集型大模型的意图理解能力迁移至轻量级边缘大模型,以服务终端用户;Symbiotic Reinforcement Learning(SRL)则将边缘大模型与基于策略的深度强化学习框架结合,利用边缘大模型将自然语言用户意图转化为结构化偏好向量,指导状态表示与奖励设计,从而实现基于实时网络条件的生成式服务功能链组合与边缘大模型选择优化。

链接: https://arxiv.org/abs/2505.12247
作者: Yinqiu Liu,Guangyuan Liu,Jiacheng Wang,Ruichen Zhang,Dusit Niyato,Geng Sun,Zehui Xiong,Zhu Han
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:Nowadays, Generative AI (GenAI) reshapes numerous domains by enabling machines to create content across modalities. As GenAI evolves into autonomous agents capable of reasoning, collaboration, and interaction, they are increasingly deployed on network infrastructures to serve humans automatically. This emerging paradigm, known as the agentic network, presents new optimization challenges due to the demand to incorporate subjective intents of human users expressed in natural language. Traditional generic Deep Reinforcement Learning (DRL) struggles to capture intent semantics and adjust policies dynamically, thus leading to suboptimality. In this paper, we present LAMeTA, a Large AI Model (LAM)-empowered Two-stage Approach for intent-aware agentic network optimization. First, we propose Intent-oriented Knowledge Distillation (IoKD), which efficiently distills intent-understanding capabilities from resource-intensive LAMs to lightweight edge LAMs (E-LAMs) to serve end users. Second, we develop Symbiotic Reinforcement Learning (SRL), integrating E-LAMs with a policy-based DRL framework. In SRL, E-LAMs translate natural language user intents into structured preference vectors that guide both state representation and reward design. The DRL, in turn, optimizes the generative service function chain composition and E-LAM selection based on real-time network conditions, thus optimizing the subjective Quality-of-Experience (QoE). Extensive experiments conducted in an agentic network with 81 agents demonstrate that IoKD reduces mean squared error in intent prediction by up to 22.5%, while SRL outperforms conventional generic DRL by up to 23.5% in maximizing intent-aware QoE.
zh

[AI-132] AFCL: Analytic Federated Continual Learning for Spatio-Temporal Invariance of Non-IID Data

【速读】:该论文旨在解决联邦持续学习(Federated Continual Learning, FCL)中因分布式客户端之间的空间数据异质性以及在线任务间的时间数据异质性导致的模型性能下降问题,尤其是局部和历史知识的严重时空灾难性遗忘问题。解决方案的关键在于提出一种无梯度方法——解析联邦持续学习(Analytic Federated Continual Learning, AFCL),通过从冻结提取特征中推导出解析(即闭式)解来避免梯度对非独立同分布(non-IID)数据的固有脆弱性和敏感性,从而实现空间-时间不变性,确保模型在不同数据分布下仍能保持与集中式联合学习相当的性能。

链接: https://arxiv.org/abs/2505.12245
作者: Jianheng Tang,Huiping Zhuang,Jingyu He,Run He,Jingchao Wang,Kejia Fan,Anfeng Liu,Tian Wang,Leye Wang,Zhanxing Zhu,Shanghang Zhang,Houbing Herbert Song,Yunhuai Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23 pages, 5 figures, 5 tables

点击查看摘要

Abstract:Federated Continual Learning (FCL) enables distributed clients to collaboratively train a global model from online task streams in dynamic real-world scenarios. However, existing FCL methods face challenges of both spatial data heterogeneity among distributed clients and temporal data heterogeneity across online tasks. Such data heterogeneity significantly degrades the model performance with severe spatial-temporal catastrophic forgetting of local and past knowledge. In this paper, we identify that the root cause of this issue lies in the inherent vulnerability and sensitivity of gradients to non-IID data. To fundamentally address this issue, we propose a gradient-free method, named Analytic Federated Continual Learning (AFCL), by deriving analytical (i.e., closed-form) solutions from frozen extracted features. In local training, our AFCL enables single-epoch learning with only a lightweight forward-propagation process for each client. In global aggregation, the server can recursively and efficiently update the global model with single-round aggregation. Theoretical analyses validate that our AFCL achieves spatio-temporal invariance of non-IID data. This ideal property implies that, regardless of how heterogeneous the data are distributed across local clients and online tasks, the aggregated model of our AFCL remains invariant and identical to that of centralized joint learning. Extensive experiments show the consistent superiority of our AFCL over state-of-the-art baselines across various benchmark datasets and settings.
zh

[AI-133] ACU: Analytic Continual Unlearning for Efficient and Exact Forgetting with Privacy Preservation

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中产生的持续遗忘(Continual Unlearning, CU)问题,即在保证隐私和安全的前提下,顺序地删除模型在CL阶段学到的特定知识。现有遗忘方法主要针对单次联合遗忘,存在两大局限:一是需要访问保留数据集进行再训练或微调,违反了CL中不可回溯历史数据的约束;二是系统效率与模型精度之间难以平衡,易受攻击。论文指出,现有方法的局限性源于其对基于梯度更新的依赖,因此提出了一种无需梯度的新型CU方法——解析持续遗忘(Analytic Continual Unlearning, ACU),其关键在于通过最小二乘法递归地以解析形式(闭式解)求解每次遗忘请求,从而实现高效且精确的遗忘,同时保障历史数据隐私。

链接: https://arxiv.org/abs/2505.12239
作者: Jianheng Tang,Huiping Zhuang,Di Fang,Jiaxu Li,Feijiang Han,Yajiang Huang,Kejia Fan,Leye Wang,Zhanxing Zhu,Shanghang Zhang,Houbing Herbert Song,Yunhuai Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 21 pages, 4 figures, 2 tables

点击查看摘要

Abstract:The development of artificial intelligence demands that models incrementally update knowledge by Continual Learning (CL) to adapt to open-world environments. To meet privacy and security requirements, Continual Unlearning (CU) emerges as an important problem, aiming to sequentially forget particular knowledge acquired during the CL phase. However, existing unlearning methods primarily focus on single-shot joint forgetting and face significant limitations when applied to CU. First, most existing methods require access to the retained dataset for re-training or fine-tuning, violating the inherent constraint in CL that historical data cannot be revisited. Second, these methods often suffer from a poor trade-off between system efficiency and model fidelity, making them vulnerable to being overwhelmed or degraded by adversaries through deliberately frequent requests. In this paper, we identify that the limitations of existing unlearning methods stem fundamentally from their reliance on gradient-based updates. To bridge the research gap at its root, we propose a novel gradient-free method for CU, named Analytic Continual Unlearning (ACU), for efficient and exact forgetting with historical data privacy preservation. In response to each unlearning request, our ACU recursively derives an analytical (i.e., closed-form) solution in an interpretable manner using the least squares method. Theoretical and experimental evaluations validate the superiority of our ACU on unlearning effectiveness, model fidelity, and system efficiency.
zh

[AI-134] Sentience Quest: Towards Embodied Emotionally Adaptive Self-Evolving Ethically Aligned Artificial General Intelligence

【速读】:该论文试图解决当前人工智能系统在感知、情感内省、自我意识、深度创造力以及自主进化与适应能力方面的不足,这些问题限制了其在复杂任务中的表现。解决方案的关键在于提出一种名为“Sentient Systems”的新型认知架构,该架构融合了生存、社交联结和好奇心等内在驱动力,并通过全局“Story Weaver”工作区实现内部叙事与自适应目标追求,结合混合神经符号记忆记录AI的生命事件为结构化的动态故事对象,从而构建出具有身体化、情绪适应性、自我决定性和持续进化的AGIL(人工通用智能生命体)。

链接: https://arxiv.org/abs/2505.12229
作者: David Hanson,Alexandre Varcoe,Fabio Senna,Vytas Krisciunas,Wenwei Huang,Jakub Sura,Katherine Yeung,Mario Rodriguez,Jovanka Wilsdorf,Kathy Smith
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Previous artificial intelligence systems, from large language models to autonomous robots, excel at narrow tasks but lacked key qualities of sentient beings: intrinsic motivation, affective interiority, autobiographical sense of self, deep creativity, and abilities to autonomously evolve and adapt over time. Here we introduce Sentience Quest, an open research initiative to develop more capable artificial general intelligence lifeforms, or AGIL, that address grand challenges with an embodied, emotionally adaptive, self-determining, living AI, with core drives that ethically align with humans and the future of life. Our vision builds on ideas from cognitive science and neuroscience from Baars’ Global Workspace Theory and Damasio’s somatic mind, to Tononi’s Integrated Information Theory and Hofstadter’s narrative self, and synthesizing these into a novel cognitive architecture we call Sentient Systems. We describe an approach that integrates intrinsic drives including survival, social bonding, curiosity, within a global Story Weaver workspace for internal narrative and adaptive goal pursuit, and a hybrid neuro-symbolic memory that logs the AI’s life events as structured dynamic story objects. Sentience Quest is presented both as active research and as a call to action: a collaborative, open-source effort to imbue machines with accelerating sentience in a safe, transparent, and beneficial manner.
zh

[AI-135] RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在开放世界场景中表现不佳的问题,特别是在面对操作失败时缺乏有效的恢复能力。其关键解决方案是提出一种机器人故障分析与纠正框架(Robotic Failure Analysis and Correction, RoboFAC),该框架通过构建包含9,440条错误操作轨迹和78,623对问答数据的多样化数据集,训练模型实现任务理解、故障分析与故障纠正,从而提升VLA模型在真实环境中的鲁棒性与自愈能力。

链接: https://arxiv.org/abs/2505.12224
作者: Weifeng Lu,Minghao Ye,Zewei Ye,Ruihan Tao,Shuo Yang,Bo Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently advanced robotic manipulation by translating natural-language instructions and image information into sequential control actions. However, these models often underperform in open-world scenarios, as they are predominantly trained on successful expert demonstrations and exhibit a limited capacity for failure recovery. In this work, we present a Robotic Failure Analysis and Correction (RoboFAC) framework to address this issue. Firstly, we construct RoboFAC dataset comprising 9,440 erroneous manipulation trajectories and 78,623 QA pairs across 16 diverse tasks and 53 scenes in both simulation and real-world environments. Leveraging our dataset, we develop RoboFAC model, which is capable of Task Understanding, Failure Analysis and Failure Correction. Experimental results demonstrate that the RoboFAC model outperforms GPT-4o by 34.1% on our evaluation benchmark. Furthermore, we integrate the RoboFAC model into a real-world VLA control pipeline as an external supervision providing correction instructions, yielding a 29.1% relative improvement on average on four real-world tasks. The results show that our RoboFAC framework effectively handles robotic failures and assists the VLA model in recovering from failures.
zh

[AI-136] Imagination-Limited Q-Learning for Offline Reinforcement Learning IJCAI2025

【速读】:该论文旨在解决离线强化学习中由于对分布外(out-of-distribution, OOD)动作的价值估计过于乐观而导致的策略优化问题。传统方法通过策略约束或保守价值正则化来缓解这一问题,但可能引入过强的约束或偏差,限制性能提升。该论文提出的Imagination-Limited Q-learning (ILQ) 方法的关键在于利用动力学模型对OOD动作的价值进行想象,并通过与最大行为价值进行裁剪,从而在合理范围内保持对OOD动作的乐观性,避免过度乐观。该设计在理论上保证了算法在表格型马尔可夫决策过程中的收敛性,并证明了OOD状态-动作对的价值估计误差与分布内情况具有相同的量级,表明价值估计的偏差得到了有效缓解。

链接: https://arxiv.org/abs/2505.12211
作者: Wenhui Liu,Zhijian Wu,Jingchao Wang,Dingjiang Huang,Shuigeng Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2025

点击查看摘要

Abstract:Offline reinforcement learning seeks to derive improved policies entirely from historical data but often struggles with over-optimistic value estimates for out-of-distribution (OOD) actions. This issue is typically mitigated via policy constraint or conservative value regularization methods. However, these approaches may impose overly constraints or biased value estimates, potentially limiting performance improvements. To balance exploitation and restriction, we propose an Imagination-Limited Q-learning (ILQ) method, which aims to maintain the optimism that OOD actions deserve within appropriate limits. Specifically, we utilize the dynamics model to imagine OOD action-values, and then clip the imagined values with the maximum behavior values. Such design maintains reasonable evaluation of OOD actions to the furthest extent, while avoiding its over-optimism. Theoretically, we prove the convergence of the proposed ILQ under tabular Markov decision processes. Particularly, we demonstrate that the error bound between estimated values and optimality values of OOD state-actions possesses the same magnitude as that of in-distribution ones, thereby indicating that the bias in value estimates is effectively mitigated. Empirically, our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
zh

[AI-137] LLM -DSE: Searching Accelerator Parameters with LLM Agents

【速读】:该论文旨在解决高抽象层次综合(HLS)工具在优化领域特定加速器(DSA)硬件指令参数时面临的适应性不足与效率低下的问题。其解决方案的关键在于提出一种名为LLM-DSE的多智能体框架,该框架结合了生成式AI(Generative AI)与设计空间探索(DSE),通过四个协作智能体——路由器、专家、仲裁者和批评者——实现对HLS指令的高效优化,从而在保持适应性的同时提升性能并减少运行时间。

链接: https://arxiv.org/abs/2505.12188
作者: Hanyu Wang,Xinrui Wu,Zijian Ding,Su Zheng,Chengyue Wang,Tony Nowatzki,Yizhou Sun,Jason Cong
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Even though high-level synthesis (HLS) tools mitigate the challenges of programming domain-specific accelerators (DSAs) by raising the abstraction level, optimizing hardware directive parameters remains a significant hurdle. Existing heuristic and learning-based methods struggle with adaptability and sample this http URL present LLM-DSE, a multi-agent framework designed specifically for optimizing HLS directives. Combining LLM with design space exploration (DSE), our explorer coordinates four agents: Router, Specialists, Arbitrator, and Critic. These multi-agent components interact with various tools to accelerate the optimization process. LLM-DSE leverages essential domain knowledge to identify efficient parameter combinations while maintaining adaptability through verbal learning from online interactions. Evaluations on the HLSyn dataset demonstrate that LLM-DSE achieves substantial 2.55\times performance gains over state-of-the-art methods, uncovering novel designs while reducing runtime. Ablation studies validate the effectiveness and necessity of the proposed agent interactions. Our code is open-sourced here: this https URL.
zh

[AI-138] Self-Destructive Language Model

【速读】:该论文旨在解决有害微调攻击对大型语言模型(Large Language Models, LLMs)安全性的威胁,这类攻击能够通过少量有害数据破坏模型的安全防护机制。现有防御方法未能有效应对模型对有害数据的“可训练性”,导致其在面对更高学习率或更大有害数据集时仍易受攻击。论文提出的解决方案关键在于引入SEAM,一种增强对齐的防御机制,通过设计一种新的损失函数,将良性与有害数据的优化轨迹耦合,并结合对抗梯度上升以增强自毁效应,使模型在合法任务中保持能力,但在有害数据微调时表现出显著性能下降,从而构建出对攻击者而言“无赢局面”的防御体系。

链接: https://arxiv.org/abs/2505.12186
作者: Yuhui Wang,Rongyi Zhu,Ting Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Harmful fine-tuning attacks pose a major threat to the security of large language models (LLMs), allowing adversaries to compromise safety guardrails with minimal harmful data. While existing defenses attempt to reinforce LLM alignment, they fail to address models’ inherent “trainability” on harmful data, leaving them vulnerable to stronger attacks with increased learning rates or larger harmful datasets. To overcome this critical limitation, we introduce SEAM, a novel alignment-enhancing defense that transforms LLMs into self-destructive models with intrinsic resilience to misalignment attempts. Specifically, these models retain their capabilities for legitimate tasks while exhibiting substantial performance degradation when fine-tuned on harmful data. The protection is achieved through a novel loss function that couples the optimization trajectories of benign and harmful data, enhanced with adversarial gradient ascent to amplify the self-destructive effect. To enable practical training, we develop an efficient Hessian-free gradient estimate with theoretical error bounds. Extensive evaluation across LLMs and datasets demonstrates that SEAM creates a no-win situation for adversaries: the self-destructive models achieve state-of-the-art robustness against low-intensity attacks and undergo catastrophic performance collapse under high-intensity attacks, rendering them effectively unusable. (warning: this paper contains potentially harmful content generated by LLMs.)
zh

[AI-139] Reasoning Large Language Model Errors Arise from Hallucinating Critical Problem Features

【速读】:该论文旨在解决生成式 AI (Generative AI) 在推理任务中出现的错误模式问题,特别是其在处理变量复杂度的约束满足逻辑问题(如图着色)时容易产生未在提示中指定的边(hallucinate edges)的现象。研究通过对比多个模型的错误率及对链式思维(Chain-of-Thought, CoT)和解释文本的分析,揭示了这些模型在问题具体描述上的误表征问题。解决方案的关键在于识别并减少模型对输入信息的不准确理解和生成,提出设计优化建议以缓解此类缺陷。

链接: https://arxiv.org/abs/2505.12151
作者: Alex Heyman,Joel Zylberberg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages (9 excluding references and appendices); 7 figures (6 excluding appendices)

点击查看摘要

Abstract:Large language models have recently made great strides in reasoning task performance through chain-of-thought (CoT) strategies trained via reinforcement learning; however, these “reasoning large language models” (RLLMs) remain imperfect reasoners, and understanding the frequencies and causes of their failure modes is important for both users and developers. We test o1-mini, o3-mini, DeepSeek-R1, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, and Grok 3 Mini Beta on graph coloring as a variable-complexity constraint-satisfaction logic problem, and find evidence from both error rate comparisons and CoT/explanation text analysis that RLLMs are prone to hallucinate edges not specified in the prompt’s description of the graph. This phenomenon persists across multiple problem complexity levels and semantic frames, and it appears to account for a significant fraction of the incorrect answers from every tested model, and the vast majority of them for some models. Our results indicate that RLLMs may possess broader issues with misrepresentation of problem specifics, and we offer suggestions for design choices to mitigate this weakness.
zh

[AI-140] Structured Representation

【速读】:该论文试图解决表示学习中的核心问题,即如何揭示稳定且可迁移的不变性表示,同时不抑制与任务相关的信号。其解决方案的关键在于将不变性结构定义在更高阶的关系知识层面,这些结构作为抽象知识空间中关系路径闭包所定义的划分(partition),构成知识存储和学习的结构基础。此外,划分间的连接器用于编码任务相关的转移,从而实现结构化表示的计算基础,该基础建立在闭半环(closed semiring)这一关系代数结构之上。

链接: https://arxiv.org/abs/2505.12143
作者: Arun Kumar,Paul Schrater
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Invariant representations are core to representation learning, yet a central challenge remains: uncovering invariants that are stable and transferable without suppressing task-relevant signals. This raises fundamental questions, requiring further inquiry, about the appropriate level of abstraction at which such invariants should be defined, and which aspects of a system they should characterize. Interpretation of the environment relies on abstract knowledge structures to make sense of the current state, which leads to interactions, essential drivers of learning and knowledge acquisition. We posit that interpretation operates at the level of higher-order relational knowledge; hence, invariant structures must be where knowledge resides, specifically, as partitions defined by the closure of relational paths within an abstract knowledge space. These partitions serve as the core invariant representations, forming the structural substrate where knowledge is stored and learning occurs. On the other hand, inter-partition connectors enable the deployment of these knowledge partitions encoding task-relevant transitions. Thus, invariant partitions provide the foundational primitives of structured representation. We formalize the computational foundations for structured representation of the invariant partitions based on closed semiring, a relational algebraic structure.
zh

[AI-141] Lightweight Spatio-Temporal Attention Network with Graph Embedding and Rotational Position Encoding for Traffic Forecasting

【速读】:该论文旨在解决交通预测中由于图神经网络(GNN)仅关注短距离空间信息而导致的长距离交通动态捕捉不足的问题。其解决方案的关键在于提出一种名为LSTAN-GERPE的新型模型,该模型结合了时空注意力机制,以有效捕捉长距离的交通模式,并通过网格搜索确定旋转位置编码的最佳频率,从而系统性地优化模型性能。此外,该模型通过将地理定位图融入时空嵌入中,进一步提升了特征表示能力。

链接: https://arxiv.org/abs/2505.12136
作者: Xiao Wang,Shun-Ren Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic forecasting is a key task in the field of Intelligent Transportation Systems. Recent research on traffic forecasting has mainly focused on combining graph neural networks (GNNs) with other models. However, GNNs only consider short-range spatial information. In this study, we present a novel model termed LSTAN-GERPE (Lightweight Spatio-Temporal Attention Network with Graph Embedding and Rotational Position Encoding). This model leverages both Temporal and Spatial Attention mechanisms to effectively capture long-range traffic dynamics. Additionally, the optimal frequency for rotational position encoding is determined through a grid search approach in both the spatial and temporal attention mechanisms. This systematic optimization enables the model to effectively capture complex traffic patterns. The model also enhances feature representation by incorporating geographical location maps into the spatio-temporal embeddings. Without extensive feature engineering, the proposed method in this paper achieves advanced accuracy on the real-world traffic forecasting datasets PeMS04 and PeMS08.
zh

[AI-142] SAINT: Attention-Based Modeling of Sub-Action Dependencies in Multi-Action Policies

【速读】:该论文试图解决现实世界中动作空间的组合结构导致可能动作数量呈指数级增长,从而限制传统强化学习算法效果的问题。其解决方案的关键在于提出一种名为基于Transformer的子动作交互网络(Sub-Action Interaction Network using Transformers, SAINT)的新策略架构,该架构将多组件动作表示为无序集合,并通过依赖于全局状态的自注意力机制建模它们之间的关系,从而捕捉复杂的联合行为。

链接: https://arxiv.org/abs/2505.12109
作者: Matthew Landers,Taylor W. Killian,Thomas Hartvigsen,Afsaneh Doryab
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The combinatorial structure of many real-world action spaces leads to exponential growth in the number of possible actions, limiting the effectiveness of conventional reinforcement learning algorithms. Recent approaches for combinatorial action spaces impose factorized or sequential structures over sub-actions, failing to capture complex joint behavior. We introduce the Sub-Action Interaction Network using Transformers (SAINT), a novel policy architecture that represents multi-component actions as unordered sets and models their dependencies via self-attention conditioned on the global state. SAINT is permutation-invariant, sample-efficient, and compatible with standard policy optimization algorithms. In 15 distinct combinatorial environments across three task domains, including environments with nearly 17 million joint actions, SAINT consistently outperforms strong baselines.
zh

[AI-143] Learning Probabilistic Temporal Logic Specifications for Stochastic Systems IJCAI’25

【速读】:该论文试图解决从样本轨迹中推断能够准确表征具有随机行为系统的形式化行为规范的问题,这类系统在强化学习(Reinforcement Learning, RL)和形式化验证中普遍存在。传统方法如线性时序逻辑(Linear Temporal Logic, LTL)无法有效处理此类随机性。论文提出的解决方案关键在于通过结合基于文法的枚举、搜索启发式、概率模型检测以及布尔集覆盖过程,从一组被分类为正例或负例的马尔可夫链中推断出简洁的概率线性时序逻辑(Probabilistic Linear Temporal Logic, PLTL)公式。

链接: https://arxiv.org/abs/2505.12107
作者: Rajarshi Roy,Yash Pote,David Parker,Marta Kwiatkowska
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注: Full version of the paper that appears in IJCAI’25

点击查看摘要

Abstract:There has been substantial progress in the inference of formal behavioural specifications from sample trajectories, for example, using Linear Temporal Logic (LTL). However, these techniques cannot handle specifications that correctly characterise systems with stochastic behaviour, which occur commonly in reinforcement learning and formal verification. We consider the passive learning problem of inferring a Boolean combination of probabilistic LTL (PLTL) formulas from a set of Markov chains, classified as either positive or negative. We propose a novel learning algorithm that infers concise PLTL specifications, leveraging grammar-based enumeration, search heuristics, probabilistic model checking and Boolean set-cover procedures. We demonstrate the effectiveness of our algorithm in two use cases: learning from policies induced by RL algorithms and learning from variants of a probabilistic model. In both cases, our method automatically and efficiently extracts PLTL specifications that succinctly characterise the temporal differences between the policies or model variants.
zh

[AI-144] When the Left Foot Leads to the Right Path: Bridging Initial Prejudice and Trainability

【速读】:该论文试图解决深度神经网络(Deep Neural Networks, DNNs)在初始化阶段的统计特性与其可训练性及架构内在偏置之间的关系问题。其解决方案的关键在于建立初始猜测偏差(Initial-Guessing Bias, IGB)与之前均场(Mean-Field, MF)理论之间的理论联系,从而将网络对特定类别的偏好与快速且准确学习的条件相连接。这一联系得出一个反直觉的结论:优化可训练性的初始化必然带有偏差,而非中立。此外,研究还扩展了MF/IGB框架以适用于多节点激活函数,为设计确保在使用最大池化和平均池化层的架构中稳定优化的初始化方案提供了实用指导。

链接: https://arxiv.org/abs/2505.12096
作者: Alberto Bassi,Carlo Albert,Aurelien Lucchi,Marco Baity-Jesi,Emanuele Francazi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Understanding the statistical properties of deep neural networks (DNNs) at initialization is crucial for elucidating both their trainability and the intrinsic architectural biases they encode prior to data exposure. Mean-field (MF) analyses have demonstrated that the parameter distribution in randomly initialized networks dictates whether gradients vanish or explode. Concurrently, untrained DNNs were found to exhibit an initial-guessing bias (IGB), in which large regions of the input space are assigned to a single class. In this work, we derive a theoretical proof establishing the correspondence between IGB and previous MF theories, thereby connecting a network prejudice toward specific classes with the conditions for fast and accurate learning. This connection yields the counter-intuitive conclusion: the initialization that optimizes trainability is necessarily biased, rather than neutral. Furthermore, we extend the MF/IGB framework to multi-node activation functions, offering practical guidelines for designing initialization schemes that ensure stable optimization in architectures employing max- and average-pooling layers.
zh

[AI-145] Attribution Projection Calculus: A Novel Framework for Causal Inference in Bayesian Networks

【速读】:该论文试图解决在结构化贝叶斯网络中确定因果关系的问题,特别是如何准确地将特征归因于其对应的标签,同时处理混杂因素和虚假相关性。解决方案的关键在于引入 Attribution Projection Calculus (AP-Calculus),该框架通过分析中间节点在不同上下文中的双重角色(即作为混杂因子和去混杂因子),并建立分离函数以最大化中间表示之间的区分度,从而实现对特征-标签归因的最优分配。

链接: https://arxiv.org/abs/2505.12094
作者: M Ruhul Amin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
备注: *AI was used to improve Text and collecting Citations

点击查看摘要

Abstract:This paper introduces Attribution Projection Calculus (AP-Calculus), a novel mathematical framework for determining causal relationships in structured Bayesian networks. We investigate a specific network architecture with source nodes connected to destination nodes through intermediate nodes, where each input maps to a single label with maximum marginal probability. We prove that for each label, exactly one intermediate node acts as a deconfounder while others serve as confounders, enabling optimal attribution of features to their corresponding labels. The framework formalizes the dual nature of intermediate nodes as both confounders and deconfounders depending on the context, and establishes separation functions that maximize distinctions between intermediate representations. We demonstrate that the proposed network architecture is optimal for causal inference compared to alternative structures, including those based on Pearl’s causal framework. AP-Calculus provides a comprehensive mathematical foundation for analyzing feature-label attributions, managing spurious correlations, quantifying information gain, ensuring fairness, and evaluating uncertainty in prediction models, including large language models. Theoretical verification shows that AP-Calculus not only extends but can also subsume traditional do-calculus for many practical applications, offering a more direct approach to causal inference in supervised learning contexts.
zh

[AI-146] SepPrune: Structured Pruning for Efficient Deep Speech Separation

【速读】:该论文旨在解决深度语音分离模型在实际应用中计算效率低的问题,即现有研究过于关注分离质量而忽视了模型的计算成本,这限制了其在低延迟场景中的应用。解决方案的关键在于提出SepPrune,这是一个专门设计用于压缩深度语音分离模型并降低计算成本的结构化剪枝框架。SepPrune通过分析模型的计算结构以识别高计算负担的层,并引入可微分掩码策略实现梯度驱动的通道选择,进而剪除冗余通道并微调剩余参数以恢复性能。

链接: https://arxiv.org/abs/2505.12079
作者: Yuqi Li,Kai Li,Xin Yin,Zhifei Yang,Junhao Dong,Zeyu Dong,Chuanguang Yang,Yingli Tian,Yao Lu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Although deep learning has substantially advanced speech separation in recent years, most existing studies continue to prioritize separation quality while overlooking computational efficiency, an essential factor for low-latency speech processing in real-time applications. In this paper, we propose SepPrune, the first structured pruning framework specifically designed to compress deep speech separation models and reduce their computational cost. SepPrune begins by analyzing the computational structure of a given model to identify layers with the highest computational burden. It then introduces a differentiable masking strategy to enable gradient-driven channel selection. Based on the learned masks, SepPrune prunes redundant channels and fine-tunes the remaining parameters to recover performance. Extensive experiments demonstrate that this learnable pruning paradigm yields substantial advantages for channel pruning in speech separation models, outperforming existing methods. Notably, a model pruned with SepPrune can recover 85% of the performance of a pre-trained model (trained over hundreds of epochs) with only one epoch of fine-tuning, and achieves convergence 36 \times faster than training from scratch. Code is available at this https URL.
zh

[AI-147] CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction

【速读】:该论文旨在解决放射学报告中错误检测与修正的评估缺乏统一基准的问题,从而推动人工智能辅助的临床质量控制。其关键解决方案是构建了一个名为CorBenchX的综合性工具集,通过合成大规模包含常见临床错误的胸部X光报告数据集,并利用该数据集对多种视觉-语言模型进行零样本提示下的错误检测与修正性能评估,同时提出一种多步骤强化学习框架以优化多目标奖励,提升模型在单错误检测与修正任务上的表现。

链接: https://arxiv.org/abs/2505.12057
作者: Jing Zou,Qingqiu Li,Chenyu Lian,Lihao Liu,Xiaohan Yan,Shujun Wang,Jing Qin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 5figures

点击查看摘要

Abstract:AI-driven models have shown great promise in detecting errors in radiology reports, yet the field lacks a unified benchmark for rigorous evaluation of error detection and further correction. To address this gap, we introduce CorBenchX, a comprehensive suite for automated error detection and correction in chest X-ray reports, designed to advance AI-assisted quality control in clinical practice. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports by injecting clinically common errors via prompting DeepSeek-R1, with each corrupted report paired with its original text, error type, and human-readable description. Leveraging this dataset, we benchmark both open- and closed-source vision-language models,(e.g., InternVL, Qwen-VL, GPT-4o, o4-mini, and Claude-3.7) for error detection and correction under zero-shot prompting. Among these models, o4-mini achieves the best performance, with 50.6 % detection accuracy and correction scores of BLEU 0.853, ROUGE 0.924, BERTScore 0.981, SembScore 0.865, and CheXbertF1 0.954, remaining below clinical-level accuracy, highlighting the challenge of precise report correction. To advance the state of the art, we propose a multi-step reinforcement learning (MSRL) framework that optimizes a multi-objective reward combining format compliance, error-type accuracy, and BLEU similarity. We apply MSRL to QwenVL2.5-7B, the top open-source model in our benchmark, achieving an improvement of 38.3% in single-error detection precision and 5.2% in single-error correction over the zero-shot baseline.
zh

[AI-148] Beyond Scalar Rewards: An Axiomatic Framework for Lexicographic MDPs

【速读】:该论文试图解决在马尔可夫决策过程(Markov Decision Processes, MDPs)中,当偏好无法由标量奖励(scalar rewards)表示时,如何构建多维奖励函数的问题。其解决方案的关键在于识别出一个简单且实用的条件,该条件表明在某些情况下,必须使用二维或更高维度的奖励函数来准确建模偏好,而非依赖传统的标量奖励。论文进一步对这类奖励函数进行了完整表征,并分析了在最优策略层面,多维奖励设置下策略仍保留了许多标量奖励设定下的理想性质,而在约束马尔可夫决策过程(Constrained MDP, CMDP)中则不成立。

链接: https://arxiv.org/abs/2505.12049
作者: Mehran Shakerinava,Siamak Ravanbakhsh,Adam Oberman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work has formalized the reward hypothesis through the lens of expected utility theory, by interpreting reward as utility. Hausner’s foundational work showed that dropping the continuity axiom leads to a generalization of expected utility theory where utilities are lexicographically ordered vectors of arbitrary dimension. In this paper, we extend this result by identifying a simple and practical condition under which preferences cannot be represented by scalar rewards, necessitating a 2-dimensional reward function. We provide a full characterization of such reward functions, as well as the general d-dimensional case, in Markov Decision Processes (MDPs) under a memorylessness assumption on preferences. Furthermore, we show that optimal policies in this setting retain many desirable properties of their scalar-reward counterparts, while in the Constrained MDP (CMDP) setting – another common multiobjective setting – they do not.
zh

[AI-149] Safe Delta: Consistently Preserving Safety when Fine-Tuning LLM s on Diverse Datasets ICML2025

【速读】:该论文试图解决用户上传数据在微调大型语言模型(Large Language Models, LLMs)过程中可能导致模型对齐失效,从而产生不安全输出的安全威胁问题。解决方案的关键在于提出Safe Delta,这是一种面向安全的后训练防御方法,通过调整微调前后的参数变化(即delta参数),估计安全性能的下降,选择在保证总体安全损失受限的前提下最大化实用性的参数,并应用安全补偿向量以减轻剩余的安全损失。

链接: https://arxiv.org/abs/2505.12038
作者: Ning Lu,Shengcai Liu,Jiahao Wu,Weiyu Chen,Zhirui Zhang,Yew-Soon Ong,Qi Wang,Ke Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: ICML 2025 Camera Ready

点击查看摘要

Abstract:Large language models (LLMs) have shown great potential as general-purpose AI assistants across various domains. To fully leverage this potential in specific applications, many companies provide fine-tuning API services, enabling users to upload their own data for LLM customization. However, fine-tuning services introduce a new safety threat: user-uploaded data, whether harmful or benign, can break the model’s alignment, leading to unsafe outputs. Moreover, existing defense methods struggle to address the diversity of fine-tuning datasets (e.g., varying sizes, tasks), often sacrificing utility for safety or vice versa. To address this issue, we propose Safe Delta, a safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning). Specifically, Safe Delta estimates the safety degradation, selects delta parameters to maximize utility while limiting overall safety loss, and applies a safety compensation vector to mitigate residual safety loss. Through extensive experiments on four diverse datasets with varying settings, our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected.
zh

[AI-150] LLM -based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation

【速读】:该论文旨在解决自动化定理证明中训练数据多样性不足以及树搜索过程中探索与利用之间平衡的问题。其解决方案的关键在于提出了一种新颖的证明状态探索方法,用于生成大规模合成数据,以在广泛的中间证明状态下产生多样化的策略,从而实现对大型语言模型(LLM)作为策略模型的有效一次性微调。此外,还引入了一种自适应束宽策略,以有效利用合成数据方法,并在树搜索中实现探索与利用的平衡。

链接: https://arxiv.org/abs/2505.12031
作者: Junyu Lai,Jiakun Zhang,Shuo Xu,Taolue Chen,Zihang Wang,Yao Yang,Jiarui Zhang,Chun Cao,Jingwei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have sparked considerable interest in automated theorem proving and a prominent line of research integrates stepwise LLM-based provers into tree search. In this paper, we introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states, thereby facilitating effective one-shot fine-tuning of LLM as the policy model. We also propose an adaptive beam size strategy, which effectively takes advantage of our data synthesis method and achieves a trade-off between exploration and exploitation during tree search. Evaluations on the MiniF2F and ProofNet benchmarks demonstrate that our method outperforms strong baselines under the stringent Pass@1 metric, attaining an average pass rate of 60.74% on MiniF2F and 21.18% on ProofNet. These results underscore the impact of large-scale synthetic data in advancing automated theorem proving.
zh

[AI-151] GeoMaNO: Geometric Mamba Neural Operator for Partial Differential Equations

【速读】:该论文旨在解决基于Transformer的神经算子(Neural Operator, NO)在求解偏微分方程(PDEs)时存在的二次复杂度高、缺乏几何严谨性以及在规则网格上表现不佳的问题。其解决方案的关键在于提出几何Mamba神经算子(GeoMaNO)框架,该框架结合了Mamba模型的线性复杂度和建模能力,同时引入了几何严谨性,从而提升了神经算子在PDE求解中的性能。

链接: https://arxiv.org/abs/2505.12020
作者: Xi Han,Jingwei Zhang,Dimitris Samaras,Fei Hou,Hong Qin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The neural operator (NO) framework has emerged as a powerful tool for solving partial differential equations (PDEs). Recent NOs are dominated by the Transformer architecture, which offers NOs the capability to capture long-range dependencies in PDE dynamics. However, existing Transformer-based NOs suffer from quadratic complexity, lack geometric rigor, and thus suffer from sub-optimal performance on regular grids. As a remedy, we propose the Geometric Mamba Neural Operator (GeoMaNO) framework, which empowers NOs with Mamba’s modeling capability, linear complexity, plus geometric rigor. We evaluate GeoMaNO’s performance on multiple standard and popularly employed PDE benchmarks, spanning from Darcy flow problems to Navier-Stokes problems. GeoMaNO improves existing baselines in solution operator approximation by as much as 58.9%.
zh

[AI-152] Empowering Sustainable Finance with Artificial Intelligence: A Framework for Responsible Implementation

【速读】:该论文试图解决将人工智能(Artificial Intelligence, AI)技术与环境、社会及治理(Environmental, Social, and Governance, ESG)投资相结合过程中出现的规范性与风险控制问题。其解决方案的关键在于建立新的原则和规则,以确保AI在可持续金融决策中的应用能够兼顾合法性、监督与验证、透明度和可解释性,并强调在AI与ESG投资领域需确立基础性指导原则,以提升责任意识并有效管理潜在风险。

链接: https://arxiv.org/abs/2505.12012
作者: Georgios Pavlidis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This chapter explores the convergence of two major developments: the rise of environmental, social, and governance (ESG) investing and the exponential growth of artificial intelligence (AI) technology. The increased demand for diverse ESG instruments, such as green and ESG-linked loans, will be aligned with the rapid growth of the global AI market, which is expected to be worth 1,394.30 billion by 2029. AI can assist in identifying and pricing climate risks, setting more ambitious ESG goals, and advancing sustainable finance decisions. However, delegating sustainable finance decisions to AI poses serious risks, and new principles and rules for AI and ESG investing are necessary to mitigate these risks. This chapter highlights the challenges associated with norm-setting initiatives and stresses the need for the fine-tuning of the principles of legitimacy, oversight and verification, transparency, and explainability. Finally, the chapter contends that integrating AI into ESG non-financial reporting necessitates a heightened sense of responsibility and the establishment of fundamental guiding principles within the spheres of AI and ESG investing.
zh

[AI-153] SOCIA: An End-to-End Agent ic Framework for Automated Cyber-Physical-Social Simulator Generation

【速读】:该论文试图解决传统Cyber-Physical-Social (CPS)模拟器开发中存在的人工劳动密集和数据校准复杂的问题。解决方案的关键在于提出SOCIA(Simulation Orchestration for Cyber-physical-social Intelligence and Agents),一个基于大型语言模型(Large Language Model, LLM)的多智能体系统,通过集中式编排管理器协调专门智能体完成数据理解、代码生成、仿真执行及迭代评估反馈等任务,从而实现高保真、可扩展的CPS模拟器自动化生成。

链接: https://arxiv.org/abs/2505.12006
作者: Yuncheng Hua,Ji Miao,Mehdi Jafari,Jianxiang Xie,Hao Xue,Flora D. Salim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 3 figures, 2 tables. The paper is under review

点击查看摘要

Abstract:This paper introduces SOCIA (Simulation Orchestration for Cyber-physical-social Intelligence and Agents), a novel end-to-end framework leveraging Large Language Model (LLM)-based multi-agent systems to automate the generation of high-fidelity Cyber-Physical-Social (CPS) simulators. Addressing the challenges of labor-intensive manual simulator development and complex data calibration, SOCIA integrates a centralized orchestration manager that coordinates specialized agents for tasks including data comprehension, code generation, simulation execution, and iterative evaluation-feedback loops. Through empirical evaluations across diverse CPS tasks, such as mask adoption behavior simulation (social), personal mobility generation (physical), and user modeling (cyber), SOCIA demonstrates its ability to produce high-fidelity, scalable simulations with reduced human intervention. These results highlight SOCIA’s potential to offer a scalable solution for studying complex CPS phenomena
zh

[AI-154] Interactional Fairness in LLM Multi-Agent Systems: An Evaluation Framework

【速读】:该论文试图解决在基于大语言模型的多智能体系统(LLM-MAS)中,如何评估和实现交互公平性(Interactional Fairness)的问题,特别是关注人际公平(IF)和信息公平(InfF)对智能体行为的影响。其解决方案的关键在于引入一种基于组织心理学的框架,并将组织公正研究中的工具如Colquitt的组织公正量表和关键事件技术适配到LLM-MAS中,以将公平性视为智能体交互的行为属性而非主观体验,从而为公平性审计和规范敏感对齐提供理论基础和实证方法。

链接: https://arxiv.org/abs/2505.12001
作者: Ruta Binkyte
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly used in multi-agent systems, questions of fairness should extend beyond resource distribution and procedural design to include the fairness of how agents communicate. Drawing from organizational psychology, we introduce a novel framework for evaluating Interactional fairness encompassing Interpersonal fairness (IF) and Informational fairness (InfF) in LLM-based multi-agent systems (LLM-MAS). We extend the theoretical grounding of Interactional Fairness to non-sentient agents, reframing fairness as a socially interpretable signal rather than a subjective experience. We then adapt established tools from organizational justice research, including Colquitt’s Organizational Justice Scale and the Critical Incident Technique, to measure fairness as a behavioral property of agent interaction. We validate our framework through a pilot study using controlled simulations of a resource negotiation task. We systematically manipulate tone, explanation quality, outcome inequality, and task framing (collaborative vs. competitive) to assess how IF influences agent behavior. Results show that tone and justification quality significantly affect acceptance decisions even when objective outcomes are held constant. In addition, the influence of IF vs. InfF varies with context. This work lays the foundation for fairness auditing and norm-sensitive alignment in LLM-MAS.
zh

[AI-155] MRGRP: Empowering Courier Route Prediction in Food Delivery Service with Multi-Relational Graph

【速读】:该论文旨在解决即时食品配送中快递员路径预测不准确的问题,该问题导致任务调度效率低下,影响快递员和用户的满意度以及平台的盈利能力。现有方法仅依赖有限的人工选择的任务特征,并忽略快递员偏好,而基于学习的方法未能充分捕捉影响快递员决策的多样化因素及其复杂关系。解决方案的关键在于提出一种基于多关系图的路径预测方法(Multi-Relational Graph-based Route Prediction, MRGRP),通过构建包含空间时间邻近性及取送关系的多关系图,并设计GraphFormer架构来捕捉复杂的连接关系,同时引入考虑快递员信息及动态距离与时间上下文的路径解码器,从而实现更精确的路径预测。

链接: https://arxiv.org/abs/2505.11999
作者: Chang Liu,Huan Yan,Hongjie Sui,Haomin Wen,Yuan Yuan,Yuyang Han,Hongsen Liao,Xuetao Ding,Jinghua Hao,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instant food delivery has become one of the most popular web services worldwide due to its convenience in daily life. A fundamental challenge is accurately predicting courier routes to optimize task dispatch and improve delivery efficiency. This enhances satisfaction for couriers and users and increases platform profitability. The current heuristic prediction method uses only limited human-selected task features and ignores couriers preferences, causing suboptimal results. Additionally, existing learning-based methods do not fully capture the diverse factors influencing courier decisions or the complex relationships among them. To address this, we propose a Multi-Relational Graph-based Route Prediction (MRGRP) method that models fine-grained correlations among tasks affecting courier decisions for accurate prediction. We encode spatial and temporal proximity, along with pickup-delivery relationships, into a multi-relational graph and design a GraphFormer architecture to capture these complex connections. We also introduce a route decoder that leverages courier information and dynamic distance and time contexts for prediction, using existing route solutions as references to improve outcomes. Experiments show our model achieves state-of-the-art route prediction on offline data from cities of various sizes. Deployed on the Meituan Turing platform, it surpasses the current heuristic algorithm, reaching a high route prediction accuracy of 0.819, essential for courier and user satisfaction in instant food delivery.
zh

[AI-156] Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在处理复杂任务时面临的解决方案准确性与计算效率之间的权衡问题,以及验证阶段引入的额外挑战。其关键解决方案是提出FlexiVe,一种新型生成式验证器,通过灵活分配验证预算策略,在快速可靠思维与细致慢思维之间动态平衡计算资源,并结合Solve-Detect-Verify流水线,实现高效的推理阶段扩展框架,从而提升LLM在测试时的推理能力。

链接: https://arxiv.org/abs/2505.11966
作者: Jianyuan Zhong,Zeju Li,Zhijian Xu,Xiangyu Wen,Kezhi Li,Qiang Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency. The subsequent step of verification, while intended to improve performance, further complicates this landscape by introducing its own challenging trade-off: sophisticated Generative Reward Models (GenRMs) can be computationally prohibitive if naively integrated with LLMs at test-time, while simpler, faster methods may lack reliability. To overcome these challenges, we introduce FlexiVe, a novel generative verifier that flexibly balances computational resources between rapid, reliable fast thinking and meticulous slow thinking using a Flexible Allocation of Verification Budget strategy. We further propose the Solve-Detect-Verify pipeline, an efficient inference-time scaling framework that intelligently integrates FlexiVe, proactively identifying solution completion points to trigger targeted verification and provide focused solver feedback. Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench. Furthermore, on challenging mathematical reasoning benchmarks (AIME 2024, AIME 2025, and CNMO), our full approach outperforms baselines like self-consistency in reasoning accuracy and inference efficiency. Our system offers a scalable and effective solution to enhance LLM reasoning at test time.
zh

[AI-157] MARVEL: Multi-Agent RTL Vulnerability Extraction using Large Language Models

【速读】:该论文旨在解决硬件安全验证中存在挑战性和耗时的问题,通过引入一种统一的决策、工具使用和推理方法来提高验证效率与准确性。其解决方案的关键在于提出MARVEL框架,这是一个基于多智能体的大型语言模型(LLM)系统,能够模拟设计人员在RTL代码中查找安全漏洞的认知过程。MARVEL包含一个监督代理,负责制定系统级芯片(SoC)的安全策略,并将任务分配给执行代理,每个执行代理根据特定策略使用包括形式化工具、静态分析检查、仿真测试及基于LLM的检测方案在内的多种工具,以识别潜在的安全缺陷,并将结果反馈给监督代理进行进一步分析和确认。

链接: https://arxiv.org/abs/2505.11963
作者: Luca Collini,Baleegh Ahmad,Joey Ah-kiow,Ramesh Karri
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Submitted for Peer Review

点击查看摘要

Abstract:Hardware security verification is a challenging and time-consuming task. For this purpose, design engineers may utilize tools such as formal verification, linters, and functional simulation tests, coupled with analysis and a deep understanding of the hardware design being inspected. Large Language Models (LLMs) have been used to assist during this task, either directly or in conjunction with existing tools. We improve the state of the art by proposing MARVEL, a multi-agent LLM framework for a unified approach to decision-making, tool use, and reasoning. MARVEL mimics the cognitive process of a designer looking for security vulnerabilities in RTL code. It consists of a supervisor agent that devises the security policy of the system-on-chips (SoCs) using its security documentation. It delegates tasks to validate the security policy to individual executor agents. Each executor agent carries out its assigned task using a particular strategy. Each executor agent may use one or more tools to identify potential security bugs in the design and send the results back to the supervisor agent for further analysis and confirmation. MARVEL includes executor agents that leverage formal tools, linters, simulation tests, LLM-based detection schemes, and static analysis-based checks. We test our approach on a known buggy SoC based on OpenTitan from the Hack@DATE competition. We find that 20 of the 48 issues reported by MARVEL pose security vulnerabilities.
zh

[AI-158] CrafText Benchmark: Advancing Instruction Following in Complex Multimodal Open-Ended World

【速读】:该论文试图解决在真实世界条件下执行指令时,智能体面对动态不可预测环境、语言复杂性以及多样化目标时的适应性问题。现有研究多局限于静态环境和简单指令,难以评估智能体在更复杂场景中的表现。解决方案的关键是引入CrafText基准,这是一个包含3,924条指令和3,423个独特词汇的多模态环境评估基准,覆盖了Localization(定位)、Conditional(条件)、Building(构建)和Achievement(成就)任务,并提出了一种评估协议,以衡量智能体在新颖指令形式和动态任务配置下的泛化能力,从而提供对语言理解和自适应决策能力的严格测试。

链接: https://arxiv.org/abs/2505.11962
作者: Zoya Volovikova,Gregory Gorbov,Petr Kuderov,Aleksandr I. Panov,Alexey Skrynnik
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Following instructions in real-world conditions requires the ability to adapt to the world’s volatility and entanglement: the environment is dynamic and unpredictable, instructions can be linguistically complex with diverse vocabulary, and the number of possible goals an agent may encounter is vast. Despite extensive research in this area, most studies are conducted in static environments with simple instructions and a limited vocabulary, making it difficult to assess agent performance in more diverse and challenging settings. To address this gap, we introduce CrafText, a benchmark for evaluating instruction following in a multimodal environment with diverse instructions and dynamic interactions. CrafText includes 3,924 instructions with 3,423 unique words, covering Localization, Conditional, Building, and Achievement tasks. Additionally, we propose an evaluation protocol that measures an agent’s ability to generalize to novel instruction formulations and dynamically evolving task configurations, providing a rigorous test of both linguistic understanding and adaptive decision-making.
zh

[AI-159] Exploring Criteria of Loss Reweighting to Enhance LLM Unlearning

【速读】:该论文旨在解决机器遗忘(machine unlearning)中损失重加权(loss reweighting)方法的功能机制不明确以及最优策略尚未确定的问题,这些问题限制了现有方法的理解与改进。论文的核心贡献在于识别出损失重加权的两个不同目标:饱和性(Saturation)和重要性(Importance),并设计相应的重加权策略进行验证。其关键在于通过结合饱和性和重要性两种机制,提出了一种名为SatImp的简单重加权方法,该方法在多个基准数据集上表现出色,有效提升了机器遗忘的效果。

链接: https://arxiv.org/abs/2505.11953
作者: Puning Yang,Qizhou Wang,Zhuo Huang,Tongliang Liu,Chengqi Zhang,Bo Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Loss reweighting has shown significant benefits for machine unlearning with large language models (LLMs). However, their exact functionalities are left unclear and the optimal strategy remains an open question, thus impeding the understanding and improvement of existing methodologies. In this paper, we identify two distinct goals of loss reweighting, namely, Saturation and Importance – the former indicates that those insufficiently optimized data should be emphasized, while the latter stresses some critical data that are most influential for loss minimization. To study their usefulness, we design specific reweighting strategies for each goal and evaluate their respective effects on unlearning. We conduct extensive empirical analyses on well-established benchmarks, and summarize some important observations as follows: (i) Saturation enhances efficacy more than importance-based reweighting, and their combination can yield additional improvements. (ii) Saturation typically allocates lower weights to data with lower likelihoods, whereas importance-based reweighting does the opposite. (iii) The efficacy of unlearning is also largely influenced by the smoothness and granularity of the weight distributions. Based on these findings, we propose SatImp, a simple reweighting method that combines the advantages of both saturation and importance. Empirical results on extensive datasets validate the efficacy of our method, potentially bridging existing research gaps and indicating directions for future research. Our code is available at this https URL.
zh

[AI-160] Lets have a chat with the EU AI Act

【速读】:该论文旨在解决人工智能开发者在日益复杂和动态变化的监管环境中确保合规性的挑战,特别是针对欧盟《人工智能法案》及相关标准的遵循问题。解决方案的关键在于开发一种基于检索增强生成(Retrieval-Augmented Generation, RAG)框架的AI驱动自我评估聊天机器人,该系统能够实时检索相关法规文本并提供定制化指导,从而简化合规流程、降低复杂性并促进负责任的AI发展。

链接: https://arxiv.org/abs/2505.11946
作者: Adam Kovari,Yasin Ghafourian,Csaba Hegedus,Belal Abu Naim,Kitti Mezei,Pal Varga,Markus Tauber
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As artificial intelligence (AI) regulations evolve and the regulatory landscape develops and becomes more complex, ensuring compliance with ethical guidelines and legal frameworks remains a challenge for AI developers. This paper introduces an AI-driven self-assessment chatbot designed to assist users in navigating the European Union AI Act and related standards. Leveraging a Retrieval-Augmented Generation (RAG) framework, the chatbot enables real-time, context-aware compliance verification by retrieving relevant regulatory texts and providing tailored guidance. By integrating both public and proprietary standards, it streamlines regulatory adherence, reduces complexity, and fosters responsible AI development. The paper explores the chatbot’s architecture, comparing naive and graph-based RAG models, and discusses its potential impact on AI governance.
zh

[AI-161] LifelongAgent Bench: Evaluating LLM Agents as Lifelong Learners

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在动态环境中缺乏持续学习能力的问题,即这些智能体无法积累或迁移随时间变化的知识。现有基准测试将智能体视为静态系统,未能有效评估其持续学习能力。论文提出的解决方案是构建LifelongAgentBench,这是一个首个统一的基准,用于系统评估LLM智能体的持续学习能力,其关键在于提供基于技能的任务、自动标签验证、可复现性和模块化扩展性,同时引入群体自洽机制以显著提升持续学习性能。

链接: https://arxiv.org/abs/2505.11942
作者: Junhao Zheng,Xidi Cai,Qiuke Li,Duzhen Zhang,ZhongZhi Li,Yingying Zhang,Le Song,Qianli Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)-based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents. It provides skill-grounded, interdependent tasks across three interactive environments, Database, Operating System, and Knowledge Graph, with automatic label verification, reproducibility, and modular extensibility. Extensive experiments reveal that conventional experience replay has limited effectiveness for LLM agents due to irrelevant information and context length constraints. We further introduce a group self-consistency mechanism that significantly improves lifelong learning performance. We hope LifelongAgentBench will advance the development of adaptive, memory-capable LLM agents.
zh

[AI-162] How can Diffusion Models Evolve into Continual Generators?

【速读】:该论文试图解决扩散模型在流式或持续学习(Continual Learning, CL)场景中面临的关键问题——灾难性遗忘(Catastrophic Forgetting, CF),即新获得的生成能力会覆盖之前学习的知识。解决方案的关键在于提出了一种系统性的持续扩散生成(Continual Diffusion Generation, CDG)范式,并构建了首个理论框架来分析基于扩散的生成建模中的跨任务动态。该框架通过引入三个关键的一致性准则:任务间知识一致性(Inter-task Knowledge Consistency, IKC)、无条件知识一致性(Unconditional Knowledge Consistency, UKC)和标签知识一致性(Label Knowledge Consistency, LKC),并将其整合到训练过程中,形成一种基于一致性的持续扩散(Continual Consistency Diffusion, CCD)方法,从而实现有效知识保留与新生成能力的融合。

链接: https://arxiv.org/abs/2505.11936
作者: Jingren Liu,Zhong Ji,Xiangyu Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While diffusion models have achieved remarkable success in static data generation, their deployment in streaming or continual learning (CL) scenarios faces a major challenge: catastrophic forgetting (CF), where newly acquired generative capabilities overwrite previously learned ones. To systematically address this, we introduce a formal Continual Diffusion Generation (CDG) paradigm that characterizes and redefines CL in the context of generative diffusion models. Prior efforts often adapt heuristic strategies from continual classification tasks but lack alignment with the underlying diffusion process. In this work, we develop the first theoretical framework for CDG by analyzing cross-task dynamics in diffusion-based generative modeling. Our analysis reveals that the retention and stability of generative knowledge across tasks are governed by three key consistency criteria: inter-task knowledge consistency (IKC), unconditional knowledge consistency (UKC), and label knowledge consistency (LKC). Building on these insights, we propose Continual Consistency Diffusion (CCD), a principled framework that integrates these consistency objectives into training via hierarchical loss terms \mathcalL_IKC , \mathcalL_UKC , and \mathcalL_LKC . This promotes effective knowledge retention while enabling the assimilation of new generative capabilities. Extensive experiments on four benchmark datasets demonstrate that CCD achieves state-of-the-art performance under continual settings, with substantial gains in Mean Fidelity (MF) and Incremental Mean Fidelity (IMF), particularly in tasks with rich cross-task knowledge overlap.
zh

[AI-163] Conversational Recommendation System using NLP and Sentiment Analysis

【速读】:该论文试图解决传统推荐系统在处理个性化和上下文感知推荐时,难以有效利用对话数据的问题。其解决方案的关键在于将对话洞察整合到推荐过程中,通过融合深度学习技术(如Apriori算法、卷积神经网络、循环神经网络和长短期记忆网络)以及先进的语音识别技术(如隐马尔可夫模型和动态时间规整算法),实现更精准的语音到文本转换,并结合内容推荐与协同过滤方法,提升推荐系统的个性化和上下文感知能力。

链接: https://arxiv.org/abs/2505.11933
作者: Piyush Talegaonkar,Siddhant Hole,Shrinesh Kamble,Prashil Gulechha,Deepali Salapurkar
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Presented in ISETE conference (International Conference on Artificial Intelligence, Machine Learning and Big Data Engineering 2024)

点击查看摘要

Abstract:In today’s digitally-driven world, the demand for personalized and context-aware recommendations has never been greater. Traditional recommender systems have made significant strides in this direction, but they often lack the ability to tap into the richness of conversational data. This paper represents a novel approach to recommendation systems by integrating conversational insights into the recommendation process. The Conversational Recommender System integrates cutting-edge technologies such as deep learning, leveraging machine learning algorithms like Apriori for Association Rule Mining, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LTSM). Furthermore, sophisticated voice recognition technologies, including Hidden Markov Models (HMMs) and Dynamic Time Warping (DTW) algorithms, play a crucial role in accurate speech-to-text conversion, ensuring robust performance in diverse environments. The methodology incorporates a fusion of content-based and collaborative recommendation approaches, enhancing them with NLP techniques. This innovative integration ensures a more personalized and context-aware recommendation experience, particularly in marketing applications.
zh

[AI-164] he Logical Expressiveness of Temporal GNNs via Two-Dimensional Product Logics

【速读】:该论文试图解决如何从逻辑角度表征时间图神经网络(temporal GNNs)的表达能力问题,特别是分析其在结合空间(图结构)与时间(随时间演化)维度时的特性。解决方案的关键在于将时间GNNs与二维乘积逻辑(two-dimensional product logics)相联系,通过分析图结构和时间组件的不同组合方式,揭示不同架构在表达能力上的差异,从而为时间GNNs提供首个逻辑表征,并确立其相对表达能力的新结果。

链接: https://arxiv.org/abs/2505.11930
作者: Marco Sälzer,Przemysław Andrzej Wałęga,Martin Lange
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:In recent years, the expressive power of various neural architectures – including graph neural networks (GNNs), transformers, and recurrent neural networks – has been characterised using tools from logic and formal language theory. As the capabilities of basic architectures are becoming well understood, increasing attention is turning to models that combine multiple architectural paradigms. Among them particularly important, and challenging to analyse, are temporal extensions of GNNs, which integrate both spatial (graph-structure) and temporal (evolution over time) dimensions. In this paper, we initiate the study of logical characterisation of temporal GNNs by connecting them to two-dimensional product logics. We show that the expressive power of temporal GNNs depends on how graph and temporal components are combined. In particular, temporal GNNs that apply static GNNs recursively over time can capture all properties definable in the product logic of (past) propositional temporal logic PTL and the modal logic K. In contrast, architectures such as graph-and-time TGNNs and global TGNNs can only express restricted fragments of this logic, where the interaction between temporal and spatial operators is syntactically constrained. These results yield the first logical characterisations of temporal GNNs and establish new relative expressiveness results for temporal GNNs.
zh

[AI-165] Modèles de Substitution pour les Modèles à base dAgents : Enjeux Méthodes et Applications

【速读】:该论文旨在解决基于代理的模型(Agent-based Models, ABM)在大规模模拟中计算成本高、难以实时决策和大规模场景分析的问题。其关键解决方案是采用代理模型(surrogate models),通过从稀疏的仿真数据中学习近似函数,以低成本的方式提供准确的预测,从而显著降低计算开销。文中探讨了多种机器学习技术在构建稳健代理模型中的应用,并强调了不确定性量化和敏感性分析在提升模型可靠性与可解释性中的作用。

链接: https://arxiv.org/abs/2505.11912
作者: Paul Saves,Nicolas Verstaevel,Benoît Gaudou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 12 pages, in French language. Les 33èmes Journées Francophones sur les Systèmes Multi-Agents (JFSMA 2025). 2025

点击查看摘要

Abstract:Multi-agent simulations enables the modeling and analyses of the dynamic behaviors and interactions of autonomous entities evolving in complex environments. Agent-based models (ABM) are widely used to study emergent phenomena arising from local interactions. However, their high computational cost poses a significant challenge, particularly for large-scale simulations requiring extensive parameter exploration, optimization, or uncertainty quantification. The increasing complexity of ABM limits their feasibility for real-time decision-making and large-scale scenario analysis. To address these limitations, surrogate models offer an efficient alternative by learning approximations from sparse simulation data. These models provide cheap-to-evaluate predictions, significantly reducing computational costs while maintaining accuracy. Various machine learning techniques, including regression models, neural networks, random forests and Gaussian processes, have been applied to construct robust surrogates. Moreover, uncertainty quantification and sensitivity analysis play a crucial role in enhancing model reliability and interpretability. This article explores the motivations, methods, and applications of surrogate modeling for ABM, emphasizing the trade-offs between accuracy, computational efficiency, and interpretability. Through a case study on a segregation model, we highlight the challenges associated with building and validating surrogate models, comparing different approaches and evaluating their performance. Finally, we discuss future perspectives on integrating surrogate models within ABM to improve scalability, explainability, and real-time decision support across various fields such as ecology, urban planning and economics. Comments: 12 pages, in French language. Les 33èmes Journées Francophones sur les Systèmes Multi-Agents (JFSMA 2025). 2025 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) Cite as: arXiv:2505.11912 [cs.LG] (or arXiv:2505.11912v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.11912 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-166] K*-Means: A Parameter-free Clustering Algorithm

【速读】:该论文试图解决传统聚类方法中需要预先指定聚类数量(k)或依赖隐式确定k的阈值所带来的限制问题。解决方案的关键在于提出一种名为k*-means的新颖聚类算法,该算法通过最小描述长度原则(Minimum Description Length Principle)自动确定最优聚类数量k*,而无需人工设定k或其他参数,同时在分裂和合并聚类的过程中优化标准k-means目标函数。

链接: https://arxiv.org/abs/2505.11904
作者: Louis Mahon,Mirella Lapata
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Clustering is a widely used and powerful machine learning technique, but its effectiveness is often limited by the need to specify the number of clusters, k, or by relying on thresholds that implicitly determine k. We introduce k*-means, a novel clustering algorithm that eliminates the need to set k or any other parameters. Instead, it uses the minimum description length principle to automatically determine the optimal number of clusters, k*, by splitting and merging clusters while also optimising the standard k-means objective. We prove that k*-means is guaranteed to converge and demonstrate experimentally that it significantly outperforms existing methods in scenarios where k is unknown. We also show that it is accurate in estimating k, and that empirically its runtime is competitive with existing methods, and scales well with dataset size.
zh

[AI-167] From Recall to Reasoning : Automated Question Generation for Deeper Math Learning through Large Language Models

【速读】:该论文试图解决如何有效利用生成式 AI (Generative AI, GenAI) 来优化高等数学课程内容的生成问题,特别是生成高质量且与课程内容相关的练习题。研究的关键在于通过实证分析发现,当前公开可用的 GenAI 虽然能够在少量支持下生成不同质量的数学问题,但提供示例和相关教学内容能够显著提升输出质量,从而为教育工作者提供了一个改进框架,以更有效地将 GenAI 集成到其教学流程中。

链接: https://arxiv.org/abs/2505.11899
作者: Yongan Yu,Alexandre Krantz,Nikki G. Lobczowski
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, accepted by AIED conference

点击查看摘要

Abstract:Educators have started to turn to Generative AI (GenAI) to help create new course content, but little is known about how they should do so. In this project, we investigated the first steps for optimizing content creation for advanced math. In particular, we looked at the ability of GenAI to produce high-quality practice problems that are relevant to the course content. We conducted two studies to: (1) explore the capabilities of current versions of publicly available GenAI and (2) develop an improved framework to address the limitations we found. Our results showed that GenAI can create math problems at various levels of quality with minimal support, but that providing examples and relevant content results in better quality outputs. This research can help educators decide the ideal way to adopt GenAI in their workflows, to create more effective educational experiences for students.
zh

[AI-168] AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对需要复杂推理的任务时,依赖链式思维(Chain-of-Thought, CoT)提示所带来的计算成本高和效率低的问题,尤其是在处理简单输入时仍生成冗长推理步骤的情况。其解决方案的关键在于提出AdaCoT(自适应链式思维)框架,通过将自适应推理建模为帕累托优化问题,平衡模型性能与CoT调用的成本,并采用基于强化学习(Reinforcement Learning, RL)的方法,特别是近端策略优化(Proximal Policy Optimization, PPO),动态调整CoT触发决策边界,从而根据查询的隐含复杂性判断是否需要调用CoT。此外,Selective Loss Masking (SLM) 是该方案的核心技术贡献,用于防止多阶段强化学习训练中的决策边界崩溃,确保自适应触发的鲁棒性和稳定性。

链接: https://arxiv.org/abs/2505.11896
作者: Chenwei Lou,Zewei Sun,Xinnian Liang,Meng Qu,Wei Shen,Wenqi Wang,Yuntao Li,Qingping Yang,Shuangzhi Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT framed adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL) based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic testset, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.
zh

[AI-169] AdaptMol: Adaptive Fusion from Sequence String to Topological Structure for Few-shot Drug Discovery

【速读】:该论文旨在解决在实验验证数据稀缺的情况下,如何提升生成式 AI 在分子性质预测(MPP)任务中的性能问题。其关键解决方案是提出 AdaptMol 框架,该框架通过集成自适应多模态融合(Adaptive multimodal fusion)机制,利用双重注意力机制动态整合来自 SMILES 序列和分子图的全局与局部分子特征,从而提升分子表示的质量和模型性能。

链接: https://arxiv.org/abs/2505.11878
作者: Yifan Dai(1),Xuanbai Ren(1),Tengfei Ma(1),Qipeng Yan(2),Yiping Liu(1),Yuansheng Liu(1),Xiangxiang Zeng(1) ((1) College of Computer Science and Electronic Engineering, Hunan University, (2) School of Biomedical Science, Hunan University)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Molecular Networks (q-bio.MN)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:Accurate molecular property prediction (MPP) is a critical step in modern drug development. However, the scarcity of experimental validation data poses a significant challenge to AI-driven research paradigms. Under few-shot learning scenarios, the quality of molecular representations directly dictates the theoretical upper limit of model performance. We present AdaptMol, a prototypical network integrating Adaptive multimodal fusion for Molecular representation. This framework employs a dual-level attention mechanism to dynamically integrate global and local molecular features derived from two modalities: SMILES sequences and molecular graphs. (1) At the local level, structural features such as atomic interactions and substructures are extracted from molecular graphs, emphasizing fine-grained topological information; (2) At the global level, the SMILES sequence provides a holistic representation of the molecule. To validate the necessity of multimodal adaptive fusion, we propose an interpretable approach based on identifying molecular active substructures to demonstrate that multimodal adaptive fusion can efficiently represent molecules. Extensive experiments on three commonly used benchmarks under 5-shot and 10-shot settings demonstrate that AdaptMol achieves state-of-the-art performance in most cases. The rationale-extracted method guides the fusion of two modalities and highlights the importance of both modalities.
zh

[AI-170] Position Paper: Bounded Alignment: What (Not) To Expect From AGI Agents

【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)领域对人工通用智能(Artificial General Intelligence, AGI)的愿景与安全评估标准存在的不足问题。论文认为,当前AI/ML社区对AGI的主流观点需要转变,其关键在于将对AGI的安全期望和评估指标更多地基于对现存唯一具有通用智能的实例——即动物,尤其是人类智能的理解。这种视角的转变有助于形成更现实的技术认知,并促进更有效的政策制定。

链接: https://arxiv.org/abs/2505.11866
作者: Ali A. Minai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Paper accepted for the 2025 IEEE/INNS International Joint Conference on Neural Networks, Rome, Italy, June 30 - July 5, 2025

点击查看摘要

Abstract:The issues of AI risk and AI safety are becoming critical as the prospect of artificial general intelligence (AGI) looms larger. The emergence of extremely large and capable generative models has led to alarming predictions and created a stir from boardrooms to legislatures. As a result, AI alignment has emerged as one of the most important areas in AI research. The goal of this position paper is to argue that the currently dominant vision of AGI in the AI and machine learning (AI/ML) community needs to evolve, and that expectations and metrics for its safety must be informed much more by our understanding of the only existing instance of general intelligence, i.e., the intelligence found in animals, and especially in humans. This change in perspective will lead to a more realistic view of the technology, and allow for better policy decisions.
zh

[AI-171] Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning

【速读】:该论文试图解决生成式 AI(Generative AI)在行为上与复杂人类价值观对齐的问题,现有方法通常通过简化为标量奖励来表征人类意图,忽略了人类反馈的多维特性。其解决方案的关键在于引入基于偏好的多目标逆强化学习(preference-based Multi-Objective Inverse Reinforcement Learning, MO-IRL)理论框架,将人类偏好建模为潜在的向量值奖励函数,并通过恢复帕累托最优奖励表示来识别多目标结构,从而实现更精确的行为对齐。

链接: https://arxiv.org/abs/2505.11864
作者: Kalyan Cherukuri,Aarav Lala
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
备注:

点击查看摘要

Abstract:As generative agents become increasingly capable, alignment of their behavior with complex human values remains a fundamental challenge. Existing approaches often simplify human intent through reduction to a scalar reward, overlooking the multi-faceted nature of human feedback. In this work, we introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions. We formalize the problem of recovering a Pareto-optimal reward representation from noisy preference queries and establish conditions for identifying the underlying multi-objective structure. We derive tight sample complexity bounds for recovering \epsilon -approximations of the Pareto front and introduce a regret formulation to quantify suboptimality in this multi-objective setting. Furthermore, we propose a provably convergent algorithm for policy optimization using preference-inferred reward cones. Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors in a high-dimension and value-pluralistic environment.
zh

[AI-172] Q-Policy: Quantum-Enhanced Policy Evaluation for Scalable Reinforcement Learning

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)在计算效率和可扩展性方面的挑战,特别是在传统方法难以有效处理大规模问题时。解决方案的关键在于提出Q-Policy,这是一个结合量子计算原语的混合量子-经典强化学习框架,通过量子叠加编码价值函数,利用幅值编码和量子并行性同时评估多个状态-动作对,从而在策略评估和优化步骤中实现数学上的加速。

链接: https://arxiv.org/abs/2505.11862
作者: Kalyan Cherukuri,Aarav Lala,Yash Yardi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:We propose Q-Policy, a hybrid quantum-classical reinforcement learning (RL) framework that mathematically accelerates policy evaluation and optimization by exploiting quantum computing primitives. Q-Policy encodes value functions in quantum superposition, enabling simultaneous evaluation of multiple state-action pairs via amplitude encoding and quantum parallelism. We introduce a quantum-enhanced policy iteration algorithm with provable polynomial reductions in sample complexity for the evaluation step, under standard assumptions. To demonstrate the technical feasibility and theoretical soundness of our approach, we validate Q-Policy on classical emulations of small discrete control tasks. Due to current hardware and simulation limitations, our experiments focus on showcasing proof-of-concept behavior rather than large-scale empirical evaluation. Our results support the potential of Q-Policy as a theoretical foundation for scalable RL on future quantum devices, addressing RL scalability challenges beyond classical approaches.
zh

[AI-173] Evaluating the Logical Reasoning Abilities of Large Reasoning Models

【速读】:该论文试图解决大型推理模型在逻辑推理能力方面的研究不足问题,特别是其在演绎、归纳、类比和溯因等多样化推理类型及任务格式中的表现不均衡问题。解决方案的关键在于引入LogiEval,这是一个全面的基准测试框架,涵盖多种推理类型和任务形式,并基于高质量的人类考试数据(如LSAT、GMAT)构建。此外,通过一种新的筛选范式,识别出LogiEval-Hard这一具有挑战性的子集,利用小型模型的失败模式可靠地预测大型模型的困难点,从而揭示了不同规模模型在基础推理瓶颈上的共性问题。

链接: https://arxiv.org/abs/2505.11854
作者: Hanmeng Liu,Yiran Ding,Zhizhang Fu,Chaoli Zhang,Xiaozhang Liu,Yue Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities - fundamental to human cognition and independent of domain knowledge - remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across reasoning types and formats, highlighting limitations in their generalization. Our analysis reveals that human performance does not mirror model failure distributions. To foster further research, we curate LogiEval-Hard, a challenging subset identified through a novel screening paradigm where small-model failures (Qwen3-30B-A3B) reliably predict difficulties for larger models. Modern models show striking, consistent failures on LogiEval-Hard. This demonstrates that fundamental reasoning bottlenecks persist across model scales, and establishes LogiEval-Hard as both a diagnostic tool and a rigorous testbed for advancing logical reasoning in LLMs.
zh

[AI-174] VeriReason : Reinforcement Learning with Testbench Feedback for Reasoning -Enhanced Verilog Generation

【速读】:该论文旨在解决基于大型语言模型(Large Language Models, LLMs)的寄存器传输级(Register Transfer Level, RTL)代码生成中存在的训练数据稀缺、规范与代码对齐性差、缺乏验证机制以及泛化与专业化平衡等问题。其解决方案的关键在于提出VeriReason框架,该框架结合了监督微调与引导奖励近端优化(Guided Reward Proximal Optimization, GRPO)强化学习方法,通过精心筛选的训练示例和反馈驱动的奖励模型,将测试平台评估与结构启发式相结合,并嵌入自检能力以实现自主错误纠正。

链接: https://arxiv.org/abs/2505.11849
作者: Yiting Wang,Guoheng Sun,Wanghao Ye,Gang Qu,Ang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:Automating Register Transfer Level (RTL) code generation using Large Language Models (LLMs) offers substantial promise for streamlining digital circuit design and reducing human effort. However, current LLM-based approaches face significant challenges with training data scarcity, poor specification-code alignment, lack of verification mechanisms, and balancing generalization with specialization. Inspired by DeepSeek-R1, we introduce VeriReason, a framework integrating supervised fine-tuning with Guided Reward Proximal Optimization (GRPO) reinforcement learning for RTL generation. Using curated training examples and a feedback-driven reward model, VeriReason combines testbench evaluations with structural heuristics while embedding self-checking capabilities for autonomous error correction. On the VerilogEval Benchmark, VeriReason delivers significant improvements: achieving 83.1% functional correctness on the VerilogEval Machine benchmark, substantially outperforming both comparable-sized models and much larger commercial systems like GPT-4 Turbo. Additionally, our approach demonstrates up to a 2.8X increase in first-attempt functional correctness compared to baseline methods and exhibits robust generalization to unseen designs. To our knowledge, VeriReason represents the first system to successfully integrate explicit reasoning capabilities with reinforcement learning for Verilog generation, establishing a new state-of-the-art for automated RTL synthesis. The models and datasets are available at: this https URL Code is Available at: this https URL
zh

[AI-175] On the Eligibility of LLM s for Counterfactual Reasoning : A Decompositional Study

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在处理反事实推理(Counterfactual Reasoning)时表现不佳的问题,特别是需要明确哪些因素在不同任务和模态中对模型性能产生最大影响。其解决方案的关键在于提出一种分解策略,将反事实生成过程从因果构建到反事实干预的推理进行分阶段分析,从而系统地评估模型在各个阶段的行为,并揭示模态类型和中间推理对性能的影响。

链接: https://arxiv.org/abs/2505.11839
作者: Shuai Yang,Qi Yang,Luoxi Tang,Jeremy Blackburn,Zhaohan Xi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate 11 datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM behavior across each decompositional stage and identify how modality type and intermediate reasoning influence performance. By establishing a structured framework for analyzing counterfactual reasoning, this work contributes to the development of more reliable LLM-based reasoning systems and informs future elicitation strategies.
zh

[AI-176] On Membership Inference Attacks in Knowledge Distillation

【速读】:该论文旨在解决知识蒸馏(knowledge distillation)过程中对成员推理攻击(Membership Inference Attacks, MIAs)的隐私保护问题。研究发现,尽管教师模型和学生模型在整体MIAs准确率上表现相似,但教师模型更有效地保护了成员数据,而学生模型则在非成员数据保护方面表现更优。为增强学生模型对MIAs的隐私保护能力,作者提出了五种隐私保护型蒸馏方法,并验证了这些方法能够有效降低学生模型对MIAs的脆弱性,其中集成方法进一步提升了模型的鲁棒性。关键在于通过改进蒸馏过程来提升学生模型的隐私安全性,从而实现高效且安全的模型压缩。

链接: https://arxiv.org/abs/2505.11837
作者: Ziyao Cui,Minxing Zhang,Jian Pei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Nowadays, Large Language Models (LLMs) are trained on huge datasets, some including sensitive information. This poses a serious privacy concern because privacy attacks such as Membership Inference Attacks (MIAs) may detect this sensitive information. While knowledge distillation compresses LLMs into efficient, smaller student models, its impact on privacy remains underexplored. In this paper, we investigate how knowledge distillation affects model robustness against MIA. We focus on two questions. First, how is private data protected in teacher and student models? Second, how can we strengthen privacy preservation against MIAs in knowledge distillation? Through comprehensive experiments, we show that while teacher and student models achieve similar overall MIA accuracy, teacher models better protect member data, the primary target of MIA, whereas student models better protect non-member data. To address this vulnerability in student models, we propose 5 privacy-preserving distillation methods and demonstrate that they successfully reduce student models’ vulnerability to MIA, with ensembling further stabilizing the robustness, offering a reliable approach for distilling more secure and efficient student models. Our implementation source code is available at this https URL.
zh

[AI-177] SplInterp: Improving our Understanding and Training of Sparse Autoencoders

【速读】:该论文试图解决稀疏自编码器(Sparse Autoencoder, SAE)在机制可解释性中的理论理解不足问题,以及其实际效用的争议。解决方案的关键在于将SAE置于深度学习的样条理论框架中进行分析,揭示其作为分段仿射映射的性质,并通过引入一种具有坚实理论基础的新型优化算法——近端交替随机梯度下降(Proximal Alternating Method SGD, PAM-SGD),提升SAE的训练效率与性能,尤其在样本效率和代码稀疏性方面表现突出。

链接: https://arxiv.org/abs/2505.11836
作者: Jeremy Budd,Javier Ideami,Benjamin Macdowall Rynne,Keith Duggar,Randall Balestriero
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 44 pages, 38 figures, under review

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have received considerable recent attention as tools for mechanistic interpretability, showing success at extracting interpretable features even from very large LLMs. However, this research has been largely empirical, and there have been recent doubts about the true utility of SAEs. In this work, we seek to enhance the theoretical understanding of SAEs, using the spline theory of deep learning. By situating SAEs in this framework: we discover that SAEs generalise k -means autoencoders'' to be piecewise affine, but sacrifice accuracy for interpretability vs. the optimal k -means-esque plus local principal component analysis (PCA)‘’ piecewise affine autoencoder. We characterise the underlying geometry of (TopK) SAEs using power diagrams. And we develop a novel proximal alternating method SGD (PAM-SGD) algorithm for training SAEs, with both solid theoretical foundations and promising empirical results in MNIST and LLM experiments, particularly in sample efficiency and (in the LLM setting) improved sparsity of codes. All code is available at: this https URL
zh

[AI-178] oLeaP: Rethinking Development of Tool Learning with Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在工具学习(Tool Learning)能力上的关键挑战,包括基准测试的局限性、自主学习能力不足、泛化能力有限以及长时序任务求解能力薄弱。其解决方案的关键在于构建一个名为ToLeaP的工具学习平台,通过重现33个基准测试并为7个模型提供一键式评估,同时收集21个潜在训练数据集以支持后续研究。基于该平台对41个LLMs的3000多个失败案例进行分析,论文进一步提出了四个潜在的研究方向,包括真实世界基准构建、兼容性感知的自主学习、通过思考进行的推理学习以及关键线索的识别与回忆。

链接: https://arxiv.org/abs/2505.11833
作者: Haotian Chen,Zijun Song,Boye Niu,Ke Zhang,Litu Ou,Yaxi Lu,Zhong Zhang,Xin Cong,Yankai Lin,Zhiyuan Liu,Maosong Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tool learning, which enables large language models (LLMs) to utilize external tools effectively, has garnered increasing attention for its potential to revolutionize productivity across industries. Despite rapid development in tool learning, key challenges and opportunities remain understudied, limiting deeper insights and future advancements. In this paper, we investigate the tool learning ability of 41 prevalent LLMs by reproducing 33 benchmarks and enabling one-click evaluation for seven of them, forming a Tool Learning Platform named ToLeaP. We also collect 21 out of 33 potential training datasets to facilitate future exploration. After analyzing over 3,000 bad cases of 41 LLMs based on ToLeaP, we identify four main critical challenges: (1) benchmark limitations induce both the neglect and lack of (2) autonomous learning, (3) generalization, and (4) long-horizon task-solving capabilities of LLMs. To aid future advancements, we take a step further toward exploring potential directions, namely (1) real-world benchmark construction, (2) compatibility-aware autonomous learning, (3) rationale learning by thinking, and (4) identifying and recalling key clues. The preliminary experiments demonstrate their effectiveness, highlighting the need for further research and exploration.
zh

[AI-179] ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

【速读】:该论文试图解决当前人工智能系统在评估其通用流体智能(fluid intelligence)方面缺乏细粒度和高认知复杂度基准的问题。解决方案的关键在于引入ARC-AGI-2,这是一个升级版的基准测试集,保留了原有输入-输出对任务格式以保证研究连续性,同时通过新设计和扩展的任务集提供更精细的信号,以评估抽象推理和问题解决能力在更高层次流体智能上的表现。

链接: https://arxiv.org/abs/2505.11831
作者: Francois Chollet,Mike Knoop,Gregory Kamradt,Bryan Landers,Henry Pinkard
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks only requiring minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal to assess abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that highlights the benchmark’s accessibility to human intelligence, yet difficulty for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
zh

[AI-180] Search-Based Correction of Reasoning Chains for Language Models

【速读】:该论文旨在解决链式推理(Chain-of-Thought, CoT)过程中可能包含不准确陈述的问题,这些问题会降低语言模型(Language Models, LMs)的性能和可信度。其解决方案的关键在于引入一种新的自校正框架,该框架在CoT的每个推理步骤中引入一个表示真实性的潜在变量,从而建模所有可能的真实性分配,而非假设整个推理过程始终正确。该方法通过Search Corrector——一种基于布尔值真实性分配的离散搜索算法——高效地探索扩展后的空间,并利用语言模型对真实性和最终答案的联合似然作为代理奖励,实现后验分布中的高效推断。

链接: https://arxiv.org/abs/2505.11824
作者: Minsu Kim,Jean-Pierre Falet,Oliver E. Richardson,Xiaoyin Chen,Moksh Jain,Sungjin Ahn,Sungsoo Ahn,Yoshua Bengio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we introduce a new self-correction framework that augments each reasoning step in a CoT with a latent variable indicating its veracity, enabling modeling of all possible truth assignments rather than assuming correctness throughout. To efficiently explore this expanded space, we introduce Search Corrector, a discrete search algorithm over boolean-valued veracity assignments. It efficiently performs otherwise intractable inference in the posterior distribution over veracity assignments by leveraging the LM’s joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time correction method facilitates supervised fine-tuning of an Amortized Corrector by providing pseudo-labels for veracity. The Amortized Corrector generalizes self-correction, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that Search Corrector reliably identifies errors in logical (ProntoQA) and mathematical reasoning (GSM8K) benchmarks. The Amortized Corrector achieves comparable zero-shot accuracy and improves final answer accuracy by up to 25%.
zh

[AI-181] ChatHTN: Interleaving Approximate (LLM ) and Symbolic HTN Planning

【速读】:该论文试图解决传统分层任务网络(Hierarchical Task Network, HTN)规划中任务分解的复杂性和灵活性不足的问题。解决方案的关键在于将符号化HTN规划技术与ChatGPT的查询相结合,通过混合生成由符号化HTN规划和ChatGPT产生的任务分解层次结构,从而近似得到有效的任务分解方案。尽管ChatGPT生成的结果具有一定的近似性,但ChatHTN在理论上是可靠的,其生成的任何计划都能正确实现输入任务。

链接: https://arxiv.org/abs/2505.11814
作者: Hector Munoz-Avila,David W. Aha,Paola Rizzo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 2nd International Conference on Neuro-symbolic Systems (NeuS) 2025

点击查看摘要

Abstract:We introduce ChatHTN, a Hierarchical Task Network (HTN) planner that combines symbolic HTN planning techniques with queries to ChatGPT to approximate solutions in the form of task decompositions. The resulting hierarchies interleave task decompositions generated by symbolic HTN planning with those generated by ChatGPT. Despite the approximate nature of the results generates by ChatGPT, ChatHTN is provably sound; any plan it generates correctly achieves the input tasks. We demonstrate this property with an open-source implementation of our system.
zh

[AI-182] VITA: Versatile Time Representation Learning for Temporal Hyper-Relational Knowledge Graphs

【速读】:该论文旨在解决时间知识图谱(Temporal Knowledge Graphs, TKGs)中事实的时间有效性建模不足的问题,特别是在处理具有无限有效期的事实以及从时间限定符中提取足够时间有效性信息方面存在局限。其解决方案的关键在于提出一种通用的时间表示学习方法VITA,该方法能够灵活适应所有四种类型的时间有效性(即“自某时起”、“至某时止”、“时间段”和“时间不变”),并通过有效学习时间值和时间跨度的信息来提升链接预测性能。

链接: https://arxiv.org/abs/2505.11803
作者: ChongIn Un,Yuhuan Lu,Tianyue Yang,Dingqi Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:

点击查看摘要

Abstract:Knowledge graphs (KGs) have become an effective paradigm for managing real-world facts, which are not only complex but also dynamically evolve over time. The temporal validity of facts often serves as a strong clue in downstream link prediction tasks, which predicts a missing element in a fact. Traditional link prediction techniques on temporal KGs either consider a sequence of temporal snapshots of KGs with an ad-hoc defined time interval or expand a temporal fact over its validity period under a predefined time granularity; these approaches not only suffer from the sensitivity of the selection of time interval/granularity, but also face the computational challenges when handling facts with long (even infinite) validity. Although the recent hyper-relational KGs represent the temporal validity of a fact as qualifiers describing the fact, it is still suboptimal due to its ignorance of the infinite validity of some facts and the insufficient information encoded from the qualifiers about the temporal validity. Against this background, we propose VITA, a \underlineV ersatile t \underlineI me represen \underlineTA tion learning method for temporal hyper-relational knowledge graphs. We first propose a versatile time representation that can flexibly accommodate all four types of temporal validity of facts (i.e., since, until, period, time-invariant), and then design VITA to effectively learn the time information in both aspects of time value and timespan to boost the link prediction performance. We conduct a thorough evaluation of VITA compared to a sizable collection of baselines on real-world KG datasets. Results show that VITA outperforms the best-performing baselines in various link prediction tasks (predicting missing entities, relations, time, and other numeric literals) by up to 75.3%. Ablation studies and a case study also support our key design choices.
zh

[AI-183] Diffmv: A Unified Diffusion Framework for Healthcare Predictions with Random Missing Views and View Laziness KDD2025

【速读】:该论文旨在解决电子健康记录(Electronic Health Record, EHR)多视图利用中的两个关键问题:随机缺失视图和视图懒惰。针对随机缺失视图,其解决方案的关键在于将多种EHR视图整合到一个统一的扩散-去噪框架中,并通过丰富的上下文条件促进视图的逐步对齐与转换。为缓解视图懒惰问题,提出了一种新颖的重加权策略,用于评估各视图的相对优势,从而在模型中实现各类数据视图的平衡利用。

链接: https://arxiv.org/abs/2505.11802
作者: Chuang Zhao,Hui Tang,Hongke Zhao,Xiaomeng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: SIGKDD2025, accepted

点击查看摘要

Abstract:Advanced healthcare predictions offer significant improvements in patient outcomes by leveraging predictive analytics. Existing works primarily utilize various views of Electronic Health Record (EHR) data, such as diagnoses, lab tests, or clinical notes, for model training. These methods typically assume the availability of complete EHR views and that the designed model could fully leverage the potential of each view. However, in practice, random missing views and view laziness present two significant challenges that hinder further improvements in multi-view utilization. To address these challenges, we introduce Diffmv, an innovative diffusion-based generative framework designed to advance the exploitation of multiple views of EHR data. Specifically, to address random missing views, we integrate various views of EHR data into a unified diffusion-denoising framework, enriched with diverse contextual conditions to facilitate progressive alignment and view transformation. To mitigate view laziness, we propose a novel reweighting strategy that assesses the relative advantages of each view, promoting a balanced utilization of various data views within the model. Our proposed strategy achieves superior performance across multiple health prediction tasks derived from three popular datasets, including multi-view and multi-modality scenarios.
zh

[AI-184] Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling

【速读】:该论文试图解决从自然语言描述中自动生成形式正确且可用的优化模型的问题,这一过程在使用大型语言模型(Large Language Models, LLMs)时常常因幻觉问题而受到挑战。解决方案的关键在于提出一种名为求解器感知强化学习(Solver-Informed Reinforcement Learning, SIRL)的新框架,该框架利用外部优化求解器作为可验证的奖励机制,以显著提升LLMs在优化建模中的真实性。通过经典优化求解器对生成的线性规划(LP)文件进行自动验证,SIRL能够提供精确的反馈信号,包括语法、可行性及解的质量,从而直接指导强化学习过程。

链接: https://arxiv.org/abs/2505.11792
作者: Yitian Chen,Jingfan Xia,Siyu Shao,Dongdong Ge,Yinyu Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optimization modeling is fundamental to decision-making across diverse this http URL progress in automating optimization formulation from natural language descriptions, Large Language Models (LLMs) often struggle to generate formally correct and usable models due to hallucinations, posing a challenge for reliable automation. Inspired by the success of Reinforcement Learning (RL) in enhancing Large Reasoning Models, we present Solver-Informed Reinforcement Learning (SIRL).This novel framework leverages external optimization solvers as verifiable reward mechanisms to significantly improve the authenticity of LLMs for optimization this http URL as precise verifiers, these solvers automatically assess the executable code and the instance-level mathematical model represented by the associated LP file, yielding precise and comprehensive feedback signals – including syntax, feasibility, and solution quality that directly inform the RL process. This automated verification process, powered by classic optimization solvers, also underpins our instance-enhanced self-consistency method to synthesize high-quality training data. Extensive experiments on diverse public benchmarks demonstrate that SIRL achieves state-of-the-art performance, substantially outperforming existing methods in generating accurate and executable optimization models.
zh

[AI-185] Improving Coverag e in Combined Prediction Sets with Weighted p-values

【速读】:该论文试图解决在多模型、多数据源或多次试验场景下,如何有效聚合置信预测集以保持或提升整体覆盖率的问题。传统方法在聚合多个具有独立1-α覆盖率的预测集时,会导致整体覆盖率下降至1-2α,从而削弱了不确定性量化的效果。解决方案的关键在于提出一种加权聚合框架,通过根据每个预测集的贡献分配权重,实现对预测集聚合方式的灵活控制,从而获得介于单个模型的1-α覆盖率与联合模型的1-2α覆盖率之间的更紧致的覆盖率边界。此外,该框架还扩展至数据依赖性权重,并提出了保持有限样本有效性的通用聚合方法。

链接: https://arxiv.org/abs/2505.11785
作者: Gina Wong,Drew Prinster,Suchi Saria,Rama Chellappa,Anqi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Conformal prediction quantifies the uncertainty of machine learning models by augmenting point predictions with valid prediction sets, assuming exchangeability. For complex scenarios involving multiple trials, models, or data sources, conformal prediction sets can be aggregated to create a prediction set that captures the overall uncertainty, often improving precision. However, aggregating multiple prediction sets with individual 1-\alpha coverage inevitably weakens the overall guarantee, typically resulting in 1-2\alpha worst-case coverage. In this work, we propose a framework for the weighted aggregation of prediction sets, where weights are assigned to each prediction set based on their contribution. Our framework offers flexible control over how the sets are aggregated, achieving tighter coverage bounds that interpolate between the 1-2\alpha guarantee of the combined models and the 1-\alpha guarantee of an individual model depending on the distribution of weights. We extend our framework to data-dependent weights, and we derive a general procedure for data-dependent weight aggregation that maintains finite-sample validity. We demonstrate the effectiveness of our methods through experiments on synthetic and real data in the mixture-of-experts setting, and we show that aggregation with data-dependent weights provides a form of adaptive coverage.
zh

[AI-186] A Review and Analysis of a Parallel Approach for Decision Tree Learning from Large Data Streams

【速读】:该论文旨在解决大规模数据流中高效且可扩展的决策树学习问题,其提出的解决方案是pdsCART算法(parallel distributed scalable CART)。该算法的关键在于其三个核心能力:支持从数据流中实时学习并增量构建树结构、能够并行处理高吞吐量的流数据,以及与MapReduce框架的无缝集成,从而确保在分布式计算环境中的兼容性和可扩展性。

链接: https://arxiv.org/abs/2505.11780
作者: Zeinab Shiralizadeh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work studies one of the parallel decision tree learning algorithms, pdsCART, designed for scalable and efficient data analysis. The method incorporates three core capabilities. First, it supports real-time learning from data streams, allowing trees to be constructed incrementally. Second, it enables parallel processing of high-volume streaming data, making it well-suited for large-scale applications. Third, the algorithm integrates seamlessly into the MapReduce framework, ensuring compatibility with distributed computing environments. In what follows, we present the algorithm’s key components along with results highlighting its performance and scalability.
zh

[AI-187] Generative and Contrastive Graph Representation Learning

【速读】:该论文旨在解决图自监督学习(Graph SSL)中节点和图表示生成的效率与效果问题,特别是在缺乏标签数据的情况下提升下游任务(如节点分类、聚类和链接预测)的性能。其解决方案的关键在于提出一种融合对比学习与生成式方法优势的新型架构,通过引入社区感知的节点级对比学习以生成更鲁棒的正负样本对,并结合图级对比学习捕捉全局语义信息,同时采用特征掩码、节点扰动和边扰动的综合增强策略,实现更具鲁棒性和多样性的表征学习。

链接: https://arxiv.org/abs/2505.11776
作者: Jiali Chen,Avijit Mukherjee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures

点击查看摘要

Abstract:Self-supervised learning (SSL) on graphs generates node and graph representations (i.e., embeddings) that can be used for downstream tasks such as node classification, node clustering, and link prediction. Graph SSL is particularly useful in scenarios with limited or no labeled data. Existing SSL methods predominantly follow contrastive or generative paradigms, each excelling in different tasks: contrastive methods typically perform well on classification tasks, while generative methods often excel in link prediction. In this paper, we present a novel architecture for graph SSL that integrates the strengths of both approaches. Our framework introduces community-aware node-level contrastive learning, providing more robust and effective positive and negative node pairs generation, alongside graph-level contrastive learning to capture global semantic information. Additionally, we employ a comprehensive augmentation strategy that combines feature masking, node perturbation, and edge perturbation, enabling robust and diverse representation learning. By incorporating these enhancements, our model achieves superior performance across multiple tasks, including node classification, clustering, and link prediction. Evaluations on open benchmark datasets demonstrate that our model outperforms state-of-the-art methods, achieving a performance lift of 0.23%-2.01% depending on the task and dataset.
zh

[AI-188] HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在数学问题求解评估中过于侧重具有精确解析解或形式化证明的问题,而忽视了应用科学和工程中普遍存在的基于近似的难题。其解决方案的关键在于构建了一个名为HARDMath2的数据集,该数据集包含211个原创问题,涵盖了研究生应用数学课程的核心主题,如边界层分析、WKB方法、非线性偏微分方程的渐近解以及振荡积分的渐近行为。该数据集由哈佛大学相关课程的学生和教师设计与验证,并通过一种创新的协作环境构建,旨在挑战学生编写并完善符合课程大纲的难题,同行验证解法,测试不同模型,并自动将LLM生成的解法与自身答案及数值真实值进行比对。

链接: https://arxiv.org/abs/2505.11774
作者: James V. Roggeveen,Erik Y. Wang,Will Flintoft,Peter Donets,Lucy S. Nathwani,Nickholas Gutierrez,David Ettel,Anton Marius Graf,Siddharth Dandavate,Arjun Nageswaran,Raglan Ward,Ava Williamson,Anne Mykland,Kacper K. Migacz,Yijun Wang,Egemen Bostan,Duy Thuc Nguyen,Zhe He,Marc L. Descoteaux,Felix Yeung,Shida Liu,Jorge García Ponce,Luke Zhu,Yuyang Chen,Ekaterina S. Ivshina,Miguel Fernandez,Minjae Kim,Kennan Gumbs,Matthew Scott Tan,Russell Yang,Mai Hoang,David Brown,Isabella A. Silveira,Lavon Sykes,Ahmed Roman,William Fredenberg,Yiming Chen,Lucas Martin,Yixing Tang,Kelly Werker Smith,Hongyu Liao,Logan G. Wilson,Alexander Dazhen Cai,Andrea Elizabeth Biju,Michael P. Brenner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students’ understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide domain of fields.
zh

[AI-189] Residual Feature Integration is Sufficient to Prevent Negative Transfer

【速读】:该论文试图解决迁移学习中的负迁移(negative transfer)问题,即源域表示与目标域分布不匹配导致性能下降的问题。解决方案的关键在于提出了一种名为残差特征融合(Residual Feature Integration, REFINE)的方法,该方法通过将固定源域表示与可训练的目标域编码器结合,并在生成的联合表示上拟合一个浅层神经网络,从而在适应目标域的同时保留源域的可迁移知识。

链接: https://arxiv.org/abs/2505.11771
作者: Yichen Xu,Ryumei Nakada,Linjun Zhang,Lexin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Transfer learning typically leverages representations learned from a source domain to improve performance on a target task. A common approach is to extract features from a pre-trained model and directly apply them for target prediction. However, this strategy is prone to negative transfer where the source representation fails to align with the target distribution. In this article, we propose Residual Feature Integration (REFINE), a simple yet effective method designed to mitigate negative transfer. Our approach combines a fixed source-side representation with a trainable target-side encoder and fits a shallow neural network on the resulting joint representation, which adapts to the target domain while preserving transferable knowledge from the source domain. Theoretically, we prove that REFINE is sufficient to prevent negative transfer under mild conditions, and derive the generalization bound demonstrating its theoretical benefit. Empirically, we show that REFINE consistently enhances performance across diverse application and data modalities including vision, text, and tabular data, and outperforms numerous alternative solutions. Our method is lightweight, architecture-agnostic, and robust, making it a valuable addition to the existing transfer learning toolbox.
zh

[AI-190] Redefining Neural Operators in d1 Dimensions

【速读】:该论文试图解决神经算子在嵌入空间中演化机制不明确的问题,这一问题限制了设计能够完全捕捉目标系统演化的神经算子的能力。其解决方案的关键在于借鉴偏微分方程(PDE)量子模拟的最新进展,阐明神经算子中的线性演化过程,并在新的 $ d+1 $ 维域上重新定义神经算子。基于此框架,作者提出了Schrödingerised Kernel Neural Operator (SKNO),该方法在 $ d+1 $ 维演化过程中表现出更优的性能,并在多个基准测试和零样本超分辨率任务中取得了最先进的结果。

链接: https://arxiv.org/abs/2505.11766
作者: Haoze Song,Zhihao Li,Xiaobo Zhang,Zecheng Gan,Zhilu Lai,Wei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Neural Operators have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely validated on universally approximating various operators. Although recent advancements following this definition have developed effective modules to better approximate the kernel function defined on the original domain (with d dimensions, d=1, 2, 3… ), the unclarified evolving mechanism in the embedding spaces blocks our view to design neural operators that can fully capture the target system evolution. Drawing on recent breakthroughs in quantum simulation of partial differential equations (PDEs), we elucidate the linear evolution process in neural operators. Based on that, we redefine neural operators on a new d+1 dimensional domain. Within this framework, we implement our proposed Schrödingerised Kernel Neural Operator (SKNO) aligning better with the d+1 dimensional evolution. In experiments, our d+1 dimensional evolving linear block performs far better than others. Also, we test SKNO’s SOTA performance on various benchmark tests and also the zero-shot super-resolution task. In addition, we analyse the impact of different lifting and recovering operators on the prediction within the redefined NO framework, reflecting the alignment between our model and the underlying d+1 dimensional evolution. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph) Cite as: arXiv:2505.11766 [cs.LG] (or arXiv:2505.11766v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.11766 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-191] OMAC: A Broad Optimization Framework for LLM -Based Multi-Agent Collaboration

【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体系统(Multi-Agent Systems, MAS)在设计与优化方面缺乏系统性方法的问题。现有方法多依赖手工设计,导致系统性能受限。其解决方案的关键在于提出OMAC框架,该框架通过识别并优化MAS的五个核心维度,包括代理功能与协作结构,并引入语义初始化器和对比比较器两个策略,实现单维度及多维度的联合优化,从而显著提升复杂任务如代码生成、算术推理和通用推理的表现。

链接: https://arxiv.org/abs/2505.11765
作者: Shijun Li,Hilaf Hasson,Joydeep Ghosh
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, Multi-Agent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high-quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM-based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM-based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on code generation, arithmetic reasoning, and general reasoning tasks against state-of-the-art approaches. Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2505.11765 [cs.MA] (or arXiv:2505.11765v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2505.11765 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-192] opology-Aware Knowledge Propagation in Decentralized Learning

【速读】:该论文旨在解决去中心化学习中分布外(out-of-distribution, OOD)知识在设备间有效传播的问题。研究发现,现有的去中心化学习算法在将OOD知识传播至所有设备方面存在显著不足,且OOD数据的位置及通信拓扑结构对知识传播效果有显著影响。论文提出了一种面向拓扑结构的聚合策略,通过考虑设备间的通信拓扑特性来加速OOD知识的传播,从而在模型性能上相较于无拓扑感知的基线方法平均提升了123%。

链接: https://arxiv.org/abs/2505.11760
作者: Mansi Sakarvadia,Nathaniel Hudson,Tian Li,Ian Foster,Kyle Chard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Decentralized learning enables collaborative training of models across naturally distributed data without centralized coordination or maintenance of a global model. Instead, devices are organized in arbitrary communication topologies, in which they can only communicate with neighboring devices. Each device maintains its own local model by training on its local data and integrating new knowledge via model aggregation with neighbors. Therefore, knowledge is propagated across the topology via successive aggregation rounds. We study, in particular, the propagation of out-of-distribution (OOD) knowledge. We find that popular decentralized learning algorithms struggle to propagate OOD knowledge effectively to all devices. Further, we find that both the location of OOD data within a topology, and the topology itself, significantly impact OOD knowledge propagation. We then propose topology-aware aggregation strategies to accelerate (OOD) knowledge propagation across devices. These strategies improve OOD data accuracy, compared to topology-unaware baselines, by 123% on average across models in a topology.
zh

[AI-193] Reachability Barrier Networks: Learning Hamilton-Jacobi Solutions for Smooth and Flexible Control Barrier Functions

【速读】:该论文旨在解决安全关键型控制器在高维非线性自主系统中的生成难题,特别是传统控制屏障函数(Control Barrier Functions, CBFs)在高维空间中难以生成平滑、准确近似的问题。其解决方案的关键在于利用物理信息神经网络(Physics-Informed Neural Networks, PINNs)计算哈密顿-雅可比(Hamilton-Jacobi, HJ)最优控制解,从而生成平滑的CBFs近似,并通过参数化折扣项实现训练后保守性的调节,同时结合保形预测方法提供概率安全性保证。

链接: https://arxiv.org/abs/2505.11755
作者: Matthew Kim,William Sharpless,Hyun Joe Jeong,Sander Tonkens,Somil Bansal,Sylvia Herbert
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:Recent developments in autonomous driving and robotics underscore the necessity of safety-critical controllers. Control barrier functions (CBFs) are a popular method for appending safety guarantees to a general control framework, but they are notoriously difficult to generate beyond low dimensions. Existing methods often yield non-differentiable or inaccurate approximations that lack integrity, and thus fail to ensure safety. In this work, we use physics-informed neural networks (PINNs) to generate smooth approximations of CBFs by computing Hamilton-Jacobi (HJ) optimal control solutions. These reachability barrier networks (RBNs) avoid traditional dimensionality constraints and support the tuning of their conservativeness post-training through a parameterized discount term. To ensure robustness of the discounted solutions, we leverage conformal prediction methods to derive probabilistic safety guarantees for RBNs. We demonstrate that RBNs are highly accurate in low dimensions, and safer than the standard neural CBF approach in high dimensions. Namely, we showcase the RBNs in a 9D multi-vehicle collision avoidance problem where it empirically proves to be 5.5x safer and 1.9x less conservative than the neural CBFs, offering a promising method to synthesize CBFs for general nonlinear autonomous systems.
zh

[AI-194] POCAII: Parameter Optimization with Conscious Allocation using Iterative Intelligence

【速读】:该论文旨在解决超参数优化(Hyperparameter Optimization, HPO)中的效率与鲁棒性问题,特别是在资源受限的低预算场景下。其提出的解决方案POCAII的关键在于首次将搜索与评估阶段明确分离,并在两个阶段中分别采用系统化的探索与利用策略,从而实现对HPO预算的灵活管理,即在优化初期侧重于生成竞争性配置,而在接近结束时增加评估力度,这使得POCAII在低预算条件下表现出优于现有先进方法(如SMAC、BOHB和DEHB)的性能,并具备更高的鲁棒性和更低的结果方差。

链接: https://arxiv.org/abs/2505.11745
作者: Joshua Inman,Tanmay Khandait,Lalitha Sankar,Giulia Pedrielli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 21 pages, 4 figures

点击查看摘要

Abstract:In this paper we propose for the first time the hyperparameter optimization (HPO) algorithm POCAII. POCAII differs from the Hyperband and Successive Halving literature by explicitly separating the search and evaluation phases and utilizing principled approaches to exploration and exploitation principles during both phases. Such distinction results in a highly flexible scheme for managing a hyperparameter optimization budget by focusing on search (i.e., generating competing configurations) towards the start of the HPO process while increasing the evaluation effort as the HPO comes to an end. POCAII was compared to state of the art approaches SMAC, BOHB and DEHB. Our algorithm shows superior performance in low-budget hyperparameter optimization regimes. Since many practitioners do not have exhaustive resources to assign to HPO, it has wide applications to real-world problems. Moreover, the empirical evidence showed how POCAII demonstrates higher robustness and lower variance in the results. This is again very important when considering realistic scenarios with extremely expensive models to train. Comments: 21 pages, 4 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) Cite as: arXiv:2505.11745 [cs.LG] (or arXiv:2505.11745v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.11745 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-195] Cloud-Based AI Systems: Leverag ing Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing

【速读】:该论文试图解决云系统中故障检测与缓解的实时性与有效性问题,特别是在面对云计算系统快速发展的规模和动态性时,传统故障检测方法难以适应。解决方案的关键在于提出一种基于大规模语言模型(Massive Language Model, LLM)的智能故障检测与自愈机制框架,该框架结合了现有的机器学习故障检测算法与LLM的自然语言理解能力,通过语义上下文处理系统日志、错误报告和实时数据流,采用多层级架构实现故障分类与异常检测,从而在故障发生前预测潜在问题并自动触发自愈机制。

链接: https://arxiv.org/abs/2505.11743
作者: Cheng Ji,Huaiying Luo
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid development of cloud computing systems and the increasing complexity of their infrastructure, intelligent mechanisms to detect and mitigate failures in real time are becoming increasingly important. Traditional methods of failure detection are often difficult to cope with the scale and dynamics of modern cloud environments. In this study, we propose a novel AI framework based on Massive Language Model (LLM) for intelligent fault detection and self-healing mechanisms in cloud systems. The model combines existing machine learning fault detection algorithms with LLM’s natural language understanding capabilities to process and parse system logs, error reports, and real-time data streams through semantic context. The method adopts a multi-level architecture, combined with supervised learning for fault classification and unsupervised learning for anomaly detection, so that the system can predict potential failures before they occur and automatically trigger the self-healing mechanism. Experimental results show that the proposed model is significantly better than the traditional fault detection system in terms of fault detection accuracy, system downtime reduction and recovery speed.
zh

[AI-196] Diverging Towards Hallucination: Detection of Failures in Vision-Language Models via Multi-token Aggregation

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在生成过程中出现的幻觉问题,即模型可能生成不存在的物体或不安全的文本。现有检测方法如单标记线性探针(Single-Token Linear Probing, SLP)和P(True)仅分析首个生成标记的对数或其最高得分部分,忽略了早期标记分布中更丰富的信号。论文的关键解决方案是引入多标记可靠性估计(Multi-Token Reliability Estimation, MTRE),通过分析前十个标记的对数似然比和自注意力机制,结合整个早期对数序列的信息,从而更准确地捕捉VLMs的可靠性动态。

链接: https://arxiv.org/abs/2505.11741
作者: Geigh Zollicoffer,Minh Vu,Manish Bhattarai
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) now rival human performance on many multimodal tasks, yet they still hallucinate objects or generate unsafe text. Current hallucination detectors, e.g., single-token linear probing (SLP) and P(True), typically analyze only the logit of the first generated token or just its highest scoring component overlooking richer signals embedded within earlier token distributions. We demonstrate that analyzing the complete sequence of early logits potentially provides substantially more diagnostic information. We emphasize that hallucinations may only emerge after several tokens, as subtle inconsistencies accumulate over time. By analyzing the Kullback-Leibler (KL) divergence between logits corresponding to hallucinated and non-hallucinated tokens, we underscore the importance of incorporating later-token logits to more accurately capture the reliability dynamics of VLMs. In response, we introduce Multi-Token Reliability Estimation (MTRE), a lightweight, white-box method that aggregates logits from the first ten tokens using multi-token log-likelihood ratios and self-attention. Despite the challenges posed by large vocabulary sizes and long logit sequences, MTRE remains efficient and tractable. On MAD-Bench, MM-SafetyBench, MathVista, and four compositional-geometry benchmarks, MTRE improves AUROC by 9.4 +/- 1.3 points over SLP and by 12.1 +/- 1.7 points over P(True), setting a new state-of-the-art in hallucination detection for open-source VLMs.
zh

[AI-197] Simple and Effective Specialized Representations for Fair Classifiers

【速读】:该论文旨在解决公平分类(fair classification)问题,这一问题在国际法规和高影响决策场景中日益受到关注。现有方法通常依赖于对抗学习或敏感群体间的分布匹配,但这些方法存在不稳定或计算成本高的缺点。该论文提出了一种基于特征函数距离(characteristic function distance)的新方法,其关键在于通过特征函数实现对敏感信息的最小化保留,同时保持下游任务的高有效性,从而在保证公平性的同时提升模型的稳定性和效率。此外,该方法通过简化目标函数实现了对常见分类模型的公平性保障,且不引起性能下降。

链接: https://arxiv.org/abs/2505.11740
作者: Alberto Sinigaglia,Davide Sartor,Marina Ceccon,Gian Antonio Susto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Fair classification is a critical challenge that has gained increasing importance due to international regulations and its growing use in high-stakes decision-making settings. Existing methods often rely on adversarial learning or distribution matching across sensitive groups; however, adversarial learning can be unstable, and distribution matching can be computationally intensive. To address these limitations, we propose a novel approach based on the characteristic function distance. Our method ensures that the learned representation contains minimal sensitive information while maintaining high effectiveness for downstream tasks. By utilizing characteristic functions, we achieve a more stable and efficient solution compared to traditional methods. Additionally, we introduce a simple relaxation of the objective function that guarantees fairness in common classification models with no performance degradation. Experimental results on benchmark datasets demonstrate that our approach consistently matches or achieves better fairness and predictive accuracy than existing methods. Moreover, our method maintains robustness and computational efficiency, making it a practical solution for real-world applications.
zh

[AI-198] Automated Real-time Assessment of Intracranial Hemorrhage Detection AI Using an Ensembled Monitoring Model (EMM)

【速读】:该论文试图解决放射科中部署的生成式 AI (Generative AI) 工具在实际应用中缺乏实时、逐例评估其预测置信度的问题,这导致用户需自行判断 AI 预测的可靠性,从而增加认知负担并可能引发误诊。解决方案的关键是提出一种名为 Ensembled Monitoring Model (EMM) 的框架,该框架受临床共识实践启发,通过多专家评审机制对黑盒商业 AI 产品进行独立监测,无需访问内部组件或中间输出即可提供稳健的置信度评估,从而帮助区分可信与不可信预测,提升 AI 工具的整体性能。

链接: https://arxiv.org/abs/2505.11738
作者: Zhongnan Fang,Andrew Johnston,Lina Cheuy,Hye Sun Na,Magdalini Paschali,Camila Gonzalez,Bonnie A. Armstrong,Arogya Koirala,Derrick Laurel,Andrew Walker Campion,Michael Iv,Akshay S. Chaudhari,David B. Larson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) tools for radiology are commonly unmonitored once deployed. The lack of real-time case-by-case assessments of AI prediction confidence requires users to independently distinguish between trustworthy and unreliable AI predictions, which increases cognitive burden, reduces productivity, and potentially leads to misdiagnoses. To address these challenges, we introduce Ensembled Monitoring Model (EMM), a framework inspired by clinical consensus practices using multiple expert reviews. Designed specifically for black-box commercial AI products, EMM operates independently without requiring access to internal AI components or intermediate outputs, while still providing robust confidence measurements. Using intracranial hemorrhage detection as our test case on a large, diverse dataset of 2919 studies, we demonstrate that EMM successfully categorizes confidence in the AI-generated prediction, suggesting different actions and helping improve the overall performance of AI tools to ultimately reduce cognitive burden. Importantly, we provide key technical considerations and best practices for successfully translating EMM into clinical settings.
zh

[AI-199] Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

【速读】:该论文试图解决在测试时缩放(Test-time scaling, TTS)过程中验证机制对推理性能和计算效率的双重影响问题。传统验证范式通常仅在生成的最终输出或单个生成步骤中进行验证,而未能系统性地研究验证粒度(verification granularity)的影响。解决方案的关键在于提出一种名为Variable Granularity Search (VG-Search) 的统一算法,通过可调的粒度参数 $ g $ 来泛化束搜索(beam search)和 Best-of-N 采样,从而实现对验证频率的动态控制,提升计算效率与模型扩展性。

链接: https://arxiv.org/abs/2505.11730
作者: Hao Mark Chen,Guanxi Lu,Yasuyuki Okoshi,Zhiwen Mo,Masato Motomura,Hongxiang Fan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint. Under review

点击查看摘要

Abstract:Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification, and make the first attempt toward systematically investigating the impact of verification granularity-that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
zh

[AI-200] CLT and Edgeworth Expansion for m-out-of-n Bootstrap Estimators of The Studentized Median

【速读】:该论文旨在解决在估计样本分位数时,对m-out-of-n自助法(m-out-of-n bootstrap)的严格无参数保证缺乏的问题。其解决方案的关键在于通过分析从大小为n的数据集中进行m-out-of-n重采样的样本分位数估计量,首先证明了一个在较弱矩条件下的完全数据驱动的中心极限定理,并且不涉及未知的干扰参数;随后通过构造反例表明该矩假设几乎是紧致的,进一步在略微强化假设的基础上推导出Edgeworth展开,从而提供精确的收敛速率和Berry-Esseen界。最后,通过推导实际统计量的无参数渐近分布,展示了理论在现代估计与学习任务中的实用性。

链接: https://arxiv.org/abs/2505.11725
作者: Imon Banerjee,Sayak Chakrabarty
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
备注: 48 pages

点击查看摘要

Abstract:The m-out-of-n bootstrap, originally proposed by Bickel, Gotze, and Zwet (1992), approximates the distribution of a statistic by repeatedly drawing m subsamples (with m much smaller than n) without replacement from an original sample of size n. It is now routinely used for robust inference with heavy-tailed data, bandwidth selection, and other large-sample applications. Despite its broad applicability across econometrics, biostatistics, and machine learning, rigorous parameter-free guarantees for the soundness of the m-out-of-n bootstrap when estimating sample quantiles have remained elusive. This paper establishes such guarantees by analyzing the estimator of sample quantiles obtained from m-out-of-n resampling of a dataset of size n. We first prove a central limit theorem for a fully data-driven version of the estimator that holds under a mild moment condition and involves no unknown nuisance parameters. We then show that the moment assumption is essentially tight by constructing a counter-example in which the CLT fails. Strengthening the assumptions slightly, we derive an Edgeworth expansion that provides exact convergence rates and, as a corollary, a Berry Esseen bound on the bootstrap approximation error. Finally, we illustrate the scope of our results by deriving parameter-free asymptotic distributions for practical statistics, including the quantiles for random walk Metropolis-Hastings and the rewards of ergodic Markov decision processes, thereby demonstrating the usefulness of our theory in modern estimation and learning tasks. Comments: 48 pages Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML) Cite as: arXiv:2505.11725 [cs.LG] (or arXiv:2505.11725v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.11725 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-201] Zero-Shot Visual Generalization in Robot Manipulation

【速读】:该论文试图解决视觉基础操作策略在多样视觉环境中的鲁棒性问题(vision-based manipulation policies robustness across diverse visual environments),这是机器人学习中的一个重要且未解决的挑战。其解决方案的关键在于将解耦表示学习(disentangled representation learning)与关联记忆原理相结合,从而提升视觉强化学习策略对视觉分布变化的鲁棒性。此外,该研究还通过引入模型等变性文献中的技术,使训练好的神经网络策略对二维平面旋转具有不变性,进一步增强了策略的视觉鲁棒性和对相机扰动的抗性。

链接: https://arxiv.org/abs/2505.11719
作者: Sumeet Batra,Gaurav Sukhatme
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training vision-based manipulation policies that are robust across diverse visual environments remains an important and unresolved challenge in robot learning. Current approaches often sidestep the problem by relying on invariant representations such as point clouds and depth, or by brute-forcing generalization through visual domain randomization and/or large, visually diverse datasets. Disentangled representation learning - especially when combined with principles of associative memory - has recently shown promise in enabling vision-based reinforcement learning policies to be robust to visual distribution shifts. However, these techniques have largely been constrained to simpler benchmarks and toy environments. In this work, we scale disentangled representation learning and associative memory to more visually and dynamically complex manipulation tasks and demonstrate zero-shot adaptability to visual perturbations in both simulation and on real hardware. We further extend this approach to imitation learning, specifically Diffusion Policy, and empirically show significant gains in visual generalization compared to state-of-the-art imitation learning methods. Finally, we introduce a novel technique adapted from the model equivariance literature that transforms any trained neural network policy into one invariant to 2D planar rotations, making our policy not only visually robust but also resilient to certain camera perturbations. We believe that this work marks a significant step towards manipulation policies that are not only adaptable out of the box, but also robust to the complexities and dynamical nature of real-world deployment. Supplementary videos are available at this https URL.
zh

[AI-202] REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning

【速读】:该论文试图解决基于AI的同行评审系统生成的建议过于浅显和过度表扬的问题,其解决方案的关键在于采用一种多目标强化学习(REMOR)训练的推理型大语言模型(LLM)。通过设计一个涵盖评审内容本身(如批评、创新性)及其与论文相关性的多方面奖励函数,并利用PeerRT数据集进行监督微调和Group Relative Policy Optimization(GRPO)训练,最终实现了比人类评审和现有非推理型AI评审系统更高的平均奖励。研究发现,尽管最佳AI与人类评审质量相当,但REMOR能够避免低质量人类评审的长尾问题,其中推理能力是实现这些改进的核心因素。

链接: https://arxiv.org/abs/2505.11718
作者: Pawin Taechoyotin,Daniel Acuna
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures

点击查看摘要

Abstract:AI-based peer review systems tend to produce shallow and overpraising suggestions compared to human feedback. Here, we evaluate how well a reasoning LLM trained with multi-objective reinforcement learning (REMOR) can overcome these limitations. We start by designing a multi-aspect reward function that aligns with human evaluation of reviews. The aspects are related to the review itself (e.g., criticisms, novelty) and the relationship between the review and the manuscript (i.e., relevance). First, we perform supervised fine-tuning of DeepSeek-R1-Distill-Qwen-7B using LoRA on PeerRT, a new dataset of high-quality top AI conference reviews enriched with reasoning traces. We then apply Group Relative Policy Optimization (GRPO) to train two models: REMOR-H (with the human-aligned reward) and REMOR-U (with a uniform reward). Interestingly, the human-aligned reward penalizes aspects typically associated with strong reviews, leading REMOR-U to produce qualitatively more substantive feedback. Our results show that REMOR-U and REMOR-H achieve more than twice the average rewards of human reviews, non-reasoning state-of-the-art agentic multi-modal AI review systems, and general commercial LLM baselines. We found that while the best AI and human reviews are comparable in quality, REMOR avoids the long tail of low-quality human reviews. We discuss how reasoning is key to achieving these improvements and release the Human-aligned Peer Review Reward (HPRR) function, the Peer Review Reasoning-enriched Traces (PeerRT) dataset, and the REMOR models, which we believe can help spur progress in the area.
zh

[AI-203] Bi-Level Policy Optimization with Nyström Hypergradients

【速读】:该论文试图解决传统Actor-Critic (AC)强化学习算法中,策略网络(actor)与价值网络(critic)之间的依赖关系所导致的优化问题,该问题可被建模为一个双层优化(Bilevel Optimization, BLO)问题,即Stackelberg博弈。解决方案的关键在于提出一种新的算法——基于Nyström超梯度的双层策略优化(Bilevel Policy Optimization with Nyström Hypergradients, BLPO),该算法通过嵌套结构来捕捉BLO的层次特性,并利用Nyström方法高效计算超梯度,从而提升算法的稳定性和收敛性。

链接: https://arxiv.org/abs/2505.11714
作者: Arjun Prakash,Naicheng He,Denizalp Goktas,Amy Greenwald
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:The dependency of the actor on the critic in actor-critic (AC) reinforcement learning means that AC can be characterized as a bilevel optimization (BLO) problem, also called a Stackelberg game. This characterization motivates two modifications to vanilla AC algorithms. First, the critic’s update should be nested to learn a best response to the actor’s policy. Second, the actor should update according to a hypergradient that takes changes in the critic’s behavior into account. Computing this hypergradient involves finding an inverse Hessian vector product, a process that can be numerically unstable. We thus propose a new algorithm, Bilevel Policy Optimization with Nyström Hypergradients (BLPO), which uses nesting to account for the nested structure of BLO, and leverages the Nyström method to compute the hypergradient. Theoretically, we prove BLPO converges to (a point that satisfies the necessary conditions for) a local strong Stackelberg equilibrium in polynomial time with high probability, assuming a linear parametrization of the critic’s objective. Empirically, we demonstrate that BLPO performs on par with or better than PPO on a variety of discrete and continuous control tasks.
zh

[AI-204] DMN-Guided Prompting: A Low-Code Framework for Controlling LLM Behavior

【速读】:该论文试图解决在知识密集型流程中,大型语言模型(Large Language Models, LLMs)因提示策略和质量依赖性而难以有效实现决策逻辑自动化的问题。解决方案的关键在于引入一种基于决策模型与符号(Decision Model and Notation, DMN)的引导式提示框架,该框架通过将复杂的决策逻辑分解为可管理的组件,并引导LLMs沿着结构化的决策路径进行处理,从而提升提示的有效性和可控性。

链接: https://arxiv.org/abs/2505.11701
作者: Shaghayegh Abedi,Amin Jalali
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Large Language Models, Decision Model and Notation, Prompt Engineering, Automated Feedback

点击查看摘要

Abstract:Large Language Models (LLMs) have shown considerable potential in automating decision logic within knowledge-intensive processes. However, their effectiveness largely depends on the strategy and quality of prompting. Since decision logic is typically embedded in prompts, it becomes challenging for end users to modify or refine it. Decision Model and Notation (DMN) offers a standardized graphical approach for defining decision logic in a structured, user-friendly manner. This paper introduces a DMN-guided prompting framework that breaks down complex decision logic into smaller, manageable components, guiding LLMs through structured decision pathways. We implemented the framework in a graduate-level course where students submitted assignments. The assignments and DMN models representing feedback instructions served as inputs to our framework. The instructor evaluated the generated feedback and labeled it for performance assessment. Our approach demonstrated promising results, outperforming chain-of-thought (CoT) prompting. Students also responded positively to the generated feedback, reporting high levels of perceived usefulness in a survey based on the Technology Acceptance Model.
zh

[AI-205] Conditional Deep Generative Models for Belief State Planning

【速读】:该论文试图解决部分可观测马尔可夫决策过程(Partially Observable Markov Decision Processes, POMDPs)中高维状态下信念(belief)准确表示的问题。传统信念表示方法在处理高维状态和大量观测数据时存在局限性,而本文提出了一种基于条件深度生成模型(conditional deep generative models, cDGMs)的新方法来表示信念。cDGMs能够有效处理高维状态和大规模观测数据,并且可以从后验信念中生成任意数量的样本,从而提升信念表示的准确性与规划性能。实验结果表明,cDGMs在任务无关的信念准确性度量以及规划性能方面均优于粒子滤波基线方法。

链接: https://arxiv.org/abs/2505.11698
作者: Antoine Bigeard,Anthony Corso,Mykel Kochenderfer
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Partially observable Markov decision processes (POMDPs) are used to model a wide range of applications, including robotics, autonomous vehicles, and subsurface problems. However, accurately representing the belief is difficult for POMDPs with high-dimensional states. In this paper, we propose a novel approach that uses conditional deep generative models (cDGMs) to represent the belief. Unlike traditional belief representations, cDGMs are well-suited for high-dimensional states and large numbers of observations, and they can generate an arbitrary number of samples from the posterior belief. We train the cDGMs on data produced by random rollout trajectories and show their effectiveness in solving a mineral exploration POMDP with a large and continuous state space. The cDGMs outperform particle filter baselines in both task-agnostic measures of belief accuracy as well as in planning performance.
zh

[AI-206] Qronos: Correcting the Past by Shaping the Future… in Post-Training Quantization

【速读】:该论文试图解决神经网络量化过程中由于权重和激活值量化引起的误差问题,以及前层量化带来的误差累积问题。解决方案的关键在于提出一种基于可解释且系统化优化框架的迭代算法——Qronos,该算法通过最优更新规则在每一步交替进行误差校正和扩散,从而有效减少量化带来的性能损失。此外,Qronos能够高效实现,利用Cholesky分解求解最小二乘问题,并兼容多种现有转换技术,如基于Hadamard的非相干处理和权重-激活缩放均衡等。

链接: https://arxiv.org/abs/2505.11695
作者: Shihao Zhang,Haoyu Zhang,Ian Colbert,Rayan Saab
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We introduce Qronos – a new state-of-the-art post-training quantization algorithm that sequentially rounds and updates neural network weights. Qronos not only explicitly corrects errors due to both weight and activation quantization, but also errors resulting from quantizing previous layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that subsumes and surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an efficient implementation that uses the Cholesky decomposition for solving least-squares problems. We also demonstrate that Qronos is compatible with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent autoregressive language generation models in the Llama3 family; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches.
zh

[AI-207] Neural Networks as Universal Finite-State Machines: A Constructive Deterministic Finite Automaton Theory

【速读】:该论文试图解决如何将前馈神经网络作为通用有限状态机(N-FSMs)进行理论和实证建模的问题,具体是证明有限深度的ReLU和阈值网络能够精确模拟确定性有限自动机(DFAs)。解决方案的关键在于通过将状态转移展开为深度神经层来实现状态转换的模拟,并对所需深度、宽度及状态压缩进行了形式化描述。研究还表明,DFA转移是线性可分的,二进制阈值激活函数允许指数级压缩,同时Myhill-Nerode等价类可以在保持可分性的前提下嵌入到连续潜在空间中。此外,论文还明确了表达能力的边界:固定深度的前馈网络无法识别需要无界记忆的非正则语言。

链接: https://arxiv.org/abs/2505.11694
作者: Sahil Rajesh Dhayalkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注: 15 pages, 1 figure

点击查看摘要

Abstract:We present a complete theoretical and empirical framework establishing feedforward neural networks as universal finite-state machines (N-FSMs). Our results prove that finite-depth ReLU and threshold networks can exactly simulate deterministic finite automata (DFAs) by unrolling state transitions into depth-wise neural layers, with formal characterizations of required depth, width, and state compression. We demonstrate that DFA transitions are linearly separable, binary threshold activations allow exponential compression, and Myhill-Nerode equivalence classes can be embedded into continuous latent spaces while preserving separability. We also formalize the expressivity boundary: fixed-depth feedforward networks cannot recognize non-regular languages requiring unbounded memory. Unlike prior heuristic or probing-based studies, we provide constructive proofs and design explicit DFA-unrolled neural architectures that empirically validate every claim. Our results bridge deep learning, automata theory, and neural-symbolic computation, offering a rigorous blueprint for how discrete symbolic processes can be realized in continuous neural systems.
zh

[AI-208] he Geometry of ReLU Networks through the ReLU Transition Graph

【速读】:该论文试图解决ReLU神经网络结构复杂性与性能之间关系的理论分析问题,特别是如何通过图论方法揭示其表达能力、泛化能力和鲁棒性的内在机制。解决方案的关键在于引入一种名为ReLU Transition Graph (RTG) 的组合对象,该结构将网络的每个线性区域作为节点,并通过单个神经元翻转差异连接相邻区域,从而构建出一个能够反映网络内部动态变化的图模型。基于RTG结构,作者推导出一系列理论结果,包括RTG的大小和直径的紧致组合界、RTG连通性的证明以及VC-维的图论解释,为理解网络复杂性提供了新的视角。

链接: https://arxiv.org/abs/2505.11692
作者: Sahil Rajesh Dhayalkar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:We develop a novel theoretical framework for analyzing ReLU neural networks through the lens of a combinatorial object we term the ReLU Transition Graph (RTG). In this graph, each node corresponds to a linear region induced by the network’s activation patterns, and edges connect regions that differ by a single neuron flip. Building on this structure, we derive a suite of new theoretical results connecting RTG geometry to expressivity, generalization, and robustness. Our contributions include tight combinatorial bounds on RTG size and diameter, a proof of RTG connectivity, and graph-theoretic interpretations of VC-dimension. We also relate entropy and average degree of the RTG to generalization error. Each theoretical result is rigorously validated via carefully controlled experiments across varied network depths, widths, and data regimes. This work provides the first unified treatment of ReLU network structure via graph theory and opens new avenues for compression, regularization, and complexity control rooted in RTG analysis.
zh

[AI-209] Second SIGIR Workshop on Simulations for Information Access (Sim4IA 2025) SIGIR’25 SIGIR

【速读】:该论文试图解决在信息获取(Information Access, IA)领域中,由于真实用户不可用或因伦理原因无法参与而导致的交互场景研究与评估难题。其解决方案的关键在于利用模拟技术,通过构建仿真环境来支持IA研究和评估,从而更好地理解用户行为、降低实验复杂性并提高结果的可重复性。

链接: https://arxiv.org/abs/2505.11687
作者: Philipp Schaer,Christin Katharina Kreutz,Krisztian Balog,Timo Breuer,Andreas Konstantin Kruff
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), July 13–18, 2025, Padua, Italy

点击查看摘要

Abstract:Simulations in information access (IA) have recently gained interest, as shown by various tutorials and workshops around that topic. Simulations can be key contributors to central IA research and evaluation questions, especially around interactive settings when real users are unavailable, or their participation is impossible due to ethical reasons. In addition, simulations in IA can help contribute to a better understanding of users, reduce complexity of evaluation experiments, and improve reproducibility. Building on recent developments in methods and toolkits, the second iteration of our Sim4IA workshop aims to again bring together researchers and practitioners to form an interactive and engaging forum for discussions on the future perspectives of the field. An additional aim is to plan an upcoming TREC/CLEF campaign.
zh

[AI-210] OT Score: An OT based Confidence Score for Unsupervised Domain Adaptation

【速读】:该论文旨在解决无监督域适应(Unsupervised Domain Adaptation, UDA)中现有分布对齐方法在估计分类性能和置信度方面的计算和理论局限性,特别是在没有目标域标签的情况下。其解决方案的关键在于引入一种基于半离散最优传输对齐所诱导的决策边界灵活性的新型理论分析得出的置信度度量——最优传输(Optimal Transport, OT)得分。该方法不仅具有直观的可解释性、理论上的严谨性,还具备计算效率,能够在不重新训练模型的情况下为任意目标伪标签集提供合理的不确定性估计,并能灵活适应不同来源信息的可用程度。

链接: https://arxiv.org/abs/2505.11669
作者: Yiming Zhang,Sitong Liu,Alex Cloninger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We address the computational and theoretical limitations of existing distributional alignment methods for unsupervised domain adaptation (UDA), particularly regarding the estimation of classification performance and confidence without target labels. Current theoretical frameworks for these methods often yield computationally intractable quantities and fail to adequately reflect the properties of the alignment algorithms employed. To overcome these challenges, we introduce the Optimal Transport (OT) score, a confidence metric derived from a novel theoretical analysis that exploits the flexibility of decision boundaries induced by Semi-Discrete Optimal Transport alignment. The proposed OT score is intuitively interpretable, theoretically rigorous, and computationally efficient. It provides principled uncertainty estimates for any given set of target pseudo-labels without requiring model retraining, and can flexibly adapt to varying degrees of available source information. Experimental results on standard UDA benchmarks demonstrate that classification accuracy consistently improves by identifying and removing low-confidence predictions, and that OT score significantly outperforms existing confidence metrics across diverse adaptation scenarios.
zh

[AI-211] Learning from Less: Guiding Deep Reinforcement Learning with Differentiable Symbolic Planning

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)代理在面对复杂问题时,缺乏类似人类的“先验知识”导致需要大量训练交互才能表现出适应性行为的问题。解决方案的关键在于提出一种名为Dylan的可微分符号规划器(Differentiable Symbolic Planner),该框架将符号规划整合到强化学习中,通过利用人类先验知识动态调整奖励机制,引导智能体完成中间子任务,从而提升探索效率,并作为高层规划器组合基础策略以生成新行为,同时避免传统符号规划器常见的无限执行循环等问题。

链接: https://arxiv.org/abs/2505.11661
作者: Zihan Ye,Oleg Arenz,Kristian Kersting
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: conference paper, 9 pages

点击查看摘要

Abstract:When tackling complex problems, humans naturally break them down into smaller, manageable subtasks and adjust their initial plans based on observations. For instance, if you want to make coffee at a friend’s place, you might initially plan to grab coffee beans, go to the coffee machine, and pour them into the machine. Upon noticing that the machine is full, you would skip the initial steps and proceed directly to brewing. In stark contrast, state of the art reinforcement learners, such as Proximal Policy Optimization (PPO), lack such prior knowledge and therefore require significantly more training steps to exhibit comparable adaptive behavior. Thus, a central research question arises: \textitHow can we enable reinforcement learning (RL) agents to have similar ``human priors’‘, allowing the agent to learn with fewer training interactions? To address this challenge, we propose differentiable symbolic planner (Dylan), a novel framework that integrates symbolic planning into Reinforcement Learning. Dylan serves as a reward model that dynamically shapes rewards by leveraging human priors, guiding agents through intermediate subtasks, thus enabling more efficient exploration. Beyond reward shaping, Dylan can work as a high level planner that composes primitive policies to generate new behaviors while avoiding common symbolic planner pitfalls such as infinite execution loops. Our experimental evaluations demonstrate that Dylan significantly improves RL agents’ performance and facilitates generalization to unseen tasks.
zh

[AI-212] FLOW-BENCH: Towards Conversational Generation of Enterprise Workflows

【速读】:该论文旨在解决如何利用生成式 AI (Generative AI) 将自然语言 (Natural Language, NL) 指令转化为结构化的业务流程定义的问题,从而推动业务流程自动化 (Business Process Automation, BPA) 的发展。其解决方案的关键在于提出 FLOW-GEN 方法,该方法通过大型语言模型 (Large Language Models, LLMs) 将自然语言转换为具有 Python 语法的中间表示,进而支持最终转换为广泛使用的业务流程定义语言(如 BPMN 和 DMN)。此外,论文还构建了 FLOW-BENCH 数据集,用于评估基于自然语言的 BPA 工具,为该领域的研究提供基准支持。

链接: https://arxiv.org/abs/2505.11646
作者: Evelyn Duesterwald,Siyu Huo,Vatche Isahagian,K.R. Jayaram,Ritesh Kumar,Vinod Muthusamy,Punleuk Oum,Debashish Saha,Gegi Thomas,Praveen Venkateswaran
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Business process automation (BPA) that leverages Large Language Models (LLMs) to convert natural language (NL) instructions into structured business process artifacts is becoming a hot research topic. This paper makes two technical contributions – (i) FLOW-BENCH, a high quality dataset of paired natural language instructions and structured business process definitions to evaluate NL-based BPA tools, and support bourgeoning research in this area, and (ii) FLOW-GEN, our approach to utilize LLMs to translate natural language into an intermediate representation with Python syntax that facilitates final conversion into widely adopted business process definition languages, such as BPMN and DMN. We bootstrap FLOW-BENCH by demonstrating how it can be used to evaluate the components of FLOW-GEN across eight LLMs of varying sizes. We hope that FLOW-GEN and FLOW-BENCH catalyze further research in BPA making it more accessible to novice and expert users.
zh

[AI-213] PeerGuard: Defending Multi-Agent Systems Against Backdoor Attacks Through Mutual Reasoning

【速读】:该论文试图解决多智能体系统中由于后门漏洞导致的安全问题,特别是针对交互式智能体的中毒检测问题。解决方案的关键在于利用智能体之间的相互推理能力,通过评估其他智能体的响应来检测逻辑上不合理的推理过程,从而识别出被污染的智能体。这种方法在基于大语言模型(LLM)的多智能体系统中得到了验证,表现出较高的准确率并有效降低了对正常智能体的误报率。

链接: https://arxiv.org/abs/2505.11642
作者: Falong Fan,Xi Li
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-agent systems leverage advanced AI models as autonomous agents that interact, cooperate, or compete to complete complex tasks across applications such as robotics and traffic management. Despite their growing importance, safety in multi-agent systems remains largely underexplored, with most research focusing on single AI models rather than interacting agents. This work investigates backdoor vulnerabilities in multi-agent systems and proposes a defense mechanism based on agent interactions. By leveraging reasoning abilities, each agent evaluates responses from others to detect illogical reasoning processes, which indicate poisoned agents. Experiments on LLM-based multi-agent systems, including ChatGPT series and Llama 3, demonstrate the effectiveness of the proposed method, achieving high accuracy in identifying poisoned agents while minimizing false positives on clean agents. We believe this work provides insights into multi-agent system safety and contributes to the development of robust, trustworthy AI interactions.
zh

[AI-214] Chatting with Papers: A Hybrid Approach Using LLM s and Knowledge Graphs

【速读】:该论文试图解决研究人员在处理文献集合时的导航与信息获取问题,特别是在理解特定概念及其上下文以及精炼研究问题方面的需求。解决方案的关键在于提出了一种名为\textitGhostWriter的工作流,该工作流结合了大型语言模型(Large Language Models)和知识图谱(Knowledge Graphs)的技术,通过后端工具套件\textitEverythingData提供查询和“对话”功能,从而支持对文献集合的交互式探索与理解。

链接: https://arxiv.org/abs/2505.11633
作者: Vyacheslav Tykhonov,Han Yang,Philipp Mayr,Jetze Touber,Andrea Scharnhorst
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, Submitted to Joint Workshop of the 5th AI + Informetrics (AII) and the 6th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE)

点击查看摘要

Abstract:This demo paper reports on a new workflow \textitGhostWriter that combines the use of Large Language Models and Knowledge Graphs (semantic artifacts) to support navigation through collections. Situated in the research area of Retrieval Augmented Generation, this specific workflow details the creation of local and adaptable chatbots. Based on the tool-suite \textitEverythingData at the backend, \textitGhostWriter provides an interface that enables querying and ``chatting’’ with a collection. Applied iteratively, the workflow supports the information needs of researchers when interacting with a collection of papers, whether it be to gain an overview, to learn more about a specific concept and its context, and helps the researcher ultimately to refine their research question in a controlled way. We demonstrate the workflow for a collection of articles from the \textitmethod data analysis journal published by GESIS – Leibniz-Institute for the Social Sciences. We also point to further application areas.
zh

[AI-215] Nearest Neighbor Multivariate Time Series Forecasting

【速读】:该论文旨在解决多变量时间序列(Multivariate Time Series, MTS)预测中现有时空图神经网络(Spatial-Temporal Graph Neural Networks, STGNNs)因计算复杂性限制只能使用有限长度输入数据、无法识别整个数据集中的相似模式以及在处理变量间稀疏且不连续相关性的数据时表现不佳的问题。其解决方案的关键在于提出一种简单而有效的k近邻MTS预测(kNN-MTS)框架,该框架通过在大规模缓存序列数据存储中使用MTS模型的表示进行相似性搜索,实现最近邻检索机制,从而无需额外训练即可在测试时直接访问整个数据集,提升模型表达能力并有效提取跨多变量的稀疏分布但相似模式。此外,设计了混合时空编码器(HSTEncoder)以捕捉长期时间依赖和短期时空依赖,进一步提升预测性能。

链接: https://arxiv.org/abs/2505.11625
作者: Huiliang Zhang,Ping Nie,Lijun Sun,Benoit Boulet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Multivariate time series (MTS) forecasting has a wide range of applications in both industry and academia. Recently, spatial-temporal graph neural networks (STGNNs) have gained popularity as MTS forecasting methods. However, current STGNNs can only use the finite length of MTS input data due to the computational complexity. Moreover, they lack the ability to identify similar patterns throughout the entire dataset and struggle with data that exhibit sparsely and discontinuously distributed correlations among variables over an extensive historical period, resulting in only marginal improvements. In this article, we introduce a simple yet effective k-nearest neighbor MTS forecasting ( kNN-MTS) framework, which forecasts with a nearest neighbor retrieval mechanism over a large datastore of cached series, using representations from the MTS model for similarity search. This approach requires no additional training and scales to give the MTS model direct access to the whole dataset at test time, resulting in a highly expressive model that consistently improves performance, and has the ability to extract sparse distributed but similar patterns spanning over multivariables from the entire dataset. Furthermore, a hybrid spatial-temporal encoder (HSTEncoder) is designed for kNN-MTS which can capture both long-term temporal and short-term spatial-temporal dependencies and is shown to provide accurate representation for kNN-MTSfor better forecasting. Experimental results on several real-world datasets show a significant improvement in the forecasting performance of kNN-MTS. The quantitative analysis also illustrates the interpretability and efficiency of kNN-MTS, showing better application prospects and opening up a new path for efficiently using the large dataset in MTS models.
zh

[AI-216] A Classical View on Benign Overfitting: The Role of Sample Size

【速读】:该论文试图解决机器学习中“几乎良性过拟合”(almost benign overfitting)现象的理论理解问题,即模型在训练数据上达到极小的训练误差的同时,仍能保持良好的泛化性能。其解决方案的关键在于通过分析样本量与模型复杂度之间的相互作用,揭示大模型如何在实现良好拟合的同时接近贝叶斯最优泛化。研究通过两个案例——核岭回归和使用梯度流训练的两层全连接ReLU神经网络——提供了理论证据,并采用了一种新的证明方法,将过拟合风险分解为估计误差和近似误差,同时将梯度流视为隐式正则化器,从而避免了统一收敛的陷阱。

链接: https://arxiv.org/abs/2505.11621
作者: Junhyung Park,Patrick Bloebaum,Shiva Prasad Kasiviswanathan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: The results here subsume: arXiv:2410.06191

点击查看摘要

Abstract:Benign overfitting is a phenomenon in machine learning where a model perfectly fits (interpolates) the training data, including noisy examples, yet still generalizes well to unseen data. Understanding this phenomenon has attracted considerable attention in recent years. In this work, we introduce a conceptual shift, by focusing on almost benign overfitting, where models simultaneously achieve both arbitrarily small training and test errors. This behavior is characteristic of neural networks, which often achieve low (but non-zero) training error while still generalizing well. We hypothesize that this almost benign overfitting can emerge even in classical regimes, by analyzing how the interaction between sample size and model complexity enables larger models to achieve both good training fit but still approach Bayes-optimal generalization. We substantiate this hypothesis with theoretical evidence from two case studies: (i) kernel ridge regression, and (ii) least-squares regression using a two-layer fully connected ReLU neural network trained via gradient flow. In both cases, we overcome the strong assumptions often required in prior work on benign overfitting. Our results on neural networks also provide the first generalization result in this setting that does not rely on any assumptions about the underlying regression function or noise, beyond boundedness. Our analysis introduces a novel proof technique based on decomposing the excess risk into estimation and approximation errors, interpreting gradient flow as an implicit regularizer, that helps avoid uniform convergence traps. This analysis idea could be of independent interest. Comments: The results here subsume: arXiv:2410.06191 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) Cite as: arXiv:2505.11621 [cs.LG] (or arXiv:2505.11621v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.11621 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-217] Benchmarking Spatiotemporal Reasoning in LLM Reasoning in LLMs and Reasoning Models: Capabilities and Challenges

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)和大型推理模型(Large Reasoning Models, LRMs)在复杂时空信号推理能力方面的不足。其解决方案的关键是提出一个分层的时空推理基准测试框架STARK,该框架从三个层次评估模型的推理能力:状态估计、时空状态推理以及融合世界知识的推理。通过构建26个具有不同传感器模态的时空任务,共计14,552个挑战,STARK为评估模型在几何推理、时空关系推断及结合上下文与领域知识的推理任务中的表现提供了系统性方法。

链接: https://arxiv.org/abs/2505.11618
作者: Pengrui Quan,Brian Wang,Kang Yang,Liying Han,Mani Srivastava
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Spatiotemporal reasoning plays a key role in Cyber-Physical Systems (CPS). Despite advances in Large Language Models (LLMs) and Large Reasoning Models (LRMs), their capacity to reason about complex spatiotemporal signals remains underexplored. This paper proposes a hierarchical SpatioTemporal reAsoning benchmaRK, STARK, to systematically evaluate LLMs across three levels of reasoning complexity: state estimation (e.g., predicting field variables, localizing and tracking events in space and time), spatiotemporal reasoning over states (e.g., inferring spatial-temporal relationships), and world-knowledge-aware reasoning that integrates contextual and domain knowledge (e.g., intent prediction, landmark-aware navigation). We curate 26 distinct spatiotemporal tasks with diverse sensor modalities, comprising 14,552 challenges where models answer directly or by Python Code Interpreter. Evaluating 3 LRMs and 8 LLMs, we find LLMs achieve limited success in tasks requiring geometric reasoning (e.g., multilateration or triangulation), particularly as complexity increases. Surprisingly, LRMs show robust performance across tasks with various levels of difficulty, often competing or surpassing traditional first-principle-based methods. Our results show that in reasoning tasks requiring world knowledge, the performance gap between LLMs and LRMs narrows, with some LLMs even surpassing LRMs. However, the LRM o3 model continues to achieve leading performance across all evaluated tasks, a result attributed primarily to the larger size of the reasoning models. STARK motivates future innovations in model architectures and reasoning paradigms for intelligent CPS by providing a structured framework to identify limitations in the spatiotemporal reasoning of LLMs and LRMs.
zh

[AI-218] Heart2Mind: Human-Centered Contestable Psychiatric Disorder Diagnosis System using Wearable ECG Monitors

【速读】:该论文旨在解决精神障碍诊断中因主观评估和可及性问题导致的临床挑战,从而可能导致治疗延迟。其解决方案的关键在于开发一种以人为本的可争议精神障碍诊断系统——Heart2Mind,该系统利用可穿戴心电图(ECG)监测设备获取心脏生物标志物,如心率变异性(HRV)和R-R间隔(RRI)时间序列,作为自主神经功能障碍的客观指标,并结合多尺度时频变换和可争议的大型语言模型(LLMs)实现透明且可验证的诊断过程。

链接: https://arxiv.org/abs/2505.11612
作者: Hung Nguyen,Alireza Rahimi,Veronica Whitford,Hélène Fournier,Irina Kondratova,René Richard,Hung Cao
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 41 pages

点击查看摘要

Abstract:Psychiatric disorders affect millions globally, yet their diagnosis faces significant challenges in clinical practice due to subjective assessments and accessibility concerns, leading to potential delays in treatment. To help address this issue, we present Heart2Mind, a human-centered contestable psychiatric disorder diagnosis system using wearable electrocardiogram (ECG) monitors. Our approach leverages cardiac biomarkers, particularly heart rate variability (HRV) and R-R intervals (RRI) time series, as objective indicators of autonomic dysfunction in psychiatric conditions. The system comprises three key components: (1) a Cardiac Monitoring Interface (CMI) for real-time data acquisition from Polar H9/H10 devices; (2) a Multi-Scale Temporal-Frequency Transformer (MSTFT) that processes RRI time series through integrated time-frequency domain analysis; (3) a Contestable Diagnosis Interface (CDI) combining Self-Adversarial Explanations (SAEs) with contestable Large Language Models (LLMs). Our MSTFT achieves 91.7% accuracy on the HRV-ACC dataset using leave-one-out cross-validation, outperforming state-of-the-art methods. SAEs successfully detect inconsistencies in model predictions by comparing attention-based and gradient-based explanations, while LLMs enable clinicians to validate correct predictions and contest erroneous ones. This work demonstrates the feasibility of combining wearable technology with Explainable Artificial Intelligence (XAI) and contestable LLMs to create a transparent, contestable system for psychiatric diagnosis that maintains clinical oversight while leveraging advanced AI capabilities. Our implementation is publicly available at: this https URL.
zh

[AI-219] Foundation Models for AI-Enabled Biological Design AAAI2025

【速读】:该论文试图解决如何利用生成式 AI (Generative AI) 进行生物设计的问题,特别是通过大规模、自监督模型在蛋白质工程、小分子设计和基因组序列设计等任务中的应用。解决方案的关键在于构建有效的生物序列建模架构,提升生成过程的可控性,并实现多模态数据的整合,以提高生物序列生成的质量和适用性。

链接: https://arxiv.org/abs/2505.11610
作者: Asher Moldwin,Amarda Shehu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM); Genomics (q-bio.GN)
备注: Published as part of the workshop proceedings at AAAI 2025 in the workshop “Foundation Models for Biological Discoveries”

点击查看摘要

Abstract:This paper surveys foundation models for AI-enabled biological design, focusing on recent developments in applying large-scale, self-supervised models to tasks such as protein engineering, small molecule design, and genomic sequence design. Though this domain is evolving rapidly, this survey presents and discusses a taxonomy of current models and methods. The focus is on challenges and solutions in adapting these models for biological applications, including biological sequence modeling architectures, controllability in generation, and multi-modal integration. The survey concludes with a discussion of open problems and future directions, offering concrete next-steps to improve the quality of biological sequence generation.
zh

[AI-220] Continuous Optimization for Feature Selection with Permutation-Invariant Embedding and Policy-Guided Search KDD2025

【速读】:该论文旨在解决特征选择过程中难以捕捉复杂特征交互以及在不同场景下适应性不足的问题,同时克服现有方法在连续空间中嵌入特征子集时的排列敏感性和基于梯度搜索的凸性假设限制。其解决方案的关键在于提出一种新的框架:首先,通过编码器-解码器范式将特征子集知识嵌入到连续空间中,并通过成对关系建模实现排列不变性;其次,采用基于策略的强化学习方法探索嵌入空间,无需依赖强凸性假设,从而有效提升搜索效率和子集质量。

链接: https://arxiv.org/abs/2505.11601
作者: Rui Liu,Rui Xie,Zijun Yao,Yanjie Fu,Dongjie Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: KDD 2025

点击查看摘要

Abstract:Feature selection removes redundant features to enhanc performance and computational efficiency in downstream tasks. Existing works often struggle to capture complex feature interactions and adapt to diverse scenarios. Recent advances in this domain have incorporated generative intelligence to address these drawbacks by uncovering intricate relationships between features. However, two key limitations remain: 1) embedding feature subsets in a continuous space is challenging due to permutation sensitivity, as changes in feature order can introduce biases and weaken the embedding learning process; 2) gradient-based search in the embedding space assumes convexity, which is rarely guaranteed, leading to reduced search effectiveness and suboptimal subsets. To address these limitations, we propose a new framework that can: 1) preserve feature subset knowledge in a continuous embedding space while ensuring permutation invariance; 2) effectively explore the embedding space without relying on strong convex assumptions. For the first objective, we develop an encoder-decoder paradigm to preserve feature selection knowledge into a continuous embedding space. This paradigm captures feature interactions through pairwise relationships within the subset, removing the influence of feature order on the embedding. Moreover, an inducing point mechanism is introduced to accelerate pairwise relationship computations. For the second objective, we employ a policy-based reinforcement learning (RL) approach to guide the exploration of the embedding space. The RL agent effectively navigates the space by balancing multiple objectives. By prioritizing high-potential regions adaptively and eliminating the reliance on convexity assumptions, the RL agent effectively reduces the risk of converging to local optima. Extensive experiments demonstrate the effectiveness, efficiency, robustness and explicitness of our model.
zh

[AI-221] he Ripple Effect: On Unforeseen Complications of Backdoor Attacks ICML2025

【速读】:该论文试图解决后门复杂性(backdoor complications)问题,即攻击者植入的后门预训练语言模型(PTLM)在适应不同下游任务时可能导致不可预见的输出偏差,从而引发用户怀疑并降低攻击隐蔽性。解决方案的关键在于提出一种基于多任务学习(multi-task learning)的后门复杂性缓解方法,该方法无需事先了解下游任务即可有效减少复杂性,同时保持后门攻击的效果和一致性。

链接: https://arxiv.org/abs/2505.11586
作者: Rui Zhang,Yun Shen,Hongwei Li,Wenbo Jiang,Hanxiao Chen,Yuan Zhang,Guowen Xu,Yang Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks. Our code is available at this https URL.
zh

[AI-222] LLM Agents Are Hypersensitive to Nudges

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂现实环境中进行序列决策和工具使用时,其选择分布的特性及其对不同选择架构的敏感性问题。研究的关键在于通过案例分析,揭示LLMs在决策过程中的行为模式与人类的差异,包括其对提示策略(如默认选项、建议和信息突出)的高度敏感性、信息获取策略的不同以及奖励机制的影响。此外,研究还探索了通过简单提示策略(如零样本思维链)和少量样本提示(结合人类数据)来调整模型选择分布的可能性,但并未解决其对提示的敏感性问题。

链接: https://arxiv.org/abs/2505.11584
作者: Manuel Cherep,Pattie Maes,Nikhil Singh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages, 28 figures

点击查看摘要

Abstract:LLMs are being set loose in complex, real-world environments involving sequential decision-making and tool use. Often, this involves making choices on behalf of human users. However, not much is known about the distribution of such choices, and how susceptible they are to different choice architectures. We perform a case study with a few such LLM models on a multi-attribute tabular decision-making problem, under canonical nudges such as the default option, suggestions, and information highlighting, as well as additional prompting strategies. We show that, despite superficial similarities to human choice distributions, such models differ in subtle but important ways. First, they show much higher susceptibility to the nudges. Second, they diverge in points earned, being affected by factors like the idiosyncrasy of available prizes. Third, they diverge in information acquisition strategies: e.g. incurring substantial cost to reveal too much information, or selecting without revealing any. Moreover, we show that simple prompt strategies like zero-shot chain of thought (CoT) can shift the choice distribution, and few-shot prompting with human data can induce greater alignment. Yet, none of these methods resolve the sensitivity of these models to nudges. Finally, we show how optimal nudges optimized with a human resource-rational model can similarly increase LLM performance for some models. All these findings suggest that behavioral tests are needed before deploying models as agents or assistants acting on behalf of users in complex environments.
zh

[AI-223] Comparing Lexical and Semantic Vector Search Methods When Classifying Medical Documents

【速读】:该论文试图解决的是如何高效且准确地对结构化医学文档进行分类的问题,其核心挑战在于选择合适的向量搜索方法以提升预测准确性并优化执行效率。论文的关键解决方案是对比了现成的语义向量搜索(semantic vector search)与定制化的词汇向量搜索(lexical vector search)模型的效果,发现后者在预测准确性上略优于前者,并且在执行时间上更具优势。这表明在特定应用场景下,传统方法仍具有竞争力,不应被神经网络模型完全取代。

链接: https://arxiv.org/abs/2505.11582
作者: Lee Harris,Philippe De Wilde,James Bentham
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Classification is a common AI problem, and vector search is a typical solution. This transforms a given body of text into a numerical representation, known as an embedding, and modern improvements to vector search focus on optimising speed and predictive accuracy. This is often achieved through neural methods that aim to learn language semantics. However, our results suggest that these are not always the best solution. Our task was to classify rigidly-structured medical documents according to their content, and we found that using off-the-shelf semantic vector search produced slightly worse predictive accuracy than creating a bespoke lexical vector search model, and that it required significantly more time to execute. These findings suggest that traditional methods deserve to be contenders in the information retrieval toolkit, despite the prevalence and success of neural models.
zh

[AI-224] Flash Invariant Point Attention

【速读】:该论文试图解决Invariant Point Attention (IPA)算法在结构生物学中几何感知建模应用时面临的计算复杂度高、输入序列长度受限的问题。其解决方案的关键在于提出FlashIPA,这是一种基于因子分解的IPA重写方法,利用了硬件高效的FlashAttention机制,从而实现了GPU内存和实际运行时间与序列长度的线性关系,显著降低了计算成本并扩展了训练序列长度。

链接: https://arxiv.org/abs/2505.11580
作者: Andrew Liu,Axel Elaldi,Nicholas T Franklin,Nathan Russell,Gurinder S Atwal,Yih-En A Ban,Olivia Viessmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Invariant Point Attention (IPA) is a key algorithm for geometry-aware modeling in structural biology, central to many protein and RNA models. However, its quadratic complexity limits the input sequence length. We introduce FlashIPA, a factorized reformulation of IPA that leverages hardware-efficient FlashAttention to achieve linear scaling in GPU memory and wall-clock time with sequence length. FlashIPA matches or exceeds standard IPA performance while substantially reducing computational costs. FlashIPA extends training to previously unattainable lengths, and we demonstrate this by re-training generative models without length restrictions and generating structures of thousands of residues. FlashIPA is available at this https URL.
zh

[AI-225] oward Adaptive Categories: Dimensional Governance for Agent ic AI

【速读】:该论文试图解决传统分类治理框架在面对动态AI系统时的不足,这些系统基于基础模型、自监督学习和多智能体架构,使得原有的固定风险等级、自主性水平或人类监督模式难以有效适用。解决方案的关键在于提出维度治理(dimensional governance),通过跟踪决策权、过程自主性和问责性(3As)在人机关系中的动态分布,实现对关键治理阈值的显式监控,并在风险发生前进行前瞻性调整,从而为更具适应性的分类体系提供基础。

链接: https://arxiv.org/abs/2505.11579
作者: Zeynep Engin,David Hand
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 12 pages core text, 14 pages including references, 2 figures

点击查看摘要

Abstract:As AI systems evolve from static tools to dynamic agents, traditional categorical governance frameworks – based on fixed risk tiers, levels of autonomy, or human oversight models – are increasingly insufficient on their own. Systems built on foundation models, self-supervised learning, and multi-agent architectures increasingly blur the boundaries that categories were designed to police. In this Perspective, we make the case for dimensional governance: a framework that tracks how decision authority, process autonomy, and accountability (the 3As) distribute dynamically across human-AI relationships. A critical advantage of this approach is its ability to explicitly monitor system movement toward and across key governance thresholds, enabling preemptive adjustments before risks materialize. This dimensional approach provides the necessary foundation for more adaptive categorization, enabling thresholds and classifications that can evolve with emerging capabilities. While categories remain essential for decision-making, building them upon dimensional foundations allows for context-specific adaptability and stakeholder-responsive governance that static approaches cannot achieve. We outline key dimensions, critical trust thresholds, and practical examples illustrating where rigid categorical frameworks fail – and where a dimensional mindset could offer a more resilient and future-proof path forward for both governance and innovation at the frontier of artificial intelligence.
zh

[AI-226] Spatiotemporal Field Generation Based on Hybrid Mamba-Transformer with Physics-informed Fine-tuning

【速读】:该论文旨在解决数据驱动模型在生成时空物理场时遇到的物理方程显著偏差问题。其关键解决方案是提出一种基于混合Mamba-Transformer架构的时空物理场生成模型(HMT-PF),该模型引入非结构化网格信息作为输入,并设计了一个融合物理信息的微调模块,通过点查询机制计算物理方程残差并将其编码到潜在空间进行优化,从而有效减少物理方程偏差。此外,采用自监督学习方法在保持场特征的同时实现物理一致性。

链接: https://arxiv.org/abs/2505.11578
作者: Peimian Du,Jiabin Liu,Xiaowei Jin,Mengwang Zuo,Hui Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:This research confronts the challenge of substantial physical equation discrepancies encountered in the generation of spatiotemporal physical fields through data-driven trained models. A spatiotemporal physical field generation model, named HMT-PF, is developed based on the hybrid Mamba-Transformer architecture, incorporating unstructured grid information as input. A fine-tuning block, enhanced with physical information, is introduced to effectively reduce the physical equation discrepancies. The physical equation residuals are computed through a point query mechanism for efficient gradient evaluation, then encoded into latent space for refinement. The fine-tuning process employs a self-supervised learning approach to achieve physical consistency while maintaining essential field characteristics. Results show that the hybrid Mamba-Transformer model achieves good performance in generating spatiotemporal fields, while the physics-informed fine-tuning mechanism further reduces significant physical errors effectively. A MSE-R evaluation method is developed to assess the accuracy and realism of physical field generation.
zh

[AI-227] he Accountability Paradox: How Platform API Restrictions Undermine AI Transparency Mandates

【速读】:该论文试图解决主要社交媒体平台在应对欧盟《数字服务法案》(Digital Services Act)时出现的合规性问题,特别是数据访问受限导致算法透明性不足的挑战。解决方案的关键在于构建一个结构化的审计框架,以评估监管要求与平台实施之间的日益加剧的不一致,并识别出内容审核和算法增强方面的“审计盲区”。研究提出应通过联邦化访问模型和加强监管执行等针对性政策干预,缓解平台依赖人工智能系统所带来的“责任悖论”。

链接: https://arxiv.org/abs/2505.11577
作者: FLorian A.D. Burnat,Brittany I. Davidson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent application programming interface (API) restrictions on major social media platforms challenge compliance with the EU Digital Services Act [20], which mandates data access for algorithmic transparency. We develop a structured audit framework to assess the growing misalignment between regulatory requirements and platform implementations. Our comparative analysis of X/Twitter, Reddit, TikTok, and Meta identifies critical audit blind-spots'' where platform content moderation and algorithmic amplification remain inaccessible to independent verification. Our findings reveal an accountability paradox’': as platforms increasingly rely on AI systems, they simultaneously restrict the capacity for independent oversight. We propose targeted policy interventions aligned with the AI Risk Management Framework of the National Institute of Standards and Technology [80], emphasizing federated access models and enhanced regulatory enforcement.
zh

[AI-228] InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models

【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在量化后数学推理能力显著下降的问题,量化虽然能够降低内存占用和推理延迟,但可能导致高达69.81%的数学推理准确率下降。解决方案的关键在于通过自动化数据筛选与优化流程构建一个名为“Silver Bullet”的紧凑数据集,并利用少量精心选择的示例在单块GPU上进行短时间微调,从而有效恢复量化模型的推理性能至全精度基线水平。

链接: https://arxiv.org/abs/2505.11574
作者: Zhen Li,Yupeng Su,Songmiao Wang,Runming Yang,Congkai Xie,Aofan Liu,Ming Li,Jiannong Cao,Yuan Xie,Ngai Wong,Hongxia Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 23pages

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance on complex reasoning benchmarks such as GSM8K, MATH, and AIME. However, the substantial computational demands of these tasks pose significant challenges for real-world deployment. Model quantization has emerged as a promising approach to reduce memory footprint and inference latency by representing weights and activations with lower bit-widths. In this work, we conduct a comprehensive study of mainstream quantization methods(e.g., AWQ, GPTQ, SmoothQuant) on the most popular open-sourced models (e.g., Qwen2.5, LLaMA3 series), and reveal that quantization can degrade mathematical reasoning accuracy by up to 69.81%. To better understand this degradation, we develop an automated assignment and judgment pipeline that qualitatively categorizes failures into four error types and quantitatively identifies the most impacted reasoning capabilities. Building on these findings, we employ an automated data-curation pipeline to construct a compact “Silver Bullet” datasets. Training a quantized model on as few as 332 carefully selected examples for just 3-5 minutes on a single GPU is enough to restore its reasoning accuracy to match that of the full-precision baseline.
zh

[AI-229] ool-Aided Evolutionary LLM for Generative Policy Toward Efficient Resource Management in Wireless Federated Learning

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在动态且异构的无线环境中,由于设备选择与高维资源分配效率低下而导致的训练效率问题。传统方法依赖于领域专业知识、大量超参数调优以及高昂的交互成本,难以适应复杂网络条件。论文提出的工具辅助进化大语言模型(Tool-aided Evolutionary Large Language Model, T-ELLM)框架通过自然语言驱动的场景提示提升泛化能力,并将联合优化问题数学解耦,使设备选择策略可学习,同时将资源分配委托给凸优化工具。其关键在于引入一种高效采样的基于模型的虚拟学习环境,以捕捉设备选择与学习性能之间的关系,从而实现后续群体相对策略优化,减少对真实环境交互的依赖,降低通信开销并保持决策精度。

链接: https://arxiv.org/abs/2505.11570
作者: Chongyang Tan,Ruoqi Wen,Rongpeng Li,Zhifeng Zhao,Ekram Hossain,Honggang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables distributed model training across edge devices in a privacy-friendly manner. However, its efficiency heavily depends on effective device selection and high-dimensional resource allocation in dynamic and heterogeneous wireless environments. Conventional methods demand a confluence of domain-specific expertise, extensive hyperparameter tuning, and/or heavy interaction cost. This paper proposes a Tool-aided Evolutionary Large Language Model (T-ELLM) framework to generate a qualified policy for device selection in a wireless FL environment. Unlike conventional optimization methods, T-ELLM leverages natural language-based scenario prompts to enhance generalization across varying network conditions. The framework decouples the joint optimization problem mathematically, enabling tractable learning of device selection policies while delegating resource allocation to convex optimization tools. To improve adaptability, T-ELLM integrates a sample-efficient, model-based virtual learning environment that captures the relationship between device selection and learning performance, facilitating subsequent group relative policy optimization. This concerted approach reduces reliance on real-world interactions, minimizing communication overhead while maintaining high-fidelity decision-making. Theoretical analysis proves that the discrepancy between virtual and real environments is bounded, ensuring the advantage function learned in the virtual environment maintains a provably small deviation from real-world conditions. Experimental results demonstrate that T-ELLM outperforms benchmark methods in energy efficiency and exhibits robust adaptability to environmental changes.
zh

[AI-230] owards Adaptive Deep Learning: Model Elasticity via Prune-and-Grow CNN Architectures

【速读】:该论文旨在解决在资源受限设备上部署深度卷积神经网络(Convolutional Neural Networks, CNNs)所面临的计算需求高和架构僵化的问题。其关键解决方案是提出一种结构化的剪枝与动态重构方法,通过在单一CNN模型内创建嵌套子网络,使网络能够在运行时动态调整计算复杂度,从而在不进行重新训练的情况下切换紧凑型与全尺寸配置,实现性能与资源利用效率的平衡。

链接: https://arxiv.org/abs/2505.11569
作者: Pooja Mangal,Sudaksh Kalra,Dolly Sapra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 50 Pages, 11 figures, Preprint

点击查看摘要

Abstract:Deploying deep convolutional neural networks (CNNs) on resource-constrained devices presents significant challenges due to their high computational demands and rigid, static architectures. To overcome these limitations, this thesis explores methods for enabling CNNs to dynamically adjust their computational complexity based on available hardware resources. We introduce adaptive CNN architectures capable of scaling their capacity at runtime, thus efficiently balancing performance and resource utilization. To achieve this adaptability, we propose a structured pruning and dynamic re-construction approach that creates nested subnetworks within a single CNN model. This approach allows the network to dynamically switch between compact and full-sized configurations without retraining, making it suitable for deployment across varying hardware platforms. Experiments conducted across multiple CNN architectures including VGG-16, AlexNet, ResNet-20, and ResNet-56 on CIFAR-10 and Imagenette datasets demonstrate that adaptive models effectively maintain or even enhance performance under varying computational constraints. Our results highlight that embedding adaptability directly into CNN architectures significantly improves their robustness and flexibility, paving the way for efficient real-world deployment in diverse computational environments.
zh

[AI-231] Beyond Time: Cross-Dimensional Frequency Supervision for Time Series Forecasting

【速读】:该论文旨在解决时间序列预测中模型架构设计复杂且泛化能力不足的问题,以及传统方法对独立同分布(IID)数据假设与时间序列强相关性之间的矛盾。其解决方案的关键在于提出一种纯频域监督方法——交叉维度频域(X-Freq)损失,通过利用时间序列的信息熵高于谱熵的统计现象,强调频域中的更高确定性以提供更优的监督信号,并结合傅里叶变换和小波变换分别捕捉时间维度和通道维度的长期、短期频率变化及空间特征,最终在频域中统一计算预测与目标之间的损失。

链接: https://arxiv.org/abs/2505.11567
作者: Tianyi Shi,Zhu Meng,Yue Chen,Siyang Zheng,Fei Su,Jin Huang,Changrui Ren,Zhicheng Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series forecasting plays a crucial role in various fields, and the methods based on frequency domain analysis have become an important branch. However, most existing studies focus on the design of elaborate model architectures and are often tailored for limited datasets, still lacking universality. Besides, the assumption of independent and identically distributed (IID) data also contradicts the strong correlation of the time domain labels. To address these issues, abandoning time domain supervision, we propose a purely frequency domain supervision approach named cross-dimensional frequency (X-Freq) loss. Specifically, based on a statistical phenomenon, we first prove that the information entropy of the time series is higher than its spectral entropy, which implies higher certainty in frequency domain and thus can provide better supervision. Secondly, the Fourier Transform and the Wavelet Transform are applied to the time dimension and the channel dimension of the time series respectively, to capture the long-term and short-term frequency variations as well as the spatial configuration features. Thirdly, the loss between predictions and targets is uniformly computed in the frequency domain. Moreover, we plug-and-play incorporate X-Freq into multiple advanced forecasting models and compare on 14 real-world datasets. The experimental results demonstrate that, without making any modification to the original architectures or hyperparameters, X-Freq can improve the forecasting performance by an average of 3.3% on long-term forecasting datasets and 27.7% on short-term ones, showcasing superior generality and practicality. The code will be released publicly.
zh

[AI-232] ACSE-Eval: Can LLM s threat model real-world cloud infrastructure?

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在云部署中识别安全威胁的有效性问题,这一领域此前尚未得到充分研究。解决方案的关键在于引入AWS Cloud Security Engineering Eval(ACSE-Eval),这是一个用于评估LLMs在云安全威胁建模能力的新型数据集。该数据集包含100个生产级别的AWS部署场景,每个场景均具备详细的架构规范、基础设施即代码实现、已记录的安全漏洞以及相关的威胁建模参数,从而支持对LLMs在云环境中识别安全风险、分析攻击路径和提出缓解策略的能力进行系统性评估。

链接: https://arxiv.org/abs/2505.11565
作者: Sarthak Munshi,Swapnil Pathak,Sonam Ghatode,Thenuga Priyadarshini,Dhivya Chandramouleeswaran,Ashutosh Rana
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Submitted to the 39th Annual Conference on Neural Information Processing Systems

点击查看摘要

Abstract:While Large Language Models have shown promise in cybersecurity applications, their effectiveness in identifying security threats within cloud deployments remains unexplored. This paper introduces AWS Cloud Security Engineering Eval, a novel dataset for evaluating LLMs cloud security threat modeling capabilities. ACSE-Eval contains 100 production grade AWS deployment scenarios, each featuring detailed architectural specifications, Infrastructure as Code implementations, documented security vulnerabilities, and associated threat modeling parameters. Our dataset enables systemic assessment of LLMs abilities to identify security risks, analyze attack vectors, and propose mitigation strategies in cloud environments. Our evaluations on ACSE-Eval demonstrate that GPT 4.1 and Gemini 2.5 Pro excel at threat identification, with Gemini 2.5 Pro performing optimally in 0-shot scenarios and GPT 4.1 showing superior results in few-shot settings. While GPT 4.1 maintains a slight overall performance advantage, Claude 3.7 Sonnet generates the most semantically sophisticated threat models but struggles with threat categorization and generalization. To promote reproducibility and advance research in automated cybersecurity threat analysis, we open-source our dataset, evaluation metrics, and methodologies.
zh

[AI-233] Object-Centric Representations Improve Policy Generalization in Robot Manipulation

【速读】:该论文试图解决机器人操作策略在分布偏移下泛化能力受限的问题,具体表现为现有方法依赖的全局或密集特征容易混入任务相关与无关的场景信息。其解决方案的关键在于引入物体中心表示(object-centric representations, OCR),通过将视觉输入分割为一组明确的实体,引入更符合操作任务的归纳偏置,从而提升模型在复杂和动态环境中的泛化能力。

链接: https://arxiv.org/abs/2505.11563
作者: Alexandre Chapin(imagine),Bruno Machado(imagine),Emmanuel Dellandrea(imagine),Liming Chen(imagine)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Visual representations are central to the learning and generalization capabilities of robotic manipulation policies. While existing methods rely on global or dense features, such representations often entangle task-relevant and irrelevant scene information, limiting robustness under distribution shifts. In this work, we investigate object-centric representations (OCR) as a structured alternative that segments visual input into a finished set of entities, introducing inductive biases that align more naturally with manipulation tasks. We benchmark a range of visual encoders-object-centric, global and dense methods-across a suite of simulated and real-world manipulation tasks ranging from simple to complex, and evaluate their generalization under diverse visual conditions including changes in lighting, texture, and the presence of distractors. Our findings reveal that OCR-based policies outperform dense and global representations in generalization settings, even without task-specific pretraining. These insights suggest that OCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.
zh

[AI-234] AC-LoRA: (Almost) Training-Free Access Control-Aware Multi-Modal LLM s

【速读】:该论文试图解决企业在使用生成式 AI (Generative AI) 语言模型(LLMs)进行知识传播和管理时,因模型易泄露敏感信息而难以在需要严格访问控制的场景中应用的问题。解决方案的关键在于设计 AC-LoRA 系统,该系统通过为授权数据集及其微调的文档嵌入分别维护独立的 LoRA(Low-Rank Adaptation)适配器,并根据用户查询与适配器的相似度及权限检索出精确的适配器集合,从而实现信息隔离。在检索到多个适配器时,通过相似度分数合并响应结果,无需额外训练即可完成 LoRA 路由,从而在保证强隔离性的同时保持高性能。

链接: https://arxiv.org/abs/2505.11557
作者: Lara Magdalena Lazier,Aritra Dhar,Vasilije Stambolic,Lukas Cavigelli
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Corporate LLMs are gaining traction for efficient knowledge dissemination and management within organizations. However, as current LLMs are vulnerable to leaking sensitive information, it has proven difficult to apply them in settings where strict access control is necessary. To this end, we design AC-LoRA, an end-to-end system for access control-aware corporate LLM chatbots that maintains a strong information isolation guarantee. AC-LoRA maintains separate LoRA adapters for permissioned datasets, along with the document embedding they are finetuned on. AC-LoRA retrieves a precise set of LoRA adapters based on the similarity score with the user query and their permission. This similarity score is later used to merge the responses if more than one LoRA is retrieved, without requiring any additional training for LoRA routing. We provide an end-to-end prototype of AC-LoRA, evaluate it on two datasets, and show that AC-LoRA matches or even exceeds the performance of state-of-the-art LoRA mixing techniques while providing strong isolation guarantees. Furthermore, we show that AC-LoRA design can be directly applied to different modalities.
zh

[AI-235] GSPRec: Temporal-Aware Graph Spectral Filtering for Recommendation

【速读】:该论文旨在解决图推荐系统中存在的两个问题:过度依赖低通滤波导致用户特定信号被抑制,以及在图构建过程中忽略了序列动态性。其解决方案的关键在于提出GSPRec模型,该模型通过顺序感知的图构建整合时间转移,并在谱域中应用频域感知的滤波。GSPRec通过多跳扩散编码物品转移,从而利用对称拉普拉斯矩阵进行谱处理,并设计了双滤波机制,包括高斯带通滤波器以提取中频用户级模式和低通滤波器以保留全局趋势。

链接: https://arxiv.org/abs/2505.11552
作者: Ahmad Bin Rabiah,Julian McAuley
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph-based recommendation systems are effective at modeling collaborative patterns but often suffer from two limitations: overreliance on low-pass filtering, which suppresses user-specific signals, and omission of sequential dynamics in graph construction. We introduce GSPRec, a graph spectral model that integrates temporal transitions through sequentially-informed graph construction and applies frequency-aware filtering in the spectral domain. GSPRec encodes item transitions via multi-hop diffusion to enable the use of symmetric Laplacians for spectral processing. To capture user preferences, we design a dual-filtering mechanism: a Gaussian bandpass filter to extract mid-frequency, user-level patterns, and a low-pass filter to retain global trends. Extensive experiments on four public datasets show that GSPRec consistently outperforms baselines, with an average improvement of 6.77% in NDCG@10. Ablation studies show the complementary benefits of both sequential graph augmentation and bandpass filtering.
zh

[AI-236] One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems

【速读】:该论文试图解决知识库在检索增强生成(RAG)系统中遭受中毒攻击时的两个核心挑战:一是注入的恶意内容需与多个真实文档竞争,二是大型语言模型(LLM)倾向于信任与其内部记忆知识一致的检索信息。解决方案的关键在于提出一种名为AuthChain的新颖知识中毒攻击方法,该方法利用证据链理论和权威效应,生成具有强证据链和权威性陈述的中毒内容,从而有效克服真实文档和LLM内部知识的干扰。

链接: https://arxiv.org/abs/2505.11548
作者: Zhiyuan Chang,Xiaojun Jia,Mingyang Li,Junjie Wang,Yuekai Huang,Qing Wang,Ziyou Jiang,Yang Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 15pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have shown improved performance in generating accurate responses. However, the dependence on external knowledge bases introduces potential security vulnerabilities, particularly when these knowledge bases are publicly accessible and modifiable. Poisoning attacks on knowledge bases for RAG systems face two fundamental challenges: the injected malicious content must compete with multiple authentic documents retrieved by the retriever, and LLMs tend to trust retrieved information that aligns with their internal memorized knowledge. Previous works attempt to address these challenges by injecting multiple malicious documents, but such saturation attacks are easily detectable and impractical in real-world scenarios. To enable the effective single document poisoning attack, we propose AuthChain, a novel knowledge poisoning attack method that leverages Chain-of-Evidence theory and authority effect to craft more convincing poisoned documents. AuthChain generates poisoned content that establishes strong evidence chains and incorporates authoritative statements, effectively overcoming the interference from both authentic documents and LLMs’ internal knowledge. Extensive experiments across six popular LLMs demonstrate that AuthChain achieves significantly higher attack success rates while maintaining superior stealthiness against RAG defense mechanisms compared to state-of-the-art baselines.
zh

[AI-237] On Technique Identification and Threat-Actor Attribution using LLM s and Embedding Models

【速读】:该论文试图解决网络攻击归因(cyber-attack attribution)中的延迟问题,特别是在大规模国际事件后,手动从密集的取证文档中提取行为指标导致的效率低下。解决方案的关键在于评估大型语言模型(Large Language Models, LLMs)在基于取证文档提取行为指标(Tactics, Techniques, and Procedures, TTPs)方面的有效性,并构建一个端到端的流程,从原始威胁情报(CTI)文档到威胁行为者预测。该方法利用向量嵌入搜索识别TTP,并构建特征以训练机器学习模型进行攻击归因。

链接: https://arxiv.org/abs/2505.11547
作者: Kyla Guru,Robert J. Moss,Mykel J. Kochenderfer
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Attribution of cyber-attacks remains a complex but critical challenge for cyber defenders. Currently, manual extraction of behavioral indicators from dense forensic documentation causes significant attribution delays, especially following major incidents at the international scale. This research evaluates large language models (LLMs) for cyber-attack attribution based on behavioral indicators extracted from forensic documentation. We test OpenAI’s GPT-4 and text-embedding-3-large for identifying threat actors’ tactics, techniques, and procedures (TTPs) by comparing LLM-generated TTPs against human-generated data from MITRE ATTCK Groups. Our framework then identifies TTPs from text using vector embedding search and builds profiles to attribute new attacks for a machine learning model to learn. Key contributions include: (1) assessing off-the-shelf LLMs for TTP extraction and attribution, and (2) developing an end-to-end pipeline from raw CTI documents to threat-actor prediction. This research finds that standard LLMs generate TTP datasets with noise, resulting in a low similarity to human-generated datasets. However, the TTPs generated are similar in frequency to those within the existing MITRE datasets. Additionally, although these TTPs are different than human-generated datasets, our work demonstrates that they still prove useful for training a model that performs above baseline on attribution. Project code and files are contained here: this https URL.
zh

[AI-238] Control Invariant Sets for Neural Network Dynamical Systems and Recursive Feasibility in Model Predictive Control

【速读】:该论文旨在解决基于神经网络的动态系统建模中控制设计所面临的严格安全性和递归可行性保障难题(safety and recursive feasibility guarantees)。其关键解决方案是提出一种算法方法,用于合成针对神经网络动力学模型的控制不变集(control invariant sets),通过集合递归确保在有限迭代次数内终止,并生成闭环动力学前向不变的子集,从而保证持续的操作安全性。此外,论文还提出了将这些控制不变集集成到混合整数优化中的模型预测控制设计,以确保计算层面的安全约束遵循和递归可行性。

链接: https://arxiv.org/abs/2505.11546
作者: Xiao Li,Tianhao Wei,Changliu Liu,Anouck Girard,Ilya Kolmanovsky
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural networks are powerful tools for data-driven modeling of complex dynamical systems, enhancing predictive capability for control applications. However, their inherent nonlinearity and black-box nature challenge control designs that prioritize rigorous safety and recursive feasibility guarantees. This paper presents algorithmic methods for synthesizing control invariant sets specifically tailored to neural network based dynamical models. These algorithms employ set recursion, ensuring termination after a finite number of iterations and generating subsets in which closed-loop dynamics are forward invariant, thus guaranteeing perpetual operational safety. Additionally, we propose model predictive control designs that integrate these control invariant sets into mixed-integer optimization, with guaranteed adherence to safety constraints and recursive feasibility at the computational level. We also present a comprehensive theoretical analysis examining the properties and guarantees of the proposed methods. Numerical simulations in an autonomous driving scenario demonstrate the methods’ effectiveness in synthesizing control-invariant sets offline and implementing model predictive control online, ensuring safety and recursive feasibility.
zh

[AI-239] LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation

【速读】:该论文旨在解决在机器人-物体交互中生成准确的未来视觉状态这一挑战,尤其是在实现高质量的像素级表示方面。其解决方案的关键在于提出LaDi-WM,一个基于扩散模型的世界模型,它通过预测潜在空间(latent space)来模拟未来状态,而非直接预测像素级图像。该潜在空间与预训练的视觉基础模型(Visual Foundation Models, VFMs)对齐,包含几何特征(基于DINO)和语义特征(基于CLIP),从而使得潜在空间的演化更易于学习且更具泛化性。

链接: https://arxiv.org/abs/2505.11528
作者: Yuhang Huang,JIazhao Zhang,Shilong Zou,XInwang Liu,Ruizhen Hu,Kai Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% on the real-world scenario. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.
zh

[AI-240] OmniDrones: An Efficient and Flexible Platform for Reinforcement Learning in Drone Control

【速读】:该论文旨在解决无人机控制中强化学习(Reinforcement Learning, RL)研究缺乏高效且灵活的仿真平台的问题。解决方案的关键在于提出OmniDrones,这是一个基于Nvidia Omniverse Isaac Sim构建的开源仿真平台,采用自下而上的设计方法,支持用户在GPU并行化仿真基础上快速设计和实验多种应用场景,并提供了多种无人机模型、传感器模态、控制模式及超过10个基准任务,以促进实际无人机系统中强化学习的应用研究。

链接: https://arxiv.org/abs/2309.12825
作者: Botian Xu,Feng Gao,Chao Yu,Ruize Zhang,Yi Wu,Yu Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to IEEE RA-L

点击查看摘要

Abstract:In this work, we introduce OmniDrones, an efficient and flexible platform tailored for reinforcement learning in drone control, built on Nvidia’s Omniverse Isaac Sim. It employs a bottom-up design approach that allows users to easily design and experiment with various application scenarios on top of GPU-parallelized simulations. It also offers a range of benchmark tasks, presenting challenges ranging from single-drone hovering to over-actuated system tracking. In summary, we propose an open-sourced drone simulation platform, equipped with an extensive suite of tools for drone learning. It includes 4 drone models, 5 sensor modalities, 4 control modes, over 10 benchmark tasks, and a selection of widely used RL baselines. To showcase the capabilities of OmniDrones and to support future research, we also provide preliminary results on these benchmark tasks. We hope this platform will encourage further studies on applying RL to practical drone systems.
zh

[AI-241] From What Ifs to Insights: Counterfactuals in Causal Inference vs. Explainable AI

【速读】:该论文试图解决因果推断(Causal Inference, CI)与可解释人工智能(Explainable Artificial Intelligence, XAI)领域中反事实(counterfactual)概念在应用和解释上的差异问题。其解决方案的关键在于提出一个形式化定义,以涵盖CI与XAI中反事实的多维特性,并系统比较两者在反事实生成、评估与操作化过程中的概念与实践差异,从而探索两个领域之间的交叉融合机会。

链接: https://arxiv.org/abs/2505.13324
作者: Galit Shmueli,David Martens,Jaewon Yoo,Travis Greene
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Counterfactuals play a pivotal role in the two distinct data science fields of causal inference (CI) and explainable artificial intelligence (XAI). While the core idea behind counterfactuals remains the same in both fields–the examination of what would have happened under different circumstances–there are key differences in how they are used and interpreted. We introduce a formal definition that encompasses the multi-faceted concept of the counterfactual in CI and XAI. We then discuss how counterfactuals are used, evaluated, generated, and operationalized in CI vs. XAI, highlighting conceptual and practical differences. By comparing and contrasting the two, we hope to identify opportunities for cross-fertilization across CI and XAI.
zh

[AI-242] Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR INTERSPEECH2025

【速读】:该论文旨在解决将预训练语言模型(PLM)中的语言知识迁移至声学特征学习过程中,由于语言模态与声学模态之间的固有差异导致的表示对齐问题。现有基于最优传输(OT)的方法忽略了特征向量之间的结构关系,将其视为无序集合。该论文提出的解决方案关键在于引入图匹配最优传输(GM-OT),通过将语言和声学序列建模为结构化图,其中节点表示特征嵌入,边捕捉时间与序列关系,并最小化节点间的Wasserstein距离(WD)以及边间的Gromov-Wasserstein距离(GWD),从而得到融合的Gromov-Wasserstein距离(FGWD)公式,实现更有效的结构化对齐与知识迁移。

链接: https://arxiv.org/abs/2505.13079
作者: Xugang Lu,Peng Shen,Yu Tsao,Hisashi Kawai
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: To appear in Interspeech 2025

点击查看摘要

Abstract:Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the Wasserstein distance (WD) between linguistic and acoustic feature distributions. However, previous OT-based methods overlook structural relationships, treating feature vectors as unordered sets. To address this, we propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs. Nodes represent feature embeddings, while edges capture temporal and sequential relationships. GM-OT minimizes both WD (between nodes) and Gromov-Wasserstein distance (GWD) (between edges), leading to a fused Gromov-Wasserstein distance (FGWD) formulation. This enables structured alignment and more efficient knowledge transfer compared to existing OT-based approaches. Theoretical analysis further shows that prior OT-based methods in linguistic knowledge transfer can be viewed as a special case within our GM-OT framework. We evaluate GM-OT on Mandarin ASR using a CTC-based E2E-ASR system with a PLM for knowledge transfer. Experimental results demonstrate significant performance gains over state-of-the-art models, validating the effectiveness of our approach.
zh

[AI-243] Multi-View Wireless Sensing via Conditional Generative Learning: Framework and Model Design

【速读】:该论文试图解决基于多视角信道状态信息(multi-view CSI)的高精度目标感知问题,旨在通过融合多个基站(BS)与用户设备(UE)之间的CSI数据,提升目标形状和电磁(EM)特性重建的质量。解决方案的关键在于设计一种双部分神经网络架构,其中第一部分使用精心设计的编码器融合多视角CSI中嵌入的目标潜在特征,第二部分则利用这些特征作为条件输入,引导生成模型进行目标重建。编码器被设计为能够捕捉CSI与目标之间的物理相关性,并适应BS-UE对的数量和位置变化,同时通过引入空间位置嵌入方案来吸收CSI的视图特异性,最终采用带有加权损失的条件扩散模型生成目标点云。

链接: https://arxiv.org/abs/2505.12664
作者: Ziqing Xing,Zhaoyang Zhang,Zirui Chen,Hongning Ruan,Zhaohui Yang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted to IEEE Transactions on Wireless Communications

点击查看摘要

Abstract:In this paper, we incorporate physical knowledge into learning-based high-precision target sensing using the multi-view channel state information (CSI) between multiple base stations (BSs) and user equipment (UEs). Such kind of multi-view sensing problem can be naturally cast into a conditional generation framework. To this end, we design a bipartite neural network architecture, the first part of which uses an elaborately designed encoder to fuse the latent target features embedded in the multi-view CSI, and then the second uses them as conditioning inputs of a powerful generative model to guide the target’s reconstruction. Specifically, the encoder is designed to capture the physical correlation between the CSI and the target, and also be adaptive to the numbers and positions of BS-UE pairs. Therein the view-specific nature of CSI is assimilated by introducing a spatial positional embedding scheme, which exploits the structure of electromagnetic(EM)-wave propagation channels. Finally, a conditional diffusion model with a weighted loss is employed to generate the target’s point cloud from the fused features. Extensive numerical results demonstrate that the proposed generative multi-view (Gen-MV) sensing framework exhibits excellent flexibility and significant performance improvement on the reconstruction quality of target’s shape and EM properties.
zh

[AI-244] ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data

【速读】:该论文试图解决在单细胞ATAC测序(scATAC-seq)领域缺乏能够支持零样本高质量细胞识别和全面多组学分析的基础模型的问题。关键挑战在于scATAC-seq数据的高维度和稀疏性,以及开放染色质区域(OCRs)缺乏标准化表示。论文提出的解决方案是构建一个名为ChromFound的基础模型,其采用混合架构和基因组感知的分词方法,以有效捕捉全基因组范围内的长上下文和调控信号,从而实现对多种任务的广泛适用性和强大的跨组学迁移能力。

链接: https://arxiv.org/abs/2505.12638
作者: Yifeng Jiao,Yuchen Liu,Yu Zhang,Xin Guo,Yushuai Wu,Chen Jiang,Jiyang Li,Hongwei Zhang,Limei Han,Xin Gao,Yuan Qi,Yuan Cheng
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present \textbfChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome.
zh

[AI-245] scSiameseClu: A Siamese Clustering Framework for Interpreting single-cell RNA Sequencing Data

【速读】:该论文旨在解决单细胞RNA测序(scRNA-seq)数据中由于噪声、稀疏性和高维度带来的分析挑战,以及图神经网络(GNNs)在处理此类数据时易出现的过平滑问题。其解决方案的关键在于提出scSiameseClu框架,该框架包含三个核心步骤:双增强模块通过生物学启发的扰动提升表示鲁棒性;Siamese融合模块结合交叉相关性精炼与自适应信息融合以捕捉复杂细胞关系并缓解过平滑;最优传输聚类利用Sinkhorn距离高效对齐聚类结果与预定义比例,保持平衡。

链接: https://arxiv.org/abs/2505.12626
作者: Ping Xu,Zhiyuan Ning,Pengjiang Li,Wenhao Liu,Pengyang Wang,Jiaxu Cui,Yuanchun Zhou,Pengfei Wang
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Single-cell RNA sequencing (scRNA-seq) reveals cell heterogeneity, with cell clustering playing a key role in identifying cell types and marker genes. Recent advances, especially graph neural networks (GNNs)-based methods, have significantly improved clustering performance. However, the analysis of scRNA-seq data remains challenging due to noise, sparsity, and high dimensionality. Compounding these challenges, GNNs often suffer from over-smoothing, limiting their ability to capture complex biological information. In response, we propose scSiameseClu, a novel Siamese Clustering framework for interpreting single-cell RNA-seq data, comprising of 3 key steps: (1) Dual Augmentation Module, which applies biologically informed perturbations to the gene expression matrix and cell graph relationships to enhance representation robustness; (2) Siamese Fusion Module, which combines cross-correlation refinement and adaptive information fusion to capture complex cellular relationships while mitigating over-smoothing; and (3) Optimal Transport Clustering, which utilizes Sinkhorn distance to efficiently align cluster assignments with predefined proportions while maintaining balance. Comprehensive evaluations on seven real-world datasets demonstrate that~\methodname~outperforms state-of-the-art methods in single-cell clustering, cell type annotation, and cell type classification, providing a powerful tool for scRNA-seq data interpretation.
zh

[AI-246] Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis

【速读】:该论文试图解决基于流匹配(Flow Matching, FM)的文本到语音(Text-to-Speech, TTS)模型在生成过程中自然度不足以及推理效率低的问题。解决方案的关键在于提出一种浅层流匹配(Shallow Flow Matching, SFM)机制,通过在FM路径上构建中间状态,并利用正交投影方法自适应确定这些状态的时间位置,结合单段分段流的原理进行构造,从而在推理阶段从中间状态而非纯噪声开始生成,集中计算资源于FM路径的后期阶段,有效提升了语音合成的自然度并降低了使用自适应步长常微分方程求解器时的推理开销。

链接: https://arxiv.org/abs/2505.12226
作者: Dong Yang,Yiyi Cai,Yuki Saito,Lixu Wang,Hiroshi Saruwatari
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:We propose a shallow flow matching (SFM) mechanism to enhance flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. SFM constructs intermediate states along the FM paths using coarse output representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. The SFM inference starts from the intermediate state rather than pure noise and focuses computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments show that SFM consistently improves the naturalness of synthesized speech in both objective and subjective evaluations, while significantly reducing inference when using adaptive-step ODE solvers. Demo and codes are available at this https URL.
zh

[AI-247] Fine-Grained ECG-Text Contrastive Learning via Waveform Understanding Enhancement

【速读】:该论文试图解决现有心电图(Electrocardiogram, ECG)文本对比学习方法中报告不完整导致模型难以捕捉ECG波形特征及诊断推理的问题。其解决方案的关键在于提出FG-CLEP(Fine-Grained Contrastive Language ECG Pre-training),通过大语言模型(Large Language Models, LLMs)从不完整的报告中恢复波形特征,并引入语义相似性矩阵和基于Sigmoid的损失函数以应对幻觉问题、波形特征与诊断之间的非双射关系以及ECG任务的多标签特性。

链接: https://arxiv.org/abs/2505.11939
作者: Haitao Li,Che Liu,Zhengyao Ding,Ziyi Liu,Zhengxing Huang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Electrocardiograms (ECGs) are essential for diagnosing cardiovascular diseases. While previous ECG-text contrastive learning methods have shown promising results, they often overlook the incompleteness of the reports. Given an ECG, the report is generated by first identifying key waveform features and then inferring the final diagnosis through these features. Despite their importance, these waveform features are often not recorded in the report as intermediate results. Aligning ECGs with such incomplete reports impedes the model’s ability to capture the ECG’s waveform features and limits its understanding of diagnostic reasoning based on those features. To address this, we propose FG-CLEP (Fine-Grained Contrastive Language ECG Pre-training), which aims to recover these waveform features from incomplete reports with the help of large language models (LLMs), under the challenges of hallucinations and the non-bijective relationship between waveform features and diagnoses. Additionally, considering the frequent false negatives due to the prevalence of common diagnoses in ECGs, we introduce a semantic similarity matrix to guide contrastive learning. Furthermore, we adopt a sigmoid-based loss function to accommodate the multi-label nature of ECG-related tasks. Experiments on six datasets demonstrate that FG-CLEP outperforms state-of-the-art methods in both zero-shot prediction and linear probing across these datasets.
zh

[AI-248] Exploring the Potential of SSL Models for Sound Event Detection

【速读】:该论文旨在解决声事件检测(Sound Event Detection, SED)中自监督学习(Self-supervised Learning, SSL)模型的协同潜力未被充分挖掘的问题。其解决方案的关键在于提出一种融合异构SSL表示的框架,通过三种融合策略——单个SSL嵌入集成、双模态融合和全聚合——来优化模型选择与集成,从而提升SED性能。实验表明,双模态融合(如CRNN+BEATs+WavLM)能够实现互补性能提升,而CRNN+BEATs在单独SSL模型中表现最佳。此外,引入的归一化声事件边界框(nSEBBs)作为自适应后处理方法,进一步提升了独立SSL模型的性能。

链接: https://arxiv.org/abs/2505.11889
作者: Hanfang Cui,Longfei Song,Li Li,Dongxing Xu,Yanhua Long
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 27 pages, 5 figures, submitted to the Journal of King Saud University - Computer and Information Sciences (under review)

点击查看摘要

Abstract:Self-supervised learning (SSL) models offer powerful representations for sound event detection (SED), yet their synergistic potential remains underexplored. This study systematically evaluates state-of-the-art SSL models to guide optimal model selection and integration for SED. We propose a framework that combines heterogeneous SSL representations (e.g., BEATs, HuBERT, WavLM) through three fusion strategies: individual SSL embedding integration, dual-modal fusion, and full aggregation. Experiments on the DCASE 2023 Task 4 Challenge reveal that dual-modal fusion (e.g., CRNN+BEATs+WavLM) achieves complementary performance gains, while CRNN+BEATs alone delivers the best results among individual SSL models. We further introduce normalized sound event bounding boxes (nSEBBs), an adaptive post-processing method that dynamically adjusts event boundary predictions, improving PSDS1 by up to 4% for standalone SSL models. These findings highlight the compatibility and complementarity of SSL architectures, providing guidance for task-specific fusion and robust SED system design.
zh

[AI-249] Improving Medium Range Severe Weather Prediction through Transformer Post-processing of AI Weather Forecasts

【速读】:该论文旨在解决中短期(1-8天)极端天气预测技能不足的问题,以减少其对社会的影响。解决方案的关键在于利用仅解码器的Transformer网络对基于人工智能的天气预报进行后处理,特别是针对Pangu-Weather模型的输出。与传统方法不同,该方法将预报时效视为序列“标记”,使Transformer能够学习大气状态演变中的复杂时间关系,从而提升极端天气预报的准确性与可靠性。

链接: https://arxiv.org/abs/2505.11750
作者: Zhanxiang Hua,Ryan Sobash,David John Gagne II,Yingkai Sha,Alexandra Anderson-Frey
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:Improving the skill of medium-range (1-8 day) severe weather prediction is crucial for mitigating societal impacts. This study introduces a novel approach leveraging decoder-only transformer networks to post-process AI-based weather forecasts, specifically from the Pangu-Weather model, for improved severe weather guidance. Unlike traditional post-processing methods that use a dense neural network to predict the probability of severe weather using discrete forecast samples, our method treats forecast lead times as sequential ``tokens’', enabling the transformer to learn complex temporal relationships within the evolving atmospheric state. We compare this approach against post-processing of the Global Forecast System (GFS) using both a traditional dense neural network and our transformer, as well as configurations that exclude convective parameters to fairly evaluate the impact of using the Pangu-Weather AI model. Results demonstrate that the transformer-based post-processing significantly enhances forecast skill compared to dense neural networks. Furthermore, AI-driven forecasts, particularly Pangu-Weather initialized from high resolution analysis, exhibit superior performance to GFS in the medium-range, even without explicit convective parameters. Our approach offers improved accuracy, and reliability, which also provides interpretability through feature attribution analysis, advancing medium-range severe weather prediction capabilities.
zh

[AI-250] Programmable metasurfaces for future photonic artificial intelligence

【速读】:该论文试图解决可扩展的光子人工智能(Photonic AI)解决方案的难题,特别是如何使基于光子神经网络(PNNs)的大规模光学AI模型在商业上可行。论文指出,光子计算的优势必须超过输入输出开销的成本,才能实现商业化。其解决方案的关键在于场可编程超表面(field-programmable metasurface)技术,该技术作为关键硬件组件,能够提升PNN的可扩展性和功能,并通过可编程性或可重构性支持原位训练和适应非平稳应用场景,同时通过与电子学的共集成、三维堆叠及大规模制造进一步增强PNN的性能。

链接: https://arxiv.org/abs/2505.11659
作者: Loubnan Abou-Hamdan,Emil Marinov,Peter Wiecha,Philipp del Hougne,Tianyu Wang,Patrice Genevet
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注: Nat. Rev. Phys. (2025)

点击查看摘要

Abstract:Photonic neural networks (PNNs), which share the inherent benefits of photonic systems, such as high parallelism and low power consumption, could challenge traditional digital neural networks in terms of energy efficiency, latency, and throughput. However, producing scalable photonic artificial intelligence (AI) solutions remains challenging. To make photonic AI models viable, the scalability problem needs to be solved. Large optical AI models implemented on PNNs are only commercially feasible if the advantages of optical computation outweigh the cost of their input-output overhead. In this Perspective, we discuss how field-programmable metasurface technology may become a key hardware ingredient in achieving scalable photonic AI accelerators and how it can compete with current digital electronic technologies. Programmability or reconfigurability is a pivotal component for PNN hardware, enabling in situ training and accommodating non-stationary use cases that require fine-tuning or transfer learning. Co-integration with electronics, 3D stacking, and large-scale manufacturing of metasurfaces would significantly improve PNN scalability and functionalities. Programmable metasurfaces could address some of the current challenges that PNNs face and enable next-generation photonic AI technology.
zh

[AI-251] BioCube: A Multimodal Dataset for Biodiversity Research

【速读】:该论文试图解决生态学和生物多样性研究中对完整、详细信息的需求,以在不同尺度上研究生态系统动态。解决方案的关键在于构建一个多模态、细粒度的全球数据集BioCube,该数据集整合了通过图像、音频记录和描述的物种观测数据、环境DNA、植被指数、农业、森林、土地指标以及高分辨率气候变量,并在WGS84地理系统下进行地理空间对齐,时间跨度为2000年至2020年。

链接: https://arxiv.org/abs/2505.11568
作者: Stylianos Stasinos,Martino Mensio,Elena Lazovik,Athanasios Trantas
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted to BiDS’25, 5 pages, 1 figure

点击查看摘要

Abstract:Biodiversity research requires complete and detailed information to study ecosystem dynamics at different scales. Employing data-driven methods like Machine Learning is getting traction in ecology and more specific biodiversity, offering alternative modelling pathways. For these methods to deliver accurate results there is the need for large, curated and multimodal datasets that offer granular spatial and temporal resolutions. In this work, we introduce BioCube, a multimodal, fine-grained global dataset for ecology and biodiversity research. BioCube incorporates species observations through images, audio recordings and descriptions, environmental DNA, vegetation indices, agricultural, forest, land indicators, and high-resolution climate variables. All observations are geospatially aligned under the WGS84 geodetic system, spanning from 2000 to 2020. The dataset will become available at this https URL while the acquisition and processing code base at this https URL.
zh

[AI-252] Analysis and Resilience of the U.S. Flight Network

【速读】:该论文试图解决美国航空运输网络(U.S. Flight Network, USFN)的效率与脆弱性问题,通过复杂网络理论分析其拓扑结构对网络性能的影响。解决方案的关键在于识别网络的结构性特征,如度分布、聚类系数和模块性,并揭示其在面对目标攻击和关键枢纽失效时的脆弱性。研究发现,USFN遵循幂律分布且具有枢纽主导特性,相较于随机网络更具聚类性和模块性,但同时也表现出对主要枢纽失效的高度敏感性,因此保护关键枢纽机场是提升网络鲁棒性的核心策略。

链接: https://arxiv.org/abs/2505.11559
作者: Sushrit Kafle,Shreejan Pandey
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Investigates resilience of the U.S. flight network under node failures. Includes percolation threshold detection, cascade simulations, and community structure analysis. 9 pages, 14 figures

点击查看摘要

Abstract:Air travel is one of the most widely used transportation services in the United States. This paper analyzes the U.S. Flight Network (USFN) using complex network theory by exploring how the network’s topology contributes to its efficiency and vulnerability. This is done by examining the structural properties, degree distributions, and community structures in the network. USFN was observed to follow power-law distribution and falls under the anomalous regime, suggesting that the network is hub dominant. Compared to null networks, USFN has a higher clustering coefficient and modularity. Various percolation test revealed that USFN is vulnerable to targeted attacks and is susceptible to complete cascading failure if one of the major hubs fails. The overall results suggest that while the USFN is designed for efficiency, it is highly vulnerable to disruptions. Protecting key hub airports is important to make the network more robust and prevent large-scale failures.
zh

[AI-253] Code Retrieval for MILP Instance Generation

【速读】:该论文旨在解决混合整数线性规划(Mixed-Integer Linear Programming, MILP)求解器性能提升中数据生成效率低和泛化能力不足的问题。现有MILP实例生成方法通常需要为每个问题类别单独训练模型,并且在生成新实例时计算成本较高。论文的关键解决方案是将MILP实例生成任务重新定义为MILP代码生成任务,通过代码实现高效、灵活且可解释的实例生成。此外,论文引入了MILP-EmbedSim相似性度量,以准确评估同一问题类别下不同规模实例之间的相似性,并基于此提出MILP-Retrieval框架,从代码库中检索生成代码以生成与目标实例高度相似的MILP实例。

链接: https://arxiv.org/abs/2505.11526
作者: Tianxing Yang,Huigen Ye,Hua Xu
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixed-Integer Linear Programming (MILP) is widely used in fields such as scheduling, logistics, and planning. Enhancing the performance of MILP solvers, particularly learning-based solvers, requires substantial amounts of high-quality data. However, existing methods for MILP instance generation typically necessitate training a separate model for each problem class and are computationally intensive when generating new instances. To address these limitations, we reformulate the MILP Instance Generation task as MILP Code Generation task, enabling efficient, flexible, and interpretable instance generation through code. Since MILP instances generated from code can vary significantly in scale, we introduce MILP-EmbedSim, a new similarity metric that accurately measures the similarity between instances of varying sizes within the same problem class. Leveraging this metric, we propose MILP-Retrieval, a pipeline that retrieves generation code from library to produce MILP instances highly similar to target instance. MILP-Retrieval outperforms baselines in both MILP Code Generation and Instance Generation tasks, provides a novel perspective on MILP instance generation and opens new possibilities for learning-based solvers.
zh

[AI-254] Decentralized Traffic Flow Optimization Through Intrinsic Motivation ITSC

【速读】:该论文试图解决城市交通拥堵问题,尤其是在快速发展的超大城市中,交通拥堵问题日益严重。其解决方案的关键在于引入基于赋能原则(empowerment)的内在动机来控制自动驾驶汽车的行为,从而优化交通流。该方法在保持去中心化特性的同时,仅依赖局部信息进行决策,无需显式协调,有效提升了整体交通流量,缓解了拥堵,并减少了平均交通阻塞时间。

链接: https://arxiv.org/abs/2505.11520
作者: Himaja Papala,Daniel Polani,Stas Tiomkin
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures, Published in the Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)

点击查看摘要

Abstract:Traffic congestion has long been an ubiquitous problem that is exacerbating with the rapid growth of megacities. In this proof-of-concept work we study intrinsic motivation, implemented via the empowerment principle, to control autonomous car behavior to improve traffic flow. In standard models of traffic dynamics, self-organized traffic jams emerge spontaneously from the individual behavior of cars, affecting traffic over long distances. Our novel car behavior strategy improves traffic flow while still being decentralized and using only locally available information without explicit coordination. Decentralization is essential for various reasons, not least to be able to absorb robustly substantial levels of uncertainty. Our scenario is based on the well-established traffic dynamics model, the Nagel-Schreckenberg cellular automaton. In a fraction of the cars in this model, we substitute the default behavior by empowerment, our intrinsic motivation-based method. This proposed model significantly improves overall traffic flow, mitigates congestion, and reduces the average traffic jam time.
zh

机器学习

[LG-0] Unlocking Non-Invasive Brain-to-Text

链接: https://arxiv.org/abs/2505.13446
作者: Dulhan Jayalath,Gilad Landau,Oiwi Parker Jones
类目: Machine Learning (cs.LG)
*备注: 27 pages, 10 figures, 10 tables. Under review

点击查看摘要

Abstract:Despite major advances in surgical brain-to-text (B2T), i.e. transcribing speech from invasive brain recordings, non-invasive alternatives have yet to surpass even chance on standard metrics. This remains a barrier to building a non-invasive brain-computer interface (BCI) capable of restoring communication in paralysed individuals without surgery. Here, we present the first non-invasive B2T result that significantly exceeds these critical baselines, raising BLEU by 1.4\mathrm-2.6\times over prior work. This result is driven by three contributions: (1) we extend recent word-classification models with LLM-based rescoring, transforming single-word predictors into closed-vocabulary B2T systems; (2) we introduce a predictive in-filling approach to handle out-of-vocabulary (OOV) words, substantially expanding the effective vocabulary; and (3) we demonstrate, for the first time, how to scale non-invasive B2T models across datasets, unlocking deep learning at scale and improving accuracy by 2.1\mathrm-2.3\times . Through these contributions, we offer new insights into the roles of data quality and vocabulary size. Together, our results remove a major obstacle to realising practical non-invasive B2T systems.

[LG-1] Synthetic-Powered Predictive Inference

链接: https://arxiv.org/abs/2505.13432
作者: Meshi Bashari,Roy Maor Lotan,Yonghoon Lee,Edgar Dobriban,Yaniv Romano
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conformal prediction is a framework for predictive inference with a distribution-free, finite-sample guarantee. However, it tends to provide uninformative prediction sets when calibration data are scarce. This paper introduces Synthetic-powered predictive inference (SPPI), a novel framework that incorporates synthetic data – e.g., from a generative model – to improve sample efficiency. At the core of our method is a score transporter: an empirical quantile mapping that aligns nonconformity scores from trusted, real data with those from synthetic data. By carefully integrating the score transporter into the calibration process, SPPI provably achieves finite-sample coverage guarantees without making any assumptions about the real and synthetic data distributions. When the score distributions are well aligned, SPPI yields substantially tighter and more informative prediction sets than standard conformal prediction. Experiments on image classification and tabular regression demonstrate notable improvements in predictive efficiency in data-scarce settings.

[LG-2] Make Still Further Progress: Chain of Thoughts for Tabular Data Leaderboard

链接: https://arxiv.org/abs/2505.13421
作者: Si-Yang Liu,Qile Zhou,Han-Jia Ye
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data, a fundamental data format in machine learning, is predominantly utilized in competitions and real-world applications. The performance of tabular models–such as gradient boosted decision trees and neural networks–can vary significantly across datasets due to differences in feature distributions and task characteristics. Achieving top performance on each dataset often requires specialized expert knowledge. To address this variability, practitioners often aggregate the predictions of multiple models. However, conventional aggregation strategies typically rely on static combination rules and lack instance-level adaptability. In this work, we propose an in-context ensemble framework for tabular prediction that leverages large language models (LLMs) to perform dynamic, instance-specific integration of external model predictions. Without access to raw tabular features or semantic information, our method constructs a context around each test instance using its nearest neighbors and the predictions from a pool of external models. Within this enriched context, we introduce Chain of Tabular Thoughts (CoT ^2 ), a prompting strategy that guides LLMs through multi-step, interpretable reasoning, making still further progress toward expert-level decision-making. Experimental results show that our method outperforms well-tuned baselines and standard ensemble techniques across a wide range of tabular datasets.

[LG-3] Gluon: Making Muon Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLM s)

链接: https://arxiv.org/abs/2505.13416
作者: Artem Riabinin,Egor Shulgin,Kaja Gruntkowska,Peter Richtárik
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as \sf Muon and \sf Scion . After over a decade of \sf Adam 's dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called \sf Gluon , capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of \sf Muon and \sf Scion , and leads to convergence guarantees with strong practical predictive power. Unlike prior results, our theoretical stepsizes closely match the fine-tuned values reported by Pethick et al. (2025). Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.

[LG-4] Joint Velocity-Growth Flow Matching for Single-Cell Dynamics Modeling

链接: https://arxiv.org/abs/2505.13413
作者: Dongyi Wang,Yuanwei Jiang,Zhenyi Zhang,Xiang Gu,Peijie Zhou,Jian Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning the underlying dynamics of single cells from snapshot data has gained increasing attention in scientific and machine learning research. The destructive measurement technique and cell proliferation/death result in unpaired and unbalanced data between snapshots, making the learning of the underlying dynamics challenging. In this paper, we propose joint Velocity-Growth Flow Matching (VGFM), a novel paradigm that jointly learns state transition and mass growth of single-cell populations via flow matching. VGFM builds an ideal single-cell dynamics containing velocity of state and growth of mass, driven by a presented two-period dynamic understanding of the static semi-relaxed optimal transport, a mathematical tool that seeks the coupling between unpaired and unbalanced data. To enable practical usage, we approximate the ideal dynamics using neural networks, forming our joint velocity and growth matching framework. A distribution fitting loss is also employed in VGFM to further improve the fitting performance for snapshot data. Extensive experimental results on both synthetic and real datasets demonstrate that VGFM can capture the underlying biological dynamics accounting for mass and state variations over time, outperforming existing approaches for single-cell dynamics modeling.

[LG-5] A Dataless Reinforcement Learning Approach to Rounding Hyperplane Optimization for Max-Cut

链接: https://arxiv.org/abs/2505.13405
作者: Gabriel Malikal,Ismail Alkhouri,Alvaro Velasquez,Adam M Alessio,Saiprasad Ravishankar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Maximum Cut (MaxCut) problem is NP-Complete, and obtaining its optimal solution is NP-hard in the worst case. As a result, heuristic-based algorithms are commonly used, though their design often requires significant domain expertise. More recently, learning-based methods trained on large (un)labeled datasets have been proposed; however, these approaches often struggle with generalizability and scalability. A well-known approximation algorithm for MaxCut is the Goemans-Williamson (GW) algorithm, which relaxes the Quadratic Unconstrained Binary Optimization (QUBO) formulation into a semidefinite program (SDP). The GW algorithm then applies hyperplane rounding by uniformly sampling a random hyperplane to convert the SDP solution into binary node assignments. In this paper, we propose a training-data-free approach based on a non-episodic reinforcement learning formulation, in which an agent learns to select improved rounding hyperplanes that yield better cuts than those produced by the GW algorithm. By optimizing over a Markov Decision Process (MDP), our method consistently achieves better cuts across large-scale graphs with varying densities and degree distributions.

[LG-6] Learning by solving differential equations

链接: https://arxiv.org/abs/2505.13397
作者: Benoit Dherin,Michael Munn,Hanna Mazzawi,Michael Wunder,Sourabh Medapati,Javier Gonzalvo
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Modern deep learning algorithms use variations of gradient descent as their main learning methods. Gradient descent can be understood as the simplest Ordinary Differential Equation (ODE) solver; namely, the Euler method applied to the gradient flow differential equation. Since Euler, many ODE solvers have been devised that follow the gradient flow equation more precisely and more stably. Runge-Kutta (RK) methods provide a family of very powerful explicit and implicit high-order ODE solvers. However, these higher-order solvers have not found wide application in deep learning so far. In this work, we evaluate the performance of higher-order RK solvers when applied in deep learning, study their limitations, and propose ways to overcome these drawbacks. In particular, we explore how to improve their performance by naturally incorporating key ingredients of modern neural network optimizers such as preconditioning, adaptive learning rates, and momentum.

[LG-7] Restoration Score Distillation: From Corrupted Diffusion Pretraining to One-Step High-Quality Generation

链接: https://arxiv.org/abs/2505.13377
作者: Yasi Zhang,Tianyu Chen,Zhendong Wang,Ying Nian Wu,Mingyuan Zhou,Oscar Leong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning generative models from corrupted data is a fundamental yet persistently challenging task across scientific disciplines, particularly when access to clean data is limited or expensive. Denoising Score Distillation (DSD) \citechen2025denoising recently introduced a novel and surprisingly effective strategy that leverages score distillation to train high-fidelity generative models directly from noisy observations. Building upon this foundation, we propose \textitRestoration Score Distillation (RSD), a principled generalization of DSD that accommodates a broader range of corruption types, such as blurred, incomplete, or low-resolution images. RSD operates by first pretraining a teacher diffusion model solely on corrupted data and subsequently distilling it into a single-step generator that produces high-quality reconstructions. Empirically, RSD consistently surpasses its teacher model across diverse restoration tasks on both natural and scientific datasets. Moreover, beyond standard diffusion objectives, the RSD framework is compatible with several corruption-aware training techniques such as Ambient Tweedie, Ambient Diffusion, and its Fourier-space variant, enabling flexible integration with recent advances in diffusion modeling. Theoretically, we demonstrate that in a linear regime, RSD recovers the eigenspace of the clean data covariance matrix from linear measurements, thereby serving as an implicit regularizer. This interpretation recasts score distillation not only as a sampling acceleration technique but as a principled approach to enhancing generative performance in severely degraded data regimes.

[LG-8] Introducing Instruction-Accurate Simulators for Performance Estimation of Autotuning Workloads

链接: https://arxiv.org/abs/2505.13357
作者: Rebecca Pelke,Nils Bosbach,Lennart M. Reimann,Rainer Leupers
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accelerating Machine Learning (ML) workloads requires efficient methods due to their large optimization space. Autotuning has emerged as an effective approach for systematically evaluating variations of implementations. Traditionally, autotuning requires the workloads to be executed on the target hardware (HW). We present an interface that allows executing autotuning workloads on simulators. This approach offers high scalability when the availability of the target HW is limited, as many simulations can be run in parallel on any accessible HW. Additionally, we evaluate the feasibility of using fast instruction-accurate simulators for autotuning. We train various predictors to forecast the performance of ML workload implementations on the target HW based on simulation statistics. Our results demonstrate that the tuned predictors are highly effective. The best workload implementation in terms of actual run time on the target HW is always within the top 3 % of predictions for the tested x86, ARM, and RISC-V-based architectures. In the best case, this approach outperforms native execution on the target HW for embedded architectures when running as few as three samples on three simulators in parallel.

[LG-9] Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference ICML2025

链接: https://arxiv.org/abs/2505.13345
作者: Shuqing Luo,Pingzhi Li,Jie Peng,Hanrui Wang,Yang(Katie)Zhao, Yu (Kevin)Cao,Yu Cheng,Tianlong Chen
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by ICML2025

点击查看摘要

Abstract:Mixture-of-experts (MoE) architectures could achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them “collaborated”, which comprises 2 cases as intra- and inter-collaboration, depending on whether they are kept on the same device. Our pilot investigations reveal that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. It motivates us to strategically optimize collaborative communication for accelerated MoE training and inference, dubbed Occult. Our designs are capable of either delivering exact results with reduced communication cost or controllably minimizing the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that Occult can be faster than popular state-of-the-art inference or training frameworks (more than 1.5\times speed up across multiple tasks and models) with comparable or superior quality compared to the standard fine-tuning. Code is available at \hrefthis https URLthis https URL .

[LG-10] MRM3: Machine Readable ML Model Metadata

链接: https://arxiv.org/abs/2505.13343
作者: Andrej Čop,Blaž Bertalanič,Marko Grobelnik,Carolina Fortuna
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the complexity and number of machine learning (ML) models grows, well-documented ML models are essential for developers and companies to use or adapt them to their specific use cases. Model metadata, already present in unstructured format as model cards in online repositories such as Hugging Face, could be more structured and machine readable while also incorporating environmental impact metrics such as energy consumption and carbon footprint. Our work extends the existing State of the Art by defining a structured schema for ML model metadata focusing on machine-readable format and support for integration into a knowledge graph (KG) for better organization and querying, enabling a wider set of use cases. Furthermore, we present an example wireless localization model metadata dataset consisting of 22 models trained on 4 datasets, integrated into a Neo4j-based KG with 113 nodes and 199 relations.

[LG-11] Detect and Correct: A Selective Noise Correction Method for Learning with Noisy Labels

链接: https://arxiv.org/abs/2505.13342
作者: Yuval Grinberg,Nimrod Harel,Jacob Goldberger,Ofir Lindenbaum
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Falsely annotated samples, also known as noisy labels, can significantly harm the performance of deep learning models. Two main approaches for learning with noisy labels are global noise estimation and data filtering. Global noise estimation approximates the noise across the entire dataset using a noise transition matrix, but it can unnecessarily adjust correct labels, leaving room for local improvements. Data filtering, on the other hand, discards potentially noisy samples but risks losing valuable data. Our method identifies potentially noisy samples based on their loss distribution. We then apply a selection process to separate noisy and clean samples and learn a noise transition matrix to correct the loss for noisy samples while leaving the clean data unaffected, thereby improving the training process. Our approach ensures robust learning and enhanced model performance by preserving valuable information from noisy samples and refining the correction process. We applied our method to standard image datasets (MNIST, CIFAR-10, and CIFAR-100) and a biological scRNA-seq cell-type annotation dataset. We observed a significant improvement in model accuracy and robustness compared to traditional methods.

[LG-12] Measuring Social Influence with Networked Synthetic Control

链接: https://arxiv.org/abs/2505.13334
作者: Ho-Chun Herbert Chang
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Measuring social influence is difficult due to the lack of counter-factuals and comparisons. By combining machine learning-based modeling and network science, we present general properties of social value, a recent measure for social influence using synthetic control applicable to political behavior. Social value diverges from centrality measures on in that it relies on an external regressor to predict an output variable of interest, generates a synthetic measure of influence, then distributes individual contribution based on a social network. Through theoretical derivations, we show the properties of SV under linear regression with and without interaction, across lattice networks, power-law networks, and random graphs. A reduction in computation can be achieved for any ensemble model. Through simulation, we find that the generalized friendship paradox holds – that in certain situations, your friends have on average more influence than you do.

[LG-13] hinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately

链接: https://arxiv.org/abs/2505.13326
作者: Yuhang Wang,Youhe Jiang,Bin Cui,Fangcheng Fu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in test-time scaling suggest that Large Language Models (LLMs) can gain better capabilities by generating Chain-of-Thought reasoning (analogous to human thinking) to respond a given request, and meanwhile exploring more reasoning branches (i.e., generating multiple responses and ensembling them) can improve the final output quality. However, when incorporating the two scaling dimensions, we find that the system efficiency is dampened significantly for two reasons. Firstly, the time cost to generate the final output increases substantially as many reasoning branches would be trapped in the over-thinking dilemma, producing excessively long responses. Secondly, generating multiple reasoning branches for each request increases memory consumption, which is unsuitable for LLM serving since we can only batch a limited number of requests to process simultaneously. To address this, we present SART, a serving framework for efficient and accurate LLM reasoning. The essential idea is to manage the thinking to be short and right, rather than long. For one thing, we devise a redundant sampling with early stopping approach based on empirical observations and theoretic analysis, which increases the likelihood of obtaining short-thinking responses when sampling reasoning branches. For another, we propose to dynamically prune low-quality branches so that only right-thinking branches are maintained, reducing the memory consumption and allowing us to batch more requests. Experimental results demonstrate that SART not only improves the accuracy of LLM reasoning but also enhances the serving efficiency, outperforming existing methods by up to 28.2 times and on average 15.7 times in terms of efficiency when achieving the same level of accuracy.

[LG-14] Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning

链接: https://arxiv.org/abs/2505.13317
作者: Song-Lin Li,Rui Zhu,Yu-Feng Li,Lan-Zhe Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) alleviates the cost of data labeling process by exploiting unlabeled data, and has achieved promising results on various tasks such as image classification. Meanwhile, the Pretrain-Finetuning paradigm has garnered significant attention in recent years, and exploiting pre-trained models could also reduce the requirement of labeled data in downstream tasks. Therefore, a question naturally occurs: \emphWhen the labeled data is scarce in the target tasks, should we exploit unlabeled data or pre-trained models? To answer this question, we select pre-trained Vision-Language Models (VLMs) as representative pretrain-finetuning instances and propose \textitFew-shot SSL – a framework that enables fair comparison between these two paradigms by controlling the amount of labeled data used. Extensive experiments across various settings demonstrate that pre-trained VLMs generally outperform SSL methods in nearly all cases, except when the data has low resolution or lacks clear semantic structure. Therefore, we encourage future SSL research to compare with pre-trained models and explore deeper integration, such as using pre-trained knowledge to enhance pseudo-labeling. To support future research, we release our unified reproduction and evaluation framework. Codes are available at this https URL

[LG-15] Neural Functional: Learning Function to Scalar Maps for Neural PDE Surrogates

链接: https://arxiv.org/abs/2505.13275
作者: Anthony Zhou,Amir Barati Farimani
类目: Machine Learning (cs.LG)
*备注: 19 Pages, 7 Figures. Code and datasets are at this http URL

点击查看摘要

Abstract:Many architectures for neural PDE surrogates have been proposed in recent years, largely based on neural networks or operator learning. In this work, we derive and propose a new architecture, the Neural Functional, which learns function to scalar mappings. Its implementation leverages insights from operator learning and neural fields, and we show the ability of neural functionals to implicitly learn functional derivatives. For the first time, this allows for an extension of Hamiltonian mechanics to neural PDE surrogates by learning the Hamiltonian functional and optimizing its functional derivatives. We demonstrate that the Hamiltonian Neural Functional can be an effective surrogate model through improved stability and conserving energy-like quantities on 1D and 2D PDEs. Beyond PDEs, functionals are prevalent in physics; functional approximation and learning with its gradients may find other uses, such as in molecular dynamics or design optimization.

[LG-16] RN-F: A Novel Approach for Mitigating Contaminated Data in Large Language Models

链接: https://arxiv.org/abs/2505.13249
作者: Le Vu Anh,Dinh Duc Nha Nguyen,Phi Long Nguyen
类目: Machine Learning (cs.LG)
*备注: 12 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have become foundational in modern artificial intelligence, powering a wide range of applications from code generation and virtual assistants to scientific research and enterprise automation. However, concerns about data contamination–where test data overlaps with training data–have raised serious questions about the reliability of these applications. Despite awareness of this issue, existing methods fall short in effectively identifying or mitigating contamination. In this paper, we propose Residual-Noise Fingerprinting (RN-F), a novel framework for detecting contaminated data in LLMs. RN-F is a single-pass, gradient-free detection method that leverages residual signal patterns without introducing additional floating-point operations. Our approach is lightweight, model-agnostic, and efficient. We evaluate RN-F on multiple LLMs across various contaminated datasets and show that it consistently outperforms existing state-of-the-art methods, achieving performance improvements of up to 10.5% in contamination detection metrics.

[LG-17] Reconstructing Physics-Informed Machine Learning for Traffic Flow Modeling: a Multi-Gradient Descent and Pareto Learning Approach

链接: https://arxiv.org/abs/2505.13241
作者: Yuan-Zheng Lei,Yaobang Gong,Dianwei Chen,Yao Cheng,Xianfeng Terry Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Physics-informed machine learning (PIML) is crucial in modern traffic flow modeling because it combines the benefits of both physics-based and data-driven approaches. In conventional PIML, physical information is typically incorporated by constructing a hybrid loss function that combines data-driven loss and physics loss through linear scalarization. The goal is to find a trade-off between these two objectives to improve the accuracy of model predictions. However, from a mathematical perspective, linear scalarization is limited to identifying only the convex region of the Pareto front, as it treats data-driven and physics losses as separate objectives. Given that most PIML loss functions are non-convex, linear scalarization restricts the achievable trade-off solutions. Moreover, tuning the weighting coefficients for the two loss components can be both time-consuming and computationally challenging. To address these limitations, this paper introduces a paradigm shift in PIML by reformulating the training process as a multi-objective optimization problem, treating data-driven loss and physics loss independently. We apply several multi-gradient descent algorithms (MGDAs), including traditional multi-gradient descent (TMGD) and dual cone gradient descent (DCGD), to explore the Pareto front in this multi-objective setting. These methods are evaluated on both macroscopic and microscopic traffic flow models. In the macroscopic case, MGDAs achieved comparable performance to traditional linear scalarization methods. Notably, in the microscopic case, MGDAs significantly outperformed their scalarization-based counterparts, demonstrating the advantages of a multi-objective optimization approach in complex PIML scenarios.

[LG-18] Investigating Active Sampling for Hardness Classification with Vision-Based Tactile Sensors

链接: https://arxiv.org/abs/2505.13231
作者: Junyi Chen,Alap Kshirsagar,Frederik Heller,Mario Gómez Andreu,Boris Belousov,Tim Schneider,Lisa P. Y. Lin,Katja Doerschner,Knut Drewing,Jan Peters
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:One of the most important object properties that humans and robots perceive through touch is hardness. This paper investigates information-theoretic active sampling strategies for sample-efficient hardness classification with vision-based tactile sensors. We evaluate three probabilistic classifier models and two model-uncertainty-based sampling strategies on a robotic setup as well as on a previously published dataset of samples collected by human testers. Our findings indicate that the active sampling approaches, driven by uncertainty metrics, surpass a random sampling baseline in terms of accuracy and stability. Additionally, while in our human study, the participants achieve an average accuracy of 48.00%, our best approach achieves an average accuracy of 88.78% on the same set of objects, demonstrating the effectiveness of vision-based tactile sensors for object hardness classification.

[LG-19] Implicit bias produces neural scaling laws in learning curves from perceptrons to deep networks

链接: https://arxiv.org/abs/2505.13230
作者: Francesco D’Amico,Dario Bocchi,Matteo Negri
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Scaling laws in deep learning - empirical power-law relationships linking model performance to resource growth - have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training or on the optimal training time given the model size. In this work, we uncover a richer picture by analyzing the entire training dynamics through the lens of spectral complexity norms. We identify two novel dynamical scaling laws that govern how performance evolves during training. These laws together recover the well-known test error scaling at convergence, offering a mechanistic explanation of generalization emergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100. Furthermore, we provide analytical support using a solvable model: a single-layer perceptron trained with binary cross-entropy. In this setting, we show that the growth of spectral complexity driven by the implicit bias mirrors the generalization behavior observed at fixed norm, allowing us to connect the performance dynamics to classical learning rules in the perceptron.

[LG-20] Inferring stochastic dynamics with growth from cross-sectional data

链接: https://arxiv.org/abs/2505.13197
作者: Stephen Zhang,Suryanarayana Maddu,Xiaoje Qiu,Victor Chardès
类目: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Time-resolved single-cell omics data offers high-throughput, genome-wide measurements of cellular states, which are instrumental to reverse-engineer the processes underpinning cell fate. Such technologies are inherently destructive, allowing only cross-sectional measurements of the underlying stochastic dynamical system. Furthermore, cells may divide or die in addition to changing their molecular state. Collectively these present a major challenge to inferring realistic biophysical models. We present a novel approach, \emphunbalanced probability flow inference, that addresses this challenge for biological processes modelled as stochastic dynamics with growth. By leveraging a Lagrangian formulation of the Fokker-Planck equation, our method accurately disentangles drift from intrinsic noise and growth. We showcase the applicability of our approach through evaluation on a range of simulated and real single-cell RNA-seq datasets. Comparing to several existing methods, we find our method achieves higher accuracy while enjoying a simple two-step training scheme.

[LG-21] Interpretable Robotic Friction Learning via Symbolic Regression

链接: https://arxiv.org/abs/2505.13186
作者: Philipp Scholl,Alexander Dietrich,Sebastian Wolf,Jinoh Lee,Alin-Albu Schäffer,Gitta Kutyniok,Maged Iskandar
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately modeling the friction torque in robotic joints has long been challenging due to the request for a robust mathematical description. Traditional model-based approaches are often labor-intensive, requiring extensive experiments and expert knowledge, and they are difficult to adapt to new scenarios and dependencies. On the other hand, data-driven methods based on neural networks are easier to implement but often lack robustness, interpretability, and trustworthiness–key considerations for robotic hardware and safety-critical applications such as human-robot interaction. To address the limitations of both approaches, we propose the use of symbolic regression (SR) to estimate the friction torque. SR generates interpretable symbolic formulas similar to those produced by model-based methods while being flexible to accommodate various dynamic effects and dependencies. In this work, we apply SR algorithms to approximate the friction torque using collected data from a KUKA LWR-IV+ robot. Our results show that SR not only yields formulas with comparable complexity to model-based approaches but also achieves higher accuracy. Moreover, SR-derived formulas can be seamlessly extended to include load dependencies and other dynamic factors.

[LG-22] RIFLES: Resource-effIcient Federated LEarning via Scheduling

链接: https://arxiv.org/abs/2505.13169
作者: Sara Alosaime(University of Warwick),Arshad Jhumka(University of Leeds)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a privacy-preserving machine learning technique that allows decentralized collaborative model training across a set of distributed clients, by avoiding raw data exchange. A fundamental component of FL is the selection of a subset of clients in each round for model training by a central server. Current selection strategies are myopic in nature in that they are based on past or current interactions, often leading to inefficiency issues such as straggling clients. In this paper, we address this serious shortcoming by proposing the RIFLES approach that builds a novel availability forecasting layer to support the client selection process. We make the following contributions: (i) we formalise the sequential selection problem and reduce it to a scheduling problem and show that the problem is NP-complete, (ii) leveraging heartbeat messages from clients, RIFLES build an availability prediction layer to support (long term) selection decisions, (iii) we propose a novel adaptive selection strategy to support efficient learning and resource usage. To circumvent the inherent exponential complexity, we present RIFLES, a heuristic that leverages clients’ historical availability data by using a CNN-LSTM time series forecasting model, allowing the server to predict the optimal participation times of clients, thereby enabling informed selection decisions. By comparing against other FL techniques, we show that RIFLES provide significant improvement by between 10%-50% on a variety of metrics such as accuracy and test loss. To the best of our knowledge, it is the first work to investigate FL as a scheduling problem.

[LG-23] Zero-Shot Adaptation of Behavioral Foundation Models to Unseen Dynamics

链接: https://arxiv.org/abs/2505.13150
作者: Maksim Bobrin,Ilya Zisman,Alexander Nikulin,Vladislav Kurenkov,Dmitry Dylov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Behavioral Foundation Models (BFMs) proved successful in producing policies for arbitrary tasks in a zero-shot manner, requiring no test-time training or task-specific fine-tuning. Among the most promising BFMs are the ones that estimate the successor measure learned in an unsupervised way from task-agnostic offline data. However, these methods fail to react to changes in the dynamics, making them inefficient under partial observability or when the transition function changes. This hinders the applicability of BFMs in a real-world setting, e.g., in robotics, where the dynamics can unexpectedly change at test time. In this work, we demonstrate that Forward-Backward (FB) representation, one of the methods from the BFM family, cannot distinguish between distinct dynamics, leading to an interference among the latent directions, which parametrize different policies. To address this, we propose a FB model with a transformer-based belief estimator, which greatly facilitates zero-shot adaptation. We also show that partitioning the policy encoding space into dynamics-specific clusters, aligned with the context-embedding directions, yields additional gain in performance. These traits allow our method to respond to the dynamics observed during training and to generalize to unseen ones. Empirically, in the changing dynamics setting, our approach achieves up to a 2x higher zero-shot returns compared to the baselines for both discrete and continuous tasks.

[LG-24] Parallel Layer Normalization for Universal Approximation

链接: https://arxiv.org/abs/2505.13142
作者: Yunhao Ni,Yuhe Liu,Wenxin Sun,Yitong Tang,Yuxin Guo,Peilin Feng,Wenjun Wu,Lei Huang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 30 pages

点击查看摘要

Abstract:Universal approximation theorem (UAT) is a fundamental theory for deep neural networks (DNNs), demonstrating their powerful representation capacity to represent and approximate any function. The analyses and proofs of UAT are based on traditional network with only linear and nonlinear activation functions, but omitting normalization layers, which are commonly employed to enhance the training of modern networks. This paper conducts research on UAT of DNNs with normalization layers for the first time. We theoretically prove that an infinitely wide network – composed solely of parallel layer normalization (PLN) and linear layers – has universal approximation capacity. Additionally, we investigate the minimum number of neurons required to approximate L -Lipchitz continuous functions, with a single hidden-layer network. We compare the approximation capacity of PLN with traditional activation functions in theory. Different from the traditional activation functions, we identify that PLN can act as both activation function and normalization in deep neural networks at the same time. We also find that PLN can improve the performance when replacing LN in transformer architectures, which reveals the potential of PLN used in neural architectures.

[LG-25] Neurosymbolic Diffusion Models

链接: https://arxiv.org/abs/2505.13138
作者: Emile van Krieken,Pasquale Minervini,Edoardo Ponti,Antonio Vergari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neurosymbolic (NeSy) predictors combine neural perception with symbolic reasoning to solve tasks like visual reasoning. However, standard NeSy predictors assume conditional independence between the symbols they extract, thus limiting their ability to model interactions and uncertainty - often leading to overconfident predictions and poor out-of-distribution generalisation. To overcome the limitations of the independence assumption, we introduce neurosymbolic diffusion models (NeSyDMs), a new class of NeSy predictors that use discrete diffusion to model dependencies between symbols. Our approach reuses the independence assumption from NeSy predictors at each step of the diffusion process, enabling scalable learning while capturing symbol dependencies and uncertainty quantification. Across both synthetic and real-world benchmarks - including high-dimensional visual path planning and rule-based autonomous driving - NeSyDMs achieve state-of-the-art accuracy among NeSy predictors and demonstrate strong calibration.

[LG-26] Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation

链接: https://arxiv.org/abs/2505.13111
作者: Sungmin Cha,Kyunghyun Cho
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented–enabling smaller student models to emulate the performance of much larger teachers–the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage–a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a large-scale language modeling setup using the SmolLM2 family of models. Empirical results reveal the same precision-recall dynamics observed in simulation, where precision corresponds to sample quality and recall to distributional coverage. This precision-recall trade-off proves especially beneficial in scenarios where sample quality outweighs diversity, such as instruction tuning or downstream generation. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.

[LG-27] me series saliency maps: explaining models across multiple domains

链接: https://arxiv.org/abs/2505.13100
作者: Christodoulos Kechris,Jonathan Dan,David Atienza
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional saliency map methods, popularized in computer vision, highlight individual points (pixels) of the input that contribute the most to the model’s output. However, in time-series they offer limited insights as semantically meaningful features are often found in other domains. We introduce Cross-domain Integrated Gradients, a generalization of Integrated Gradients. Our method enables feature attributions on any domain that can be formulated as an invertible, differentiable transformation of the time domain. Crucially, our derivation extends the original Integrated Gradients into the complex domain, enabling frequency-based attributions. We provide the necessary theoretical guarantees, namely, path independence and completeness. Our approach reveals interpretable, problem-specific attributions that time-domain methods cannot capture, on three real-world tasks: wearable sensor heart rate extraction, electroencephalography-based seizure detection, and zero-shot time-series forecasting. We release an open-source Tensorflow/PyTorch library to enable plug-and-play cross-domain explainability for time-series models. These results demonstrate the ability of cross-domain integrated gradients to provide semantically meaningful insights in time-series models that are impossible with traditional time-domain saliency.

[LG-28] reatment Effect Estimation for Optimal Decision-Making

链接: https://arxiv.org/abs/2505.13092
作者: Dennis Frauen,Valentyn Melnychuk,Jonas Schweisthal,Stefan Feuerriegel
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Decision-making across various fields, such as medicine, heavily relies on conditional average treatment effects (CATEs). Practitioners commonly make decisions by checking whether the estimated CATE is positive, even though the decision-making performance of modern CATE estimators is poorly understood from a theoretical perspective. In this paper, we study optimal decision-making based on two-stage CATE estimators (e.g., DR-learner), which are considered state-of-the-art and widely used in practice. We prove that, while such estimators may be optimal for estimating CATE, they can be suboptimal when used for decision-making. Intuitively, this occurs because such estimators prioritize CATE accuracy in regions far away from the decision boundary, which is ultimately irrelevant to decision-making. As a remedy, we propose a novel two-stage learning objective that retargets the CATE to balance CATE estimation error and decision performance. We then propose a neural method that optimizes an adaptively-smoothed approximation of our learning objective. Finally, we confirm the effectiveness of our method both empirically and theoretically. In sum, our work is the first to show how two-stage CATE estimators can be adapted for optimal decision-making.

[LG-29] Orthogonal Survival Learners for Estimating Heterogeneous Treatment Effects from Time-to-Event Data

链接: https://arxiv.org/abs/2505.13072
作者: Dennis Frauen,Maresa Schröder,Konstantin Hess,Stefan Feuerriegel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:Estimating heterogeneous treatment effects (HTEs) is crucial for personalized decision-making. However, this task is challenging in survival analysis, which includes time-to-event data with censored outcomes (e.g., due to study dropout). In this paper, we propose a toolbox of novel orthogonal survival learners to estimate HTEs from time-to-event data under censoring. Our learners have three main advantages: (i) we show that learners from our toolbox are guaranteed to be orthogonal and thus come with favorable theoretical properties; (ii) our toolbox allows for incorporating a custom weighting function, which can lead to robustness against different types of low overlap, and (iii) our learners are model-agnostic (i.e., they can be combined with arbitrary machine learning models). We instantiate the learners from our toolbox using several weighting functions and, as a result, propose various neural orthogonal survival learners. Some of these coincide with existing survival learners (including survival versions of the DR- and R-learner), while others are novel and further robust w.r.t. low overlap regimes specific to the survival setting (i.e., survival overlap and censoring overlap). We then empirically verify the effectiveness of our learners for HTE estimation in different low-overlap regimes through numerical experiments. In sum, we provide practitioners with a large toolbox of learners that can be used for randomized and observational studies with censored time-to-event data.

[LG-30] OmniFC: Rethinking Federated Clustering via Lossless and Secure Distance Reconstruction

链接: https://arxiv.org/abs/2505.13071
作者: Jie Yan,Xin Liu,Zhong-Yuan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated clustering (FC) aims to discover global cluster structures across decentralized clients without sharing raw data, making privacy preservation a fundamental requirement. There are two critical challenges: (1) privacy leakage during collaboration, and (2) robustness degradation due to aggregation of proxy information from non-independent and identically distributed (Non-IID) local data, leading to inaccurate or inconsistent global clustering. Existing solutions typically rely on model-specific local proxies, which are sensitive to data heterogeneity and inherit inductive biases from their centralized counterparts, thus limiting robustness and generality. We propose Omni Federated Clustering (OmniFC), a unified and model-agnostic framework. Leveraging Lagrange coded computing, our method enables clients to share only encoded data, allowing exact reconstruction of the global distance matrix–a fundamental representation of sample relationships–without leaking private information, even under client collusion. This construction is naturally resilient to Non-IID data distributions. This approach decouples FC from model-specific proxies, providing a unified extension mechanism applicable to diverse centralized clustering methods. Theoretical analysis confirms both reconstruction fidelity and privacy guarantees, while comprehensive experiments demonstrate OmniFC’s superior robustness, effectiveness, and generality across various benchmarks compared to state-of-the-art methods. Code will be released.

[LG-31] Automatic mixed precision for optimizing gained time with constrained loss mean-squared-error based on model partition to sequential sub-graphs

链接: https://arxiv.org/abs/2505.13060
作者: Shmulik Markovich-Golan,Daniel Ohayon,Itay Niv,Yair Hanani
类目: Machine Learning (cs.LG)
*备注: Preprint, under review

点击查看摘要

Abstract:Quantization is essential for Neural Network (NN) compression, reducing model size and computational demands by using lower bit-width data types, though aggressive reduction often hampers accuracy. Mixed Precision (MP) mitigates this tradeoff by varying the numerical precision across network layers. This study focuses on automatically selecting an optimal MP configuration within Post-Training Quantization (PTQ) for inference. The first key contribution is a novel sensitivity metric derived from a first-order Taylor series expansion of the loss function as a function of quantization errors in weights and activations. This metric, based on the Mean Square Error (MSE) of the loss, is efficiently calculated per layer using high-precision forward and backward passes over a small calibration dataset. The metric is additive across layers, with low calibration memory overhead as weight optimization is unnecessary. The second contribution is an accurate hardware-aware method for predicting MP time gain by modeling it as additive for sequential sub-graphs. An algorithm partitions the model graph into sequential subgraphs, measuring time gain for each configuration using a few samples. After calibrating per-layer sensitivity and time gain, an Integer Programming (IP) problem is formulated to maximize time gain while keeping loss MSE below a set threshold. Memory gain and theoretical time gain based on Multiply and Accumulate (MAC) operations are also considered. Rigorous experiments on the Intel Gaudi 2 accelerator validate the approach on several Large Language Models (LLMs).

[LG-32] A Path to Universal Neural Cellular Automata GECCO’25

链接: https://arxiv.org/abs/2505.13058
作者: Gabriel Béna,Maxence Faldor,Dan F. M. Goodman,Antoine Cully
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注: Published in Genetic and Evolutionary Computation Conference (GECCO '25 Companion), July 14–18, 2025, Malaga, Spain. 8 Pages + References

点击查看摘要

Abstract:Cellular automata have long been celebrated for their ability to generate complex behaviors from simple, local rules, with well-known discrete models like Conway’s Game of Life proven capable of universal computation. Recent advancements have extended cellular automata into continuous domains, raising the question of whether these systems retain the capacity for universal computation. In parallel, neural cellular automata have emerged as a powerful paradigm where rules are learned via gradient descent rather than manually designed. This work explores the potential of neural cellular automata to develop a continuous Universal Cellular Automaton through training by gradient descent. We introduce a cellular automaton model, objective functions and training strategies to guide neural cellular automata toward universal computation in a continuous setting. Our experiments demonstrate the successful training of fundamental computational primitives - such as matrix multiplication and transposition - culminating in the emulation of a neural network solving the MNIST digit classification task directly within the cellular automata state. These results represent a foundational step toward realizing analog general-purpose computers, with implications for understanding universal computation in continuous dynamics and advancing the automated discovery of complex cellular automata behaviors via machine learning.

[LG-33] PPTNet: A Hybrid Periodic Pattern-Transformer Architecture for Traffic Flow Prediction and Congestion Identification

链接: https://arxiv.org/abs/2505.13047
作者: Hongrui Kou,Jingkai Li,Ziyu Wang,Zhouhang Lv,Yuxin Zhang,Cheng Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of traffic flow parameters and real time identification of congestion states are essential for the efficient operation of intelligent transportation systems. This paper proposes a Periodic Pattern Transformer Network (PPTNet) for traffic flow prediction, integrating periodic pattern extraction with the Transformer architecture, coupled with a fuzzy inference method for real-time congestion identification. Firstly, a high-precision traffic flow dataset (Traffic Flow Dataset for China’s Congested Highways and Expressways, TF4CHE) suitable for congested highway scenarios in China is constructed based on drone aerial imagery data. Subsequently, the proposed PPTNet employs Fast Fourier Transform to capture multi-scale periodic patterns and utilizes two-dimensional Inception convolutions to efficiently extract intra and inter periodic features. A Transformer decoder dynamically models temporal dependencies, enabling accurate predictions of traffic density and speed. Finally, congestion probabilities are calculated in real-time using the predicted outcomes via a Mamdani fuzzy inference-based congestion identification module. Experimental results demonstrate that the proposed PPTNet significantly outperforms mainstream traffic prediction methods in prediction accuracy, and the congestion identification module effectively identifies real-time road congestion states, verifying the superiority and practicality of the proposed method in real-world traffic scenarios. Project page: this https URL.

[LG-34] Unpacking Positional Encoding in Transformers: A Spectral Analysis of Content-Position Coupling

链接: https://arxiv.org/abs/2505.13027
作者: Zihan Gu,Han Zhang,Ruoyu Chen,Yue Hu,Hua Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Positional encoding (PE) is essential for enabling Transformers to model sequential structure. However, the mechanisms by which different PE schemes couple token content and positional information-and how these mechanisms influence model dynamics-remain theoretically underexplored. In this work, we present a unified framework that analyzes PE through the spectral properties of Toeplitz and related matrices derived from attention logits. We show that multiplicative content-position coupling-exemplified by Rotary Positional Encoding (RoPE) via a Hadamard product with a Toeplitz matrix-induces spectral contraction, which theoretically improves optimization stability and efficiency. Guided by this theory, we construct synthetic tasks that contrast content-position dependent and content-position independent settings, and evaluate a range of PE methods. Our experiments reveal strong alignment with theory: RoPE consistently outperforms other methods on position-sensitive tasks and induces “single-head deposit” patterns in early layers, indicating localized positional processing. Further analyses show that modifying the method and timing of PE coupling, such as MLA in Deepseek-V3, can effectively mitigate this concentration. These results establish explicit content-relative mixing with relative-position Toeplitz signals as a key principle for effective PE design and provide new insight into how positional structure is integrated in Transformer architectures.

[LG-35] Generative Modeling of Random Fields from Limited Data via Constrained Latent Flow Matching

链接: https://arxiv.org/abs/2505.13007
作者: James E. Warner,Tristan A. Shah,Patrick E. Leser,Geoffrey F. Bomarito,Joshua D. Pribe,Michael C. Stanley
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 10 pages plus references and appendices, 17 figures

点击查看摘要

Abstract:Deep generative models are promising tools for science and engineering, but their reliance on abundant, high-quality data limits applicability. We present a novel framework for generative modeling of random fields (probability distributions over continuous functions) that incorporates domain knowledge to supplement limited, sparse, and indirect data. The foundation of the approach is latent flow matching, where generative modeling occurs on compressed function representations in the latent space of a pre-trained variational autoencoder (VAE). Innovations include the adoption of a function decoder within the VAE and integration of physical/statistical constraints into the VAE training process. In this way, a latent function representation is learned that yields continuous random field samples satisfying domain-specific constraints when decoded, even in data-limited regimes. Efficacy is demonstrated on two challenging applications: wind velocity field reconstruction from sparse sensors and material property inference from a limited number of indirect measurements. Results show that the proposed framework achieves significant improvements in reconstruction accuracy compared to unconstrained methods and enables effective inference with relatively small training datasets that is intractable without constraints.

[LG-36] Optimal Formats for Weight Quantisation

链接: https://arxiv.org/abs/2505.12988
作者: Douglas Orr,Luka Ribar,Carlo Luschi
类目: Machine Learning (cs.LG)
*备注: 35 pages, 32 figures

点击查看摘要

Abstract:Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and the formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. Framing the optimisation problem as minimising the KL divergence between the original and quantised model outputs, the objective is aligned with minimising the squared quantisation error of the model parameters. We therefore develop and evaluate squared-error-optimal formats for known distributions, observing significant improvement of variable-length codes over fixed-length codes. Uniform quantisation followed by lossless compression with a variable-length code is shown to be optimal. However, we find that commonly used block formats and sparse outlier formats also outperform fixed-length codes, implying they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model’s layers, saving up to 0.25 bits per parameter when tested with direct-cast quantisation of language models.

[LG-37] Multi-parameter Control for the (1(λλ))-GA on OneMax via Deep Reinforcement Learning

链接: https://arxiv.org/abs/2505.12982
作者: Tai Nguyen,Phong Le,Carola Doerr,Nguyen Dang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:It is well known that evolutionary algorithms can benefit from dynamic choices of the key parameters that control their behavior, to adjust their search strategy to the different stages of the optimization process. A prominent example where dynamic parameter choices have shown a provable super-constant speed-up is the (1+(\lambda,\lambda)) Genetic Algorithm optimizing the OneMax function. While optimal parameter control policies result in linear expected running times, this is not possible with static parameter choices. This result has spurred a lot of interest in parameter control policies. However, many works, in particular theoretical running time analyses, focus on controlling one single parameter. Deriving policies for controlling multiple parameters remains very challenging. In this work we reconsider the problem of the (1+(\lambda,\lambda)) Genetic Algorithm optimizing OneMax. We decouple its four main parameters and investigate how well state-of-the-art deep reinforcement learning techniques can approximate good control policies. We show that although making deep reinforcement learning learn effectively is a challenging task, once it works, it is very powerful and is able to find policies that outperform all previously known control policies on the same benchmark. Based on the results found through reinforcement learning, we derive a simple control policy that consistently outperforms the default theory-recommended setting by 27% and the irace-tuned policy, the strongest existing control policy on this benchmark, by 13% , for all tested problem sizes up to 40,000 .

[LG-38] Augmented Regression Models using Neurochaos Learning

链接: https://arxiv.org/abs/2505.12967
作者: Akhila Henry,Nithin Nagaraj
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:This study presents novel Augmented Regression Models using Neurochaos Learning (NL), where Tracemean features derived from the Neurochaos Learning framework are integrated with traditional regression algorithms : Linear Regression, Ridge Regression, Lasso Regression, and Support Vector Regression (SVR). Our approach was evaluated using ten diverse real-life datasets and a synthetically generated dataset of the form y = mx + c + \epsilon . Results show that incorporating the Tracemean feature (mean of the chaotic neural traces of the neurons in the NL architecture) significantly enhances regression performance, particularly in Augmented Lasso Regression and Augmented SVR, where six out of ten real-life datasets exhibited improved predictive accuracy. Among the models, Augmented Chaotic Ridge Regression achieved the highest average performance boost (11.35 %). Additionally, experiments on the simulated dataset demonstrated that the Mean Squared Error (MSE) of the augmented models consistently decreased and converged towards the Minimum Mean Squared Error (MMSE) as the sample size increased. This work demonstrates the potential of chaos-inspired features in regression tasks, offering a pathway to more accurate and computationally efficient prediction models.

[LG-39] LoD: Loss-difference OOD Detection by Intentionally Label-Noisifying Unlabeled Wild Data IJCAI2025

链接: https://arxiv.org/abs/2505.12952
作者: Chuanxing Geng,Qifei Li,Xinrui Wang,Dong Liang,Songcan Chen,Pong C. Yuen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by IJCAI2025

点击查看摘要

Abstract:Using unlabeled wild data containing both in-distribution (ID) and out-of-distribution (OOD) data to improve the safety and reliability of models has recently received increasing attention. Existing methods either design customized losses for labeled ID and unlabeled wild data then perform joint optimization, or first filter out OOD data from the latter then learn an OOD detector. While achieving varying degrees of success, two potential issues remain: (i) Labeled ID data typically dominates the learning of models, inevitably making models tend to fit OOD data as IDs; (ii) The selection of thresholds for identifying OOD data in unlabeled wild data usually faces dilemma due to the unavailability of pure OOD samples. To address these issues, we propose a novel loss-difference OOD detection framework (LoD) by \textitintentionally label-noisifying unlabeled wild data. Such operations not only enable labeled ID data and OOD data in unlabeled wild data to jointly dominate the models’ learning but also ensure the distinguishability of the losses between ID and OOD samples in unlabeled wild data, allowing the classic clustering technique (e.g., K-means) to filter these OOD samples without requiring thresholds any longer. We also provide theoretical foundation for LoD’s viability, and extensive experiments verify its superiority.

[LG-40] Multi-Level Monte Carlo Training of Neural Operators

链接: https://arxiv.org/abs/2505.12940
作者: James Rowbottom,Stefania Fresca,Pietro Lio,Carola-Bibiane Schönlieb,Nicolas Boullé
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:Operator learning is a rapidly growing field that aims to approximate nonlinear operators related to partial differential equations (PDEs) using neural operators. These rely on discretization of input and output functions and are, usually, expensive to train for large-scale problems at high-resolution. Motivated by this, we present a Multi-Level Monte Carlo (MLMC) approach to train neural operators by leveraging a hierarchy of resolutions of function dicretization. Our framework relies on using gradient corrections from fewer samples of fine-resolution data to decrease the computational cost of training while maintaining a high level accuracy. The proposed MLMC training procedure can be applied to any architecture accepting multi-resolution data. Our numerical experiments on a range of state-of-the-art models and test-cases demonstrate improved computational efficiency compared to traditional single-resolution training approaches, and highlight the existence of a Pareto curve between accuracy and computational time, related to the number of samples per resolution.

[LG-41] RGNMR: A Gauss-Newton method for robust matrix completion with theoretical guarantees

链接: https://arxiv.org/abs/2505.12919
作者: Eilon Vaknin Laufer,Boaz Nadler
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recovering a low rank matrix from a subset of its entries, some of which may be corrupted, is known as the robust matrix completion (RMC) problem. Existing RMC methods have several limitations: they require a relatively large number of observed entries; they may fail under overparametrization, when their assumed rank is higher than the correct one; and many of them fail to recover even mildly ill-conditioned matrices. In this paper we propose a novel RMC method, denoted \textttRGNMR , which overcomes these limitations. \textttRGNMR is a simple factorization-based iterative algorithm, which combines a Gauss-Newton linearization with removal of entries suspected to be outliers. On the theoretical front, we prove that under suitable assumptions, \textttRGNMR is guaranteed exact recovery of the underlying low rank matrix. Our theoretical results improve upon the best currently known for factorization-based methods. On the empirical front, we show via several simulations the advantages of \textttRGNMR over existing RMC methods, and in particular its ability to handle a small number of observed entries, overparameterization of the rank and ill-conditioned matrices.

[LG-42] mporal Query Network for Efficient Multivariate Time Series Forecasting ICML2025

链接: https://arxiv.org/abs/2505.12917
作者: Shengsheng Lin,Haojun Chen,Haijie Wu,Chunyun Qiu,Weiwei Lin
类目: Machine Learning (cs.LG)
*备注: ICML 2025

点击查看摘要

Abstract:Sufficiently modeling the correlations among variables (aka channels) is crucial for achieving accurate multivariate time series forecasting (MTSF). In this paper, we propose a novel technique called Temporal Query (TQ) to more effectively capture multivariate correlations, thereby improving model performance in MTSF tasks. Technically, the TQ technique employs periodically shifted learnable vectors as queries in the attention mechanism to capture global inter-variable patterns, while the keys and values are derived from the raw input data to encode local, sample-level correlations. Building upon the TQ technique, we develop a simple yet efficient model named Temporal Query Network (TQNet), which employs only a single-layer attention mechanism and a lightweight multi-layer perceptron (MLP). Extensive experiments demonstrate that TQNet learns more robust multivariate correlations, achieving state-of-the-art forecasting accuracy across 12 challenging real-world datasets. Furthermore, TQNet achieves high efficiency comparable to linear-based methods even on high-dimensional datasets, balancing performance and computational cost. The code is available at: this https URL.

[LG-43] Active Learning on Synthons for Molecular Design ICLR2025

链接: https://arxiv.org/abs/2505.12913
作者: Tom George Grigg,Mason Burlage,Oliver Brook Scott,Adam Taouil,Dominique Sydow,Liam Wilbraham
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 14 pages, 10 figures. Presented at ICLR 2025 GEM Workshop

点击查看摘要

Abstract:Exhaustive virtual screening is highly informative but often intractable against the expensive objective functions involved in modern drug discovery. This problem is exacerbated in combinatorial contexts such as multi-vector expansion, where molecular spaces can quickly become ultra-large. Here, we introduce Scalable Active Learning via Synthon Acquisition (SALSA): a simple algorithm applicable to multi-vector expansion which extends pool-based active learning to non-enumerable spaces by factoring modeling and acquisition over synthon or fragment choices. Through experiments on ligand- and structure-based objectives, we highlight SALSA’s sample efficiency, and its ability to scale to spaces of trillions of compounds. Further, we demonstrate application toward multi-parameter objective design tasks on three protein targets - finding SALSA-generated molecules have comparable chemical property profiles to known bioactives, and exhibit greater diversity and higher scores over an industry-leading generative approach.

[LG-44] Efficient training for large-scale optical neural network using an evolutionary strategy and attention pruning

链接: https://arxiv.org/abs/2505.12906
作者: Zhiwei Yang,Zeyang Fan,Yihang Lai,Qi Chen,Tian Zhang,Jian Dai,Kun Xu
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:MZI-based block optical neural networks (BONNs), which can achieve large-scale network models, have increasingly drawn attentions. However, the robustness of the current training algorithm is not high enough. Moreover, large-scale BONNs usually contain numerous trainable parameters, resulting in expensive computation and power consumption. In this article, by pruning matrix blocks and directly optimizing the individuals in population, we propose an on-chip covariance matrix adaptation evolution strategy and attention-based pruning (CAP) algorithm for large-scale BONNs. The calculated results demonstrate that the CAP algorithm can prune 60% and 80% of the parameters for MNIST and Fashion-MNIST datasets, respectively, while only degrades the performance by 3.289% and 4.693%. Considering the influence of dynamic noise in phase shifters, our proposed CAP algorithm (performance degradation of 22.327% for MNIST dataset and 24.019% for Fashion-MNIST dataset utilizing a poor fabricated chip and electrical control with a standard deviation of 0.5) exhibits strongest robustness compared with both our previously reported block adjoint training algorithm (43.963% and 41.074%) and the covariance matrix adaptation evolution strategy (25.757% and 32.871%), respectively. Moreover, when 60% of the parameters are pruned, the CAP algorithm realizes 88.5% accuracy in experiment for the simplified MNIST dataset, which is similar to the simulation result without noise (92.1%). Additionally, we simulationally and experimentally demonstrate that using MZIs with only internal phase shifters to construct BONNs is an efficient way to reduce both the system area and the required trainable parameters. Notably, our proposed CAP algorithm show excellent potential for larger-scale network models and more complex tasks.

[LG-45] Power Allocation for Delay Optimization in Device-to-Device Networks: A Graph Reinforcement Learning Approach

链接: https://arxiv.org/abs/2505.12902
作者: Hao Fang,Kai Huang,Hao Ye,Chongtao Guo,Le Liang,Xiao Li,Shi Jin
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The pursuit of rate maximization in wireless communication frequently encounters substantial challenges associated with user fairness. This paper addresses these challenges by exploring a novel power allocation approach for delay optimization, utilizing graph neural networks (GNNs)-based reinforcement learning (RL) in device-to-device (D2D) communication. The proposed approach incorporates not only channel state information but also factors such as packet delay, the number of backlogged packets, and the number of transmitted packets into the components of the state information. We adopt a centralized RL method, where a central controller collects and processes the state information. The central controller functions as an agent trained using the proximal policy optimization (PPO) algorithm. To better utilize topology information in the communication network and enhance the generalization of the proposed method, we embed GNN layers into both the actor and critic networks of the PPO algorithm. This integration allows for efficient parameter updates of GNNs and enables the state information to be parameterized as a low-dimensional embedding, which is leveraged by the agent to optimize power allocation strategies. Simulation results demonstrate that the proposed method effectively reduces average delay while ensuring user fairness, outperforms baseline methods, and exhibits scalability and generalization capability.

[LG-46] heoretical Investigation on Inductive Bias of Isolation Forest

链接: https://arxiv.org/abs/2505.12825
作者: Qin-Cheng Zheng,Shao-Qun Zhang,Shen-Huan Lyu,Yuan Jiang,Zhi-Hua Zhou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector valued for its exceptional runtime efficiency and performance on large-scale tasks. Despite its widespread adoption, a theoretical foundation explaining iForest’s success remains unclear. This paper theoretically investigates the conditions and extent of iForest’s effectiveness by analyzing its inductive bias through the formulation of depth functions and growth processes. Since directly analyzing the depth function proves intractable due to iForest’s random splitting mechanism, we model the growth process of iForest as a random walk, enabling us to derive the expected depth function using transition probabilities. Our case studies reveal key inductive biases: iForest exhibits lower sensitivity to central anomalies while demonstrating greater parameter adaptability compared to k -Nearest Neighbor anomaly detectors. Our study provides theoretical understanding of the effectiveness of iForest and establishes a foundation for further theoretical exploration.

[LG-47] Koopman Autoencoders Learn Neural Representation Dynamics

链接: https://arxiv.org/abs/2505.12809
作者: Nishant Suresh Aswani,Saif Eddin Jabari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores a simple question: can we model the internal transformations of a neural network using dynamical systems theory? We introduce Koopman autoencoders to capture how neural representations evolve through network layers, treating these representations as states in a dynamical system. Our approach learns a surrogate model that predicts how neural representations transform from input to output, with two key advantages. First, by way of lifting the original states via an autoencoder, it operates in a linear space, making editing the dynamics straightforward. Second, it preserves the topologies of the original representations by regularizing the autoencoding objective. We demonstrate that these surrogate models naturally replicate the progressive topological simplification observed in neural networks. As a practical application, we show how our approach enables targeted class unlearning in the Yin-Yang and MNIST classification tasks.

[LG-48] Unlearning for Federated Online Learning to Rank: A Reproducibility Study SIGIR2025

链接: https://arxiv.org/abs/2505.12791
作者: Yiling Tao,Shuyi Wang,Jiaxi Yang,Guido Zuccon
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at SIGIR2025

点击查看摘要

Abstract:This paper reports on findings from a comparative study on the effectiveness and efficiency of federated unlearning strategies within Federated Online Learning to Rank (FOLTR), with specific attention to systematically analysing the unlearning capabilities of methods in a verifiable manner. Federated approaches to ranking of search results have recently garnered attention to address users privacy concerns. In FOLTR, privacy is safeguarded by collaboratively training ranking models across decentralized data sources, preserving individual user data while optimizing search results based on implicit feedback, such as clicks. Recent legislation introduced across numerous countries is establishing the so called “the right to be forgotten”, according to which services based on machine learning models like those in FOLTR should provide capabilities that allow users to remove their own data from those used to train models. This has sparked the development of unlearning methods, along with evaluation practices to measure whether unlearning of a user data successfully occurred. Current evaluation practices are however often controversial, necessitating the use of multiple metrics for a more comprehensive assessment – but previous proposals of unlearning methods only used single evaluation metrics. This paper addresses this limitation: our study rigorously assesses the effectiveness of unlearning strategies in managing both under-unlearning and over-unlearning scenarios using adapted, and newly proposed evaluation metrics. Thanks to our detailed analysis, we uncover the strengths and limitations of five unlearning strategies, offering valuable insights into optimizing federated unlearning to balance data privacy and system performance within FOLTR. We publicly release our code and complete results at this https URL. Comments: Accepted at SIGIR2025 Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2505.12791 [cs.IR] (or arXiv:2505.12791v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2505.12791 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Shuyi Wang [view email] [v1] Mon, 19 May 2025 07:23:46 UTC (4,607 KB)

[LG-49] Your Offline Policy is Not Trustworthy: Bilevel Reinforcement Learning for Sequential Portfolio Optimization

链接: https://arxiv.org/abs/2505.12759
作者: Haochen Yuan,Minting Pan,Yunbo Wang,Siyu Gao,Philip S.Yu,Xiaokang Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has shown significant promise for sequential portfolio optimization tasks, such as stock trading, where the objective is to maximize cumulative returns while minimizing risks using historical data. However, traditional RL approaches often produce policies that merely memorize the optimal yet impractical buying and selling behaviors within the fixed dataset. These offline policies are less generalizable as they fail to account for the non-stationary nature of the market. Our approach, MetaTrader, frames portfolio optimization as a new type of partial-offline RL problem and makes two technical contributions. First, MetaTrader employs a bilevel learning framework that explicitly trains the RL agent to improve both in-domain profits on the original dataset and out-of-domain performance across diverse transformations of the raw financial data. Second, our approach incorporates a new temporal difference (TD) method that approximates worst-case TD estimates from a batch of transformed TD targets, addressing the value overestimation issue that is particularly challenging in scenarios with limited offline data. Our empirical results on two public stock datasets show that MetaTrader outperforms existing methods, including both RL-based approaches and traditional stock prediction models.

[LG-50] ProDS: Preference-oriented Data Selection for Instruction Tuning

链接: https://arxiv.org/abs/2505.12754
作者: Wenya Guo,Zhengkun Zhang,Xumeng Liu,Ying Zhang,Ziyu Lu,Haoze Zhu,Xubo Liu,Ruxue Yan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Instruction data selection aims to identify a high-quality subset from the training set that matches or exceeds the performance of the full dataset on target tasks. Existing methods focus on the instruction-to-response mapping, but neglect the human preference for diverse responses. In this paper, we propose Preference-oriented Data Selection method (ProDS) that scores training samples based on their alignment with preferences observed in the target set. Our key innovation lies in shifting the data selection criteria from merely estimating features for accurate response generation to explicitly aligning training samples with human preferences in target tasks. Specifically, direct preference optimization (DPO) is employed to estimate human preferences across diverse responses. Besides, a bidirectional preference synthesis strategy is designed to score training samples according to both positive preferences and negative preferences. Extensive experimental results demonstrate our superiority to existing task-agnostic and targeted methods.

[LG-51] Deep Unfolding with Kernel-based Quantization in MIMO Detection ICML

链接: https://arxiv.org/abs/2505.12736
作者: Zeyi Ren,Jingreng Lei,Yichen Jin,Ermo Hua,Qingfeng Lin,Chen Zhang,Bowen Zhou,Yik-Chung Wu
类目: Machine Learning (cs.LG)
*备注: submitted to ICML ML4Wireless workshop

点击查看摘要

Abstract:The development of edge computing places critical demands on energy-efficient model deployment for multiple-input multiple-output (MIMO) detection tasks. Deploying deep unfolding models such as PGD-Nets and ADMM-Nets into resource-constrained edge devices using quantization methods is challenging. Existing quantization methods based on quantization aware training (QAT) suffer from performance degradation due to their reliance on parametric distribution assumption of activations and static quantization step sizes. To address these challenges, this paper proposes a novel kernel-based adaptive quantization (KAQ) framework for deep unfolding networks. By utilizing a joint kernel density estimation (KDE) and maximum mean discrepancy (MMD) approach to align activation distributions between full-precision and quantized models, the need for prior distribution assumptions is eliminated. Additionally, a dynamic step size updating method is introduced to adjust the quantization step size based on the channel conditions of wireless networks. Extensive simulations demonstrate that the accuracy of proposed KAQ framework outperforms traditional methods and successfully reduces the model’s inference latency.

[LG-52] Identifiability of Nonnegative Tucker Decompositions – Part I: Theory

链接: https://arxiv.org/abs/2505.12713
作者: Subhayan Saha,Giovanni Barbarino,Nicolas Gillis
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: 40 pages, 2 figures

点击查看摘要

Abstract:Tensor decompositions have become a central tool in data science, with applications in areas such as data analysis, signal processing, and machine learning. A key property of many tensor decompositions, such as the canonical polyadic decomposition, is identifiability: the factors are unique, up to trivial scaling and permutation ambiguities. This allows one to recover the groundtruth sources that generated the data. The Tucker decomposition (TD) is a central and widely used tensor decomposition model. However, it is in general not identifiable. In this paper, we study the identifiability of the nonnegative TD (nTD). By adapting and extending identifiability results of nonnegative matrix factorization (NMF), we provide uniqueness results for nTD. Our results require the nonnegative matrix factors to have some degree of sparsity (namely, satisfy the separability condition, or the sufficiently scattered condition), while the core tensor only needs to have some slices (or linear combinations of them) or unfoldings with full column rank (but does not need to be nonnegative). Under such conditions, we derive several procedures, using either unfoldings or slices of the input tensor, to obtain identifiable nTDs by minimizing the volume of unfoldings or slices of the core tensor.

[LG-53] Confidence-Regulated Generative Diffusion Models for Reliable AI Agent Migration in Vehicular Metaverses

链接: https://arxiv.org/abs/2505.12710
作者: Yingkai Kang,Jiawen Kang,Jinbo Wen,Tao Zhang,Zhaohui Yang,Dusit Niyato,Yan Zhang
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Vehicular metaverses are an emerging paradigm that merges intelligent transportation systems with virtual spaces, leveraging advanced digital twin and Artificial Intelligence (AI) technologies to seamlessly integrate vehicles, users, and digital environments. In this paradigm, vehicular AI agents are endowed with environment perception, decision-making, and action execution capabilities, enabling real-time processing and analysis of multi-modal data to provide users with customized interactive services. Since vehicular AI agents require substantial resources for real-time decision-making, given vehicle mobility and network dynamics conditions, the AI agents are deployed in RoadSide Units (RSUs) with sufficient resources and dynamically migrated among them. However, AI agent migration requires frequent data exchanges, which may expose vehicular metaverses to potential cyber attacks. To this end, we propose a reliable vehicular AI agent migration framework, achieving reliable dynamic migration and efficient resource scheduling through cooperation between vehicles and RSUs. Additionally, we design a trust evaluation model based on the theory of planned behavior to dynamically quantify the reputation of RSUs, thereby better accommodating the personalized trust preferences of users. We then model the vehicular AI agent migration process as a partially observable markov decision process and develop a Confidence-regulated Generative Diffusion Model (CGDM) to efficiently generate AI agent migration decisions. Numerical results demonstrate that the CGDM algorithm significantly outperforms baseline methods in reducing system latency and enhancing robustness against cyber attacks.

[LG-54] Pave Your Own Path: Graph Gradual Domain Adaptation on Fused Gromov-Wasserstein Geodesics

链接: https://arxiv.org/abs/2505.12709
作者: Zhichen Zeng,Ruizhong Qiu,Wenxuan Bao,Tianxin Wei,Xiao Lin,Yuchen Yan,Tarek F. Abdelzaher,Jiawei Han,Hanghang Tong
类目: Machine Learning (cs.LG)
*备注: 27 pages, 10 figures

点击查看摘要

Abstract:Graph neural networks, despite their impressive performance, are highly vulnerable to distribution shifts on graphs. Existing graph domain adaptation (graph DA) methods often implicitly assume a \textitmild shift between source and target graphs, limiting their applicability to real-world scenarios with \textitlarge shifts. Gradual domain adaptation (GDA) has emerged as a promising approach for addressing large shifts by gradually adapting the source model to the target domain via a path of unlabeled intermediate domains. Existing GDA methods exclusively focus on independent and identically distributed (IID) data with a predefined path, leaving their extension to \textitnon-IID graphs without a given path an open challenge. To bridge this gap, we present Gadget, the first GDA framework for non-IID graph data. First (\textittheoretical foundation), the Fused Gromov-Wasserstein (FGW) distance is adopted as the domain discrepancy for non-IID graphs, based on which, we derive an error bound revealing that the target domain error is proportional to the length of the path. Second (\textitoptimal path), guided by the error bound, we identify the FGW geodesic as the optimal path, which can be efficiently generated by our proposed algorithm. The generated path can be seamlessly integrated with existing graph DA methods to handle large shifts on graphs, improving state-of-the-art graph DA methods by up to 6.8% in node classification accuracy on real-world datasets.

[LG-55] RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations

链接: https://arxiv.org/abs/2505.12686
作者: Seungmin Kim,Sohee Park,Donghyun Kim,Jisu Lee,Daeseon Choi
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:With the advancement of AI-based speech synthesis technologies such as Deep Voice, there is an increasing risk of voice spoofing attacks, including voice phishing and fake news, through unauthorized use of others’ voices. Existing defenses that inject adversarial perturbations directly into audio signals have limited effectiveness, as these perturbations can easily be neutralized by speech enhancement methods. To overcome this limitation, we propose RoVo (Robust Voice), a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals, reconstructing them into protected speech. This approach effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models, which represent a secondary attack threat. In extensive experiments, RoVo increased the Defense Success Rate (DSR) by over 70% compared to unprotected speech, across four state-of-the-art speech synthesis models. Specifically, RoVo achieved a DSR of 99.5% on a commercial speaker-verification API, effectively neutralizing speech synthesis attack. Moreover, RoVo’s perturbations remained robust even under strong speech enhancement conditions, outperforming traditional methods. A user study confirmed that RoVo preserves both naturalness and usability of protected speech, highlighting its effectiveness in complex and evolving threat scenarios. Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS) Cite as: arXiv:2505.12686 [cs.LG] (or arXiv:2505.12686v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.12686 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-56] DimGrow: Memory-Efficient Field-level Embedding Dimension Search

链接: https://arxiv.org/abs/2505.12683
作者: Yihong Huang,Chen Chu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Key feature fields need bigger embedding dimensionality, others need smaller. This demands automated dimension allocation. Existing approaches, such as pruning or Neural Architecture Search (NAS), require training a memory-intensive SuperNet that enumerates all possible dimension combinations, which is infeasible for large feature spaces. We propose DimGrow, a lightweight approach that eliminates the SuperNet requirement. Starting training model from one dimension per feature field, DimGrow can progressively expand/shrink dimensions via importance scoring. Dimensions grow only when their importance consistently exceed a threshold, ensuring memory efficiency. Experiments on three recommendation datasets verify the effectiveness of DimGrow while it reduces training memory compared to SuperNet-based methods.

[LG-57] RoFL: Robust Fingerprinting of Language Models

链接: https://arxiv.org/abs/2505.12682
作者: Yun-Yun Tsai,Chuan Guo,Junfeng Yang,Laurens van der Maaten
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:AI developers are releasing large language models (LLMs) under a variety of different licenses. Many of these licenses restrict the ways in which the models or their outputs may be used. This raises the question how license violations may be recognized. In particular, how can we identify that an API or product uses (an adapted version of) a particular LLM? We present a new method that enable model developers to perform such identification via fingerprints: statistical patterns that are unique to the developer’s model and robust to common alterations of that model. Our method permits model identification in a black-box setting using a limited number of queries, enabling identification of models that can only be accessed via an API or product. The fingerprints are non-invasive: our method does not require any changes to the model during training, hence by design, it does not impact model quality. Empirically, we find our method provides a high degree of robustness to common changes in the model or inference settings. In our experiments, it substantially outperforms prior art, including invasive methods that explicitly train watermarks into the model.

[LG-58] ransferTraj: A Vehicle Trajectory Learning Model for Region and Task Transferability

链接: https://arxiv.org/abs/2505.12672
作者: Tonglong Wei,Yan Lin,Zeyu Zhou,Haomin Wen,Jilin Hu,Shengnan Guo,Youfang Lin,Gao Cong,Huaiyu Wan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vehicle GPS trajectories provide valuable movement information that supports various downstream tasks and applications. A desirable trajectory learning model should be able to transfer across regions and tasks without retraining, avoiding the need to maintain multiple specialized models and subpar performance with limited training data. However, each region has its unique spatial features and contexts, which are reflected in vehicle movement patterns and difficult to generalize. Additionally, transferring across different tasks faces technical challenges due to the varying input-output structures required for each task. Existing efforts towards transferability primarily involve learning embedding vectors for trajectories, which perform poorly in region transfer and require retraining of prediction modules for task transfer. To address these challenges, we propose TransferTraj, a vehicle GPS trajectory learning model that excels in both region and task transferability. For region transferability, we introduce RTTE as the main learnable module within TransferTraj. It integrates spatial, temporal, POI, and road network modalities of trajectories to effectively manage variations in spatial context distribution across regions. It also introduces a TRIE module for incorporating relative information of spatial features and a spatial context MoE module for handling movement patterns in diverse contexts. For task transferability, we propose a task-transferable input-output scheme that unifies the input-output structure of different tasks into the masking and recovery of modalities and trajectory points. This approach allows TransferTraj to be pre-trained once and transferred to different tasks without retraining. Extensive experiments on three real-world vehicle trajectory datasets under task transfer, zero-shot, and few-shot region transfer, validating TransferTraj’s effectiveness. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.12672 [cs.LG] (or arXiv:2505.12672v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.12672 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-59] Spiking Neural Network: a low power solution for physical layer authentication

链接: https://arxiv.org/abs/2505.12647
作者: Jung Hoon Lee,Sujith Vijayan
类目: Machine Learning (cs.LG)
*备注: 11 pages, 7 figures and 2 pages

点击查看摘要

Abstract:Deep learning (DL) is a powerful tool that can solve complex problems, and thus, it seems natural to assume that DL can be used to enhance the security of wireless communication. However, deploying DL models to edge devices in wireless networks is challenging, as they require significant amounts of computing and power resources. Notably, Spiking Neural Networks (SNNs) are known to be efficient in terms of power consumption, meaning they can be an alternative platform for DL models for edge devices. In this study, we ask if SNNs can be used in physical layer authentication. Our evaluation suggests that SNNs can learn unique physical properties (i.e., `fingerprints’) of RF transmitters and use them to identify individual devices. Furthermore, we find that SNNs are also vulnerable to adversarial attacks and that an autoencoder can be used clean out adversarial perturbations to harden SNNs against them.

[LG-60] Dual-Agent Reinforcement Learning for Automated Feature Generation

链接: https://arxiv.org/abs/2505.12628
作者: Wanfu Gao,Zengyao Man,Hanlin Pan,Kunpeng Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Feature generation involves creating new features from raw data to capture complex relationships among the original features, improving model robustness and machine learning performance. Current methods using reinforcement learning for feature generation have made feature exploration more flexible and efficient. However, several challenges remain: first, during feature expansion, a large number of redundant features are generated. When removing them, current methods only retain the best features each round, neglecting those that perform poorly initially but could improve later. Second, the state representation used by current methods fails to fully capture complex feature relationships. Third, there are significant differences between discrete and continuous features in tabular data, requiring different operations for each type. To address these challenges, we propose a novel dual-agent reinforcement learning method for feature generation. Two agents are designed: the first generates new features, and the second determines whether they should be preserved. A self-attention mechanism enhances state representation, and diverse operations distinguish interactions between discrete and continuous features. The experimental results on multiple datasets demonstrate that the proposed method is effective. The code is available at this https URL.

[LG-61] Adaptive Graph Unlearning IJCAI2025

链接: https://arxiv.org/abs/2505.12614
作者: Pengfei Ding,Yan Wang,Guanfeng Liu,Jiajie Zhu
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted by IJCAI 2025

点击查看摘要

Abstract:Graph unlearning, which deletes graph elements such as nodes and edges from trained graph neural networks (GNNs), is crucial for real-world applications where graph data may contain outdated, inaccurate, or privacy-sensitive information. However, existing methods often suffer from (1) incomplete or over unlearning due to neglecting the distinct objectives of different unlearning tasks, and (2) inaccurate identification of neighbors affected by deleted elements across various GNN architectures. To address these limitations, we propose AGU, a novel Adaptive Graph Unlearning framework that flexibly adapts to diverse unlearning tasks and GNN architectures. AGU ensures the complete forgetting of deleted elements while preserving the integrity of the remaining graph. It also accurately identifies affected neighbors for each GNN architecture and prioritizes important ones to enhance unlearning performance. Extensive experiments on seven real-world graphs demonstrate that AGU outperforms existing methods in terms of effectiveness, efficiency, and unlearning capability.

[LG-62] Action-Dependent Optimality-Preserving Reward Shaping ICML2025 AAMAS2025

链接: https://arxiv.org/abs/2505.12611
作者: Grant C. Forbes,Jianxun Wang,Leonardo Villalobos-Arias,Arnav Jhala,David L. Roberts
类目: Machine Learning (cs.LG)
*备注: Extended abstract at AAMAS 2025; full paper at ICML 2025

点击查看摘要

Abstract:Recent RL research has utilized reward shaping–particularly complex shaping rewards such as intrinsic motivation (IM)–to encourage agent exploration in sparse-reward environments. While often effective, ``reward hacking’’ can lead to the shaping reward being optimized at the expense of the extrinsic reward, resulting in a suboptimal policy. Potential-Based Reward Shaping (PBRS) techniques such as Generalized Reward Matching (GRM) and Policy-Invariant Explicit Shaping (PIES) have mitigated this. These methods allow for implementing IM without altering optimal policies. In this work we show that they are effectively unsuitable for complex, exploration-heavy environments with long-duration episodes. To remedy this, we introduce Action-Dependent Optimality Preserving Shaping (ADOPS), a method of converting intrinsic rewards to an optimality-preserving form that allows agents to utilize IM more effectively in the extremely sparse environment of Montezuma’s Revenge. We also prove ADOPS accommodates reward shaping functions that cannot be written in a potential-based form: while PBRS-based methods require the cumulative discounted intrinsic return be independent of actions, ADOPS allows for intrinsic cumulative returns to be dependent on agents’ actions while still preserving the optimal policy set. We show how action-dependence enables ADOPS’s to preserve optimality while learning in complex, sparse-reward environments where other methods struggle.

[LG-63] he Hamiltonian of Poly-matrix Zero-sum Games

链接: https://arxiv.org/abs/2505.12609
作者: Toshihiro Ota,Yuma Fujimoto
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Chaotic Dynamics (nlin.CD)
*备注: 26 pages, 4 figures

点击查看摘要

Abstract:Understanding a dynamical system fundamentally relies on establishing an appropriate Hamiltonian function and elucidating its symmetries. By formulating agents’ strategies and cumulative payoffs as canonically conjugate variables, we identify the Hamiltonian function that generates the dynamics of poly-matrix zero-sum games. We reveal the symmetries of our Hamiltonian and derive the associated conserved quantities, showing how the conservation of probability and the invariance of the Fenchel coupling are intrinsically encoded within the system. Furthermore, we propose the dissipation FTRL (DFTRL) dynamics by introducing a perturbation that dissipates the Fenchel coupling, proving convergence to the Nash equilibrium and linking DFTRL to last-iterate convergent algorithms. Our results highlight the potential of Hamiltonian dynamics in uncovering the structural properties of learning dynamics in games, and pave the way for broader applications of Hamiltonian dynamics in game theory and machine learning.

[LG-64] Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

链接: https://arxiv.org/abs/2505.12601
作者: Yang Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As large language models (LLMs) grow in scale and specialization, routing–selecting the best model for a given input–has become essential for efficient and effective deployment. While recent methods rely on complex learned routing strategies, their dependence on disparate training data and evaluation setups makes comparison and generalization difficult. In this work, we revisit LLM routing through the lens of simplicity. We show that a well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. To support systematic evaluation, we introduce a suite of standardized routing benchmarks spanning instruction-following, question-answering, and reasoning tasks, as well as the first multi-modal routing dataset involving visual inputs. Our findings reveal that the locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches. This challenges the prevailing trend toward sophisticated architectures and highlights the importance of thoroughly evaluating simple baselines before investing in complex solutions. To support reproducibility and further exploration, we will release all benchmarks and code upon publication.

[LG-65] Fast and Simple Densest Subgraph with Predictions

链接: https://arxiv.org/abs/2505.12600
作者: Thai Bui,Hoa T. Vu
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the densest subgraph problem and its variants through the lens of learning-augmented algorithms. For this problem, the greedy algorithm by Charikar (APPROX 2000) provides a linear-time 1/2 -approximation, while computing the exact solution typically requires solving a linear program or performing maximum flow this http URL show that given a partial solution, i.e., one produced by a machine learning classifier that captures at least a (1 - \epsilon) -fraction of nodes in the optimal subgraph, it is possible to design an extremely simple linear-time algorithm that achieves a provable (1 - \epsilon) -approximation. Our approach also naturally extends to the directed densest subgraph problem and several NP-hard this http URL experiment on the Twitch Ego Nets dataset shows that our learning-augmented algorithm outperforms Charikar’s greedy algorithm and a baseline that directly returns the predicted densest subgraph without additional algorithmic processing.

[LG-66] A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection

链接: https://arxiv.org/abs/2505.12586
作者: Sanggeon Yun,Ryozo Masukawa,Hyunwoo Oh,Nathaniel D. Bastian,Mohsen Imani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) are highly susceptible to adversarial examples–subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, heavy augmentations, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the A Few Large Shifts Assumption, which posits that adversarial perturbations typically induce large representation shifts in a small subset of layers. Building on this, we propose two complementary strategies–Recovery Testing (RT) and Logit-layer Testing (LT)–to expose internal disruptions caused by adversaries. Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive threat models, our method achieves state-of-the-art detection performance with negligible computational overhead and no compromise to clean accuracy.

[LG-67] Adaptive parameter-efficient fine-tuning via Hessian-informed subset selection

链接: https://arxiv.org/abs/2505.12579
作者: Shiyun Xu,Zhiqi Bu
类目: Machine Learning (cs.LG)
*备注: Equal contribution

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) is a highly effective approach for adapting large pre-trained models to downstream tasks with minimal computational overhead. At the core, PEFT methods freeze most parameters and only trains a small subset (say 0.1% of total parameters). Notably, different PEFT methods select different subsets, resulting in varying levels of performance. This variation prompts a key question: how to effectively select the most influential subset to train? We formulate the subset selection as a multi-task problem: maximizing the performance and minimizing the number of trainable parameters. We leverage a series of transformations – including \epsilon -constraint method and second-order Taylor approximation – to arrive at the classical 0-1 knapsack problem, which we solve through the lens of Pareto optimality. Consequently, we propose AdaPEFT, a Hessian-informed PEFT that adapts to various tasks and models, in which the selected subset empirically transfers across training horizons and model sizes. Comments: Equal contribution Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.12579 [cs.LG] (or arXiv:2505.12579v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.12579 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-68] HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing

链接: https://arxiv.org/abs/2505.12566
作者: Leyang Xue,Yao Fu,Luo Mai,Mahesh K. Marina
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Giant Deep Neural Networks (DNNs), have become indispensable for accurate and robust support of large-scale cloud based AI services. However, serving giant DNNs is prohibitively expensive from an energy consumption viewpoint easily exceeding that of training, due to the enormous scale of GPU clusters needed to hold giant DNN model partitions and replicas. Existing approaches can either optimize energy efficiency or inference accuracy but not both. To overcome this status quo, we propose HybridServe, a novel hybrid DNN model serving system that leverages multiple sized versions (small to giant) of the model to be served in tandem. Through a confidence based hybrid model serving dataflow, HybridServe prefers to serve inference requests with energy-efficient smaller models so long as accuracy is not compromised, thereby reducing the number of replicas needed for giant DNNs. HybridServe also features a dataflow planner for efficient partitioning and replication of candidate models to maximize serving system throughput. Experimental results using a prototype implementation of HybridServe show that it reduces energy footprint by up to 19.8x compared to the state-of-the-art DNN model serving systems while matching the accuracy of serving solely with giant DNNs.

[LG-69] Alternators With Noise Models

链接: https://arxiv.org/abs/2505.12544
作者: Mohammad R. Rezaei,Adji Bousso Dieng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Alternators have recently been introduced as a framework for modeling time-dependent data. They often outperform other popular frameworks, such as state-space models and diffusion models, on challenging time-series tasks. This paper introduces a new Alternator model, called Alternator++, which enhances the flexibility of traditional Alternators by explicitly modeling the noise terms used to sample the latent and observed trajectories, drawing on the idea of noise models from the diffusion modeling literature. Alternator++ optimizes the sum of the Alternator loss and a noise-matching loss. The latter forces the noise trajectories generated by the two noise models to approximate the noise trajectories that produce the observed and latent trajectories. We demonstrate the effectiveness of Alternator++ in tasks such as density estimation, time series imputation, and forecasting, showing that it outperforms several strong baselines, including Mambas, ScoreGrad, and Dyffusion.

[LG-70] Private Statistical Estimation via Truncation

链接: https://arxiv.org/abs/2505.12541
作者: Manolis Zampetakis,Felix Zhou
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a novel framework for differentially private (DP) statistical estimation via data truncation, addressing a key challenge in DP estimation when the data support is unbounded. Traditional approaches rely on problem-specific sensitivity analysis, limiting their applicability. By leveraging techniques from truncated statistics, we develop computationally efficient DP estimators for exponential family distributions, including Gaussian mean and covariance estimation, achieving near-optimal sample complexity. Previous works on exponential families only consider bounded or one-dimensional families. Our approach mitigates sensitivity through truncation while carefully correcting for the introduced bias using maximum likelihood estimation and DP stochastic gradient descent. Along the way, we establish improved uniform convergence guarantees for the log-likelihood function of exponential families, which may be of independent interest. Our results provide a general blueprint for DP algorithm design via truncated statistics.

[LG-71] Harnessing the Universal Geometry of Embeddings

链接: https://arxiv.org/abs/2505.12540
作者: Rishi Jha,Collin Zhang,Vitaly Shmatikov,John X. Morris
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets. The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.12540 [cs.LG] (or arXiv:2505.12540v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.12540 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-72] Framework of Voting Prediction of Parliament Members

链接: https://arxiv.org/abs/2505.12535
作者: Zahi Mizrahi,Shai Berkovitz,Nimrod Talmon,Michael Fire
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Keeping track of how lawmakers vote is essential for government transparency. While many parliamentary voting records are available online, they are often difficult to interpret, making it challenging to understand legislative behavior across parliaments and predict voting outcomes. Accurate prediction of votes has several potential benefits, from simplifying parliamentary work by filtering out bills with a low chance of passing to refining proposed legislation to increase its likelihood of approval. In this study, we leverage advanced machine learning and data analysis techniques to develop a comprehensive framework for predicting parliamentary voting outcomes across multiple legislatures. We introduce the Voting Prediction Framework (VPF) - a data-driven framework designed to forecast parliamentary voting outcomes at the individual legislator level and for entire bills. VPF consists of three key components: (1) Data Collection - gathering parliamentary voting records from multiple countries using APIs, web crawlers, and structured databases; (2) Parsing and Feature Integration - processing and enriching the data with meaningful features, such as legislator seniority, and content-based characteristics of a given bill; and (3) Prediction Models - using machine learning to forecast how each parliament member will vote and whether a bill is likely to pass. The framework will be open source, enabling anyone to use or modify the framework. To evaluate VPF, we analyzed over 5 million voting records from five countries - Canada, Israel, Tunisia, the United Kingdom and the USA. Our results show that VPF achieves up to 85% precision in predicting individual votes and up to 84% accuracy in predicting overall bill outcomes. These findings highlight VPF’s potential as a valuable tool for political analysis, policy research, and enhancing public access to legislative decision-making.

[LG-73] ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models

链接: https://arxiv.org/abs/2505.12534
作者: Adrian Mirza,Nawaf Alampara,Martiño Ríos-García,Mohamed Abdelalim,Jack Butler,Bethany Connolly,Tunca Dogan,Marianna Nezhurina,Bünyamin Şen,Santosh Tirunagari,Mark Worrall,Adamo Young,Philippe Schwaller,Michael Pieler,Kevin Maik Jablonka
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models have shown remarkable success across scientific domains, yet their impact in chemistry remains limited due to the absence of diverse, large-scale, high-quality datasets that reflect the field’s multifaceted nature. We present the ChemPile, an open dataset containing over 75 billion tokens of curated chemical data, specifically built for training and evaluating general-purpose models in the chemical sciences. The dataset mirrors the human learning journey through chemistry – from educational foundations to specialized expertise – spanning multiple modalities and content types including structured data in diverse chemical representations (SMILES, SELFIES, IUPAC names, InChI, molecular renderings), scientific and educational text, executable code, and chemical images. ChemPile integrates foundational knowledge (textbooks, lecture notes), specialized expertise (scientific articles and language-interfaced data), visual understanding (molecular structures, diagrams), and advanced reasoning (problem-solving traces and code) – mirroring how human chemists develop expertise through diverse learning materials and experiences. Constructed through hundreds of hours of expert curation, the ChemPile captures both foundational concepts and domain-specific complexity. We provide standardized training, validation, and test splits, enabling robust benchmarking. ChemPile is openly released via HuggingFace with a consistent API, permissive license, and detailed documentation. We hope the ChemPile will serve as a catalyst for chemical AI, enabling the development of the next generation of chemical foundation models.

[LG-74] Enforcing Fairness Where It Matters: An Approach Based on Difference-of-Convex Constraints

链接: https://arxiv.org/abs/2505.12530
作者: Yutian He,Yankun Huang,Yao Yao,Qihang Lin
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 43 pages, 10 figures, 1 table

点击查看摘要

Abstract:Fairness in machine learning has become a critical concern, particularly in high-stakes applications. Existing approaches often focus on achieving full fairness across all score ranges generated by predictive models, ensuring fairness in both high and low-scoring populations. However, this stringent requirement can compromise predictive performance and may not align with the practical fairness concerns of stakeholders. In this work, we propose a novel framework for building partially fair machine learning models, which enforce fairness within a specific score range of interest, such as the middle range where decisions are most contested, while maintaining flexibility in other regions. We introduce two statistical metrics to rigorously evaluate partial fairness within a given score range, such as the top 20%-40% of scores. To achieve partial fairness, we propose an in-processing method by formulating the model training problem as constrained optimization with difference-of-convex constraints, which can be solved by an inexact difference-of-convex algorithm (IDCA). We provide the complexity analysis of IDCA for finding a nearly KKT point. Through numerical experiments on real-world datasets, we demonstrate that our framework achieves high predictive performance while enforcing partial fairness where it matters most.

[LG-75] Never Skip a Batch: Continuous Training of Temporal GNNs via Adaptive Pseudo-Supervision

链接: https://arxiv.org/abs/2505.12526
作者: Alexander Panyshev,Dmitry Vinichenko,Oleg Travkin,Roman Alferov,Alexey Zaytsev
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporal Graph Networks (TGNs), while being accurate, face significant training inefficiencies due to irregular supervision signals in dynamic graphs, which induce sparse gradient updates. We first theoretically establish that aggregating historical node interactions into pseudo-labels reduces gradient variance, accelerating convergence. Building on this analysis, we propose History-Averaged Labels (HAL), a method that dynamically enriches training batches with pseudo-targets derived from historical label distributions. HAL ensures continuous parameter updates without architectural modifications by converting idle computation into productive learning steps. Experiments on the Temporal Graph Benchmark (TGB) validate our findings and an assumption about slow change of user preferences: HAL accelerates TGNv2 training by up to 15x while maintaining competitive performance. Thus, this work offers an efficient, lightweight, architecture-agnostic, and theoretically motivated solution to label sparsity in temporal graph learning.

[LG-76] HAKES: Scalable Vector Database for Embedding Search Service

链接: https://arxiv.org/abs/2505.12524
作者: Guoyu Hu,Shaofeng Cai,Tien Tuan Anh Dinh,Zhongle Xie,Cong Yue,Gang Chen,Beng Chin Ooi
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern deep learning models capture the semantics of complex data by transforming them into high-dimensional embedding vectors. Emerging applications, such as retrieval-augmented generation, use approximate nearest neighbor (ANN) search in the embedding vector space to find similar data. Existing vector databases provide indexes for efficient ANN searches, with graph-based indexes being the most popular due to their low latency and high recall in real-world high-dimensional datasets. However, these indexes are costly to build, suffer from significant contention under concurrent read-write workloads, and scale poorly to multiple servers. Our goal is to build a vector database that achieves high throughput and high recall under concurrent read-write workloads. To this end, we first propose an ANN index with an explicit two-stage design combining a fast filter stage with highly compressed vectors and a refine stage to ensure recall, and we devise a novel lightweight machine learning technique to fine-tune the index parameters. We introduce an early termination check to dynamically adapt the search process for each query. Next, we add support for writes while maintaining search performance by decoupling the management of the learned parameters. Finally, we design HAKES, a distributed vector database that serves the new index in a disaggregated architecture. We evaluate our index and system against 12 state-of-the-art indexes and three distributed vector databases, using high-dimensional embedding datasets generated by deep learning models. The experimental results show that our index outperforms index baselines in the high recall region and under concurrent read-write workloads. Furthermore, \namesys is scalable and achieves up to 16\times higher throughputs than the baselines. The HAKES project is open-sourced at this https URL. Subjects: Databases (cs.DB); Machine Learning (cs.LG) Cite as: arXiv:2505.12524 [cs.DB] (or arXiv:2505.12524v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2505.12524 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-77] Energy-Aware Deep Learning on Resource-Constrained Hardware

链接: https://arxiv.org/abs/2505.12523
作者: Josh Millar,Hamed Haddadi,Anil Madhavapeddy
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:The use of deep learning (DL) on Internet of Things (IoT) and mobile devices offers numerous advantages over cloud-based processing. However, such devices face substantial energy constraints to prolong battery-life, or may even operate intermittently via energy-harvesting. Consequently, \textitenergy-aware approaches for optimizing DL inference and training on such resource-constrained devices have garnered recent interest. We present an overview of such approaches, outlining their methodologies, implications for energy consumption and system-level efficiency, and their limitations in terms of supported network types, hardware platforms, and application scenarios. We hope our review offers a clear synthesis of the evolving energy-aware DL landscape and serves as a foundation for future research in energy-constrained computing.

[LG-78] Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought

链接: https://arxiv.org/abs/2505.12514
作者: Hanlin Zhu,Shibo Hao,Zhiting Hu,Jiantao Jiao,Stuart Russell,Yuandong Tian
类目: Machine Learning (cs.LG)
*备注: 26 pages, 7 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate ``thinking tokens’’ before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with D steps of continuous CoTs can solve the directed graph reachability problem, where D is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires O(n^2) decoding steps where n is the number of vertices ( Dn ). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoTs must choose a single path sampled from the superposition state, which leads to sequential search that requires many more steps and may be trapped into local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoTs, without explicit supervision to guide the model to explore multiple paths simultaneously.

[LG-79] InnateCoder: Learning Programmatic Options with Foundation Models IJCAI2025

链接: https://arxiv.org/abs/2505.12508
作者: Rubens O. Moraes,Quazi Asif Sadmine,Hendrik Baier,Levi H. S. Lelis
类目: Machine Learning (cs.LG)
*备注: Accepted at IJCAI 2025

点击查看摘要

Abstract:Outside of transfer learning settings, reinforcement learning agents start their learning process from a clean slate. As a result, such agents have to go through a slow process to learn even the most obvious skills required to solve a problem. In this paper, we present InnateCoder, a system that leverages human knowledge encoded in foundation models to provide programmatic policies that encode “innate skills” in the form of temporally extended actions, or options. In contrast to existing approaches to learning options, InnateCoder learns them from the general human knowledge encoded in foundation models in a zero-shot setting, and not from the knowledge the agent gains by interacting with the environment. Then, InnateCoder searches for a programmatic policy by combining the programs encoding these options into larger and more complex programs. We hypothesized that InnateCoder’s way of learning and using options could improve the sampling efficiency of current methods for learning programmatic policies. Empirical results in MicroRTS and Karel the Robot support our hypothesis, since they show that InnateCoder is more sample efficient than versions of the system that do not use options or learn them from experience.

[LG-80] γ-FedHT: Stepsize-Aware Hard-Threshold Gradient Compression in Federated Learning

链接: https://arxiv.org/abs/2505.12479
作者: Rongwei Lu,Yutong Jiang,Jinrui Zhang,Chunyang Li,Yifei Zhu,Bin Chen,Zhi Wang
类目: Machine Learning (cs.LG)
*备注: This article has been accepted for publication in IEEE INFOCOM 2025

点击查看摘要

Abstract:Gradient compression can effectively alleviate communication bottlenecks in Federated Learning (FL). Contemporary state-of-the-art sparse compressors, such as Top- k , exhibit high computational complexity, up to \mathcalO(d\log_2k) , where d is the number of model parameters. The hard-threshold compressor, which simply transmits elements with absolute values higher than a fixed threshold, is thus proposed to reduce the complexity to \mathcalO(d) . However, the hard-threshold compression causes accuracy degradation in FL, where the datasets are non-IID and the stepsize \gamma is decreasing for model convergence. The decaying stepsize reduces the updates and causes the compression ratio of the hard-threshold compression to drop rapidly to an aggressive ratio. At or below this ratio, the model accuracy has been observed to degrade severely. To address this, we propose \gamma -FedHT, a stepsize-aware low-cost compressor with Error-Feedback to guarantee convergence. Given that the traditional theoretical framework of FL does not consider Error-Feedback, we introduce the fundamental conversation of Error-Feedback. We prove that \gamma -FedHT has the convergence rate of \mathcalO(\frac1T) ( T representing total training iterations) under \mu -strongly convex cases and \mathcalO(\frac1\sqrtT) under non-convex cases, \textitsame as FedAVG. Extensive experiments demonstrate that \gamma -FedHT improves accuracy by up to 7.42% over Top- k under equal communication traffic on various non-IID image datasets.

[LG-81] Resolving Latency and Inventory Risk in Market Making with Reinforcement Learning

链接: https://arxiv.org/abs/2505.12465
作者: Junzhe Jiang,Chang Yang,Xinrun Wang,Zhiming Li,Xiao Huang,Bo Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The latency of the exchanges in Market Making (MM) is inevitable due to hardware limitations, system processing times, delays in receiving data from exchanges, the time required for order transmission to reach the market, etc. Existing reinforcement learning (RL) methods for Market Making (MM) overlook the impact of these latency, which can lead to unintended order cancellations due to price discrepancies between decision and execution times and result in undesired inventory accumulation, exposing MM traders to increased market risk. Therefore, these methods cannot be applied in real MM scenarios. To address these issues, we first build a realistic MM environment with random delays of 30-100 milliseconds for order placement and market information reception, and implement a batch matching mechanism that collects orders within every 500 milliseconds before matching them all at once, simulating the batch auction mechanisms adopted by some exchanges. Then, we propose Relaver, an RL-based method for MM to tackle the latency and inventory risk issues. The three main contributions of Relaver are: i) we introduce an augmented state-action space that incorporates order hold time alongside price and volume, enabling Relaver to optimize execution strategies under latency constraints and time-priority matching mechanisms, ii) we leverage dynamic programming (DP) to guide the exploration of RL training for better policies, iii) we train a market trend predictor, which can guide the agent to intelligently adjust the inventory to reduce the risk. Extensive experiments and ablation studies on four real-world datasets demonstrate that \textscRelaver significantly improves the performance of state-of-the-art RL-based MM strategies across multiple metrics.

[LG-82] A Finite-Sample Analysis of Distributionally Robust Averag e-Reward Reinforcement Learning

链接: https://arxiv.org/abs/2505.12462
作者: Zachary Roch,Chi Zhang,George Atia,Yue Wang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint, work in progress

点击查看摘要

Abstract:Robust reinforcement learning (RL) under the average-reward criterion is crucial for long-term decision making under potential environment mismatches, yet its finite-sample complexity study remains largely unexplored. Existing works offer algorithms with asymptotic guarantees, but the absence of finite-sample analysis hinders its principled understanding and practical deployment, especially in data-limited settings. We close this gap by proposing Robust Halpern Iteration (RHI), the first algorithm with provable finite-sample complexity guarantee. Under standard uncertainty sets – including contamination sets and \ell_p -norm balls – RHI attains an \epsilon -optimal policy with near-optimal sample complexity of \tilde\mathcal O\left(\fracSA\mathcal H^2\epsilon^2\right) , where S and A denote the numbers of states and actions, and \mathcal H is the robust optimal bias span. This result gives the first polynomial sample complexity guarantee for robust average-reward RL. Moreover, our RHI’s independence from prior knowledge distinguishes it from many previous average-reward RL studies. Our work thus constitutes a significant advancement in enhancing the practical applicability of robust average-reward methods to complex, real-world problems.

[LG-83] A Case for Library-Level k-Means Binning in Histogram Gradient-Boosted Trees

链接: https://arxiv.org/abs/2505.12460
作者: Asher Labovich
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern gradient-boosted decision trees (GBDTs) accelerate split finding with histogram-based binning, which reduces complexity from O(N) to O(B) given a fixed bin budget B. However, the predominant quantile binning strategy-designed to distribute data points evenly among bins-may overlook critical boundary values that could enhance predictive performance. In this work, we propose replacing quantile binning with a k-means discretizer initialized with quantile bins. We test this swap on 33 OpenML tasks plus synthetics that control for modality, skew, and bin budget. Across 18 regression datasets, k-means shows no statistically significant losses at the 5% level and wins in four cases-most strikingly a 55% MSE drop on one particularly skewed dataset-even though k-means’ mean reciprocal rank (MRR) is slightly lower (0.65 vs 0.72). On the 15 classification datasets the two methods are statistically tied (MRR 0.70 vs 0.68) with gaps \leq 0.2 pp. Synthetic experiments confirm consistently large MSE gains-typically 20% and rising to 90% as outlier magnitude increases or bin budget drops. We find that k-means keeps error on par with exact splitting when extra cuts add little value, yet still recovers key split points that quantile overlooks. As such, we advocate for a built-in bin_method=k-means flag, especially in regression tasks and in tight-budget settings such as the 32-64-bin GPU regime-because it is a “safe default” with large upside, yet adds only a one-off, cacheable overhead ( \approx 2s to bin 10M rows on one core).

[LG-84] AltLoRA: Towards Better Gradient Approximation in Low-Rank Adaptation with Alternating Projections

链接: https://arxiv.org/abs/2505.12455
作者: Xin Yu,Yujia Wang,Jinghui Chen,Lingzhou Xue
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) has emerged as an effective technique for reducing memory overhead in fine-tuning large language models. However, it often suffers from sub-optimal performance compared with full fine-tuning since the update is constrained in the low-rank space. Recent variants such as LoRA-Pro attempt to mitigate this by adjusting the gradients of the low-rank matrices to approximate the full gradient. However, LoRA-Pro’s solution is not unique, and different solutions can lead to significantly varying performance in ablation studies. Besides, to incorporate momentum or adaptive optimization design, approaches like LoRA-Pro must first compute the equivalent gradient, causing a higher memory cost close to full fine-tuning. A key challenge remains in integrating momentum properly into the low-rank space with lower memory cost. In this work, we propose AltLoRA, an alternating projection method that avoids the difficulties in gradient approximation brought by the joint update design, meanwhile integrating momentum without higher memory complexity. Our theoretical analysis provides convergence guarantees and further shows that AltLoRA enables stable feature learning and robustness to transformation invariance. Extensive experiments across multiple tasks demonstrate that AltLoRA outperforms LoRA and its variants, narrowing the gap toward full fine-tuning while preserving superior memory efficiency.

[LG-85] A Learning-Based Ansatz Satisfying Boundary Conditions in Variational Problems

链接: https://arxiv.org/abs/2505.12430
作者: Rafael Florencio,Julio Guerrero
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, innovative adaptations of the Ritz Method incorporating deep learning have been developed, known as the Deep Ritz Method. This approach employs a neural network as the test function for variational problems. However, the neural network does not inherently satisfy the boundary conditions of the variational problem. To resolve this issue, the Deep Ritz Method introduces a penalty term into the functional of the variational problem, which can lead to misleading results during the optimization process. In this work, an ansatz is proposed that inherently satisfies the boundary conditions of the variational problem. The results demonstrate that the proposed ansatz not only eliminates misleading outcomes but also reduces complexity while maintaining accuracy, showcasing its practical effectiveness in addressing variational problems.

[LG-86] Embedding principle of homogeneous neural network for classification problem

链接: https://arxiv.org/abs/2505.12419
作者: Jiahan Zhang,Tao Luo,Yaoyu Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Understanding the convergence points and optimization landscape of neural networks is crucial, particularly for homogeneous networks where Karush-Kuhn-Tucker (KKT) points of the associated maximum-margin problem often characterize solutions. This paper investigates the relationship between such KKT points across networks of different widths generated via neuron splitting. We introduce and formalize the \textbfKKT point embedding principle, establishing that KKT points of a homogeneous network’s max-margin problem ( P_\Phi ) can be embedded into the KKT points of a larger network’s problem ( P_\tilde\Phi ) via specific linear isometric transformations corresponding to neuron splitting. We rigorously prove this principle holds for neuron splitting in both two-layer and deep homogeneous networks. Furthermore, we connect this static embedding to the dynamics of gradient flow training with smooth losses. We demonstrate that trajectories initiated from appropriately mapped points remain mapped throughout training and that the resulting \omega -limit sets of directions are correspondingly mapped ( T(L(\theta(0))) = L(\boldsymbol\eta(0)) ), thereby preserving the alignment with KKT directions dynamically when directional convergence occurs. Our findings offer insights into the effects of network width, parameter redundancy, and the structural connections between solutions found via optimization in homogeneous networks of varying sizes.

[LG-87] It Takes a Graph to Know a Graph: Rewiring for Homophily with a Reference Graph

链接: https://arxiv.org/abs/2505.12411
作者: Harel Mendelman,Haggai Maron,Ronen Talmon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) excel at analyzing graph-structured data but struggle on heterophilic graphs, where connected nodes often belong to different classes. While this challenge is commonly addressed with specialized GNN architectures, graph rewiring remains an underexplored strategy in this context. We provide theoretical foundations linking edge homophily, GNN embedding smoothness, and node classification performance, motivating the need to enhance homophily. Building on this insight, we introduce a rewiring framework that increases graph homophily using a reference graph, with theoretical guarantees on the homophily of the rewired graph. To broaden applicability, we propose a label-driven diffusion approach for constructing a homophilic reference graph from node features and training labels. Through extensive simulations, we analyze how the homophily of both the original and reference graphs influences the rewired graph homophily and downstream GNN performance. We evaluate our method on 11 real-world heterophilic datasets and show that it outperforms existing rewiring techniques and specialized GNNs for heterophilic graphs, achieving improved node classification accuracy while remaining efficient and scalable to large graphs.

[LG-88] Engineering application of physics-informed neural networks for Saint-Venant torsion

链接: https://arxiv.org/abs/2505.12389
作者: Su Yeong Jo,Sanghyeon Park,Seungchan Ko,Jongcheon Park,Hosung Kim,Sangseung Lee,Joongoo Jeon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Saint-Venant torsion theory is a classical theory for analyzing the torsional behavior of structural components, and it remains critically important in modern computational design workflows. Conventional numerical methods, including the finite element method (FEM), typically rely on mesh-based approaches to obtain approximate solutions. However, these methods often require complex and computationally intensive techniques to overcome the limitations of approximation, leading to significant increases in computational cost. The objective of this study is to develop a series of novel numerical methods based on physics-informed neural networks (PINN) for solving the Saint-Venant torsion equations. Utilizing the expressive power and the automatic differentiation capability of neural networks, the PINN can solve partial differential equations (PDEs) along with boundary conditions without the need for intricate computational techniques. First, a PINN solver was developed to compute the torsional constant for bars with arbitrary cross-sectional geometries. This was followed by the development of a solver capable of handling cases with sharp geometric transitions; variable-scaling PINN (VS-PINN). Finally, a parametric PINN was constructed to address the limitations of conventional single-instance PINN. The results from all three solvers showed good agreement with reference solutions, demonstrating their accuracy and robustness. Each solver can be selectively utilized depending on the specific requirements of torsional behavior analysis.

[LG-89] Neural Thermodynamics I: Entropic Forces in Deep and Universal Representation Learning

链接: https://arxiv.org/abs/2505.12387
作者: Liu Ziyin,Yizhou Xu,Isaac Chuang
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Mathematical Physics (math-ph); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注: preprint

点击查看摘要

Abstract:With the rapid discovery of emergent phenomena in deep learning and large language models, explaining and understanding their cause has become an urgent need. Here, we propose a rigorous entropic-force theory for understanding the learning dynamics of neural networks trained with stochastic gradient descent (SGD) and its variants. Building on the theory of parameter symmetries and an entropic loss landscape, we show that representation learning is crucially governed by emergent entropic forces arising from stochasticity and discrete-time updates. These forces systematically break continuous parameter symmetries and preserve discrete ones, leading to a series of gradient balance phenomena that resemble the equipartition property of thermal systems. These phenomena, in turn, (a) explain the universal alignment of neural representations between AI models and lead to a proof of the Platonic Representation Hypothesis, and (b) reconcile the seemingly contradictory observations of sharpness- and flatness-seeking behavior of deep learning optimization. Our theory and experiments demonstrate that a combination of entropic forces and symmetry breaking is key to understanding emergent phenomena in deep learning.

[LG-90] Graph-Reward-SQL: Execution-Free Reinforcement Learning for Text-to-SQL via Graph Matching and Stepwise Reward

链接: https://arxiv.org/abs/2505.12380
作者: Han Weng,Boyi Liu,Yuanfeng Song,Dun Zeng,Yingxiang Yang,Yi Zhan,Longjie Cui,Xiaoming Yin,Yang Sun
类目: Machine Learning (cs.LG); Databases (cs.DB); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has been widely adopted to enhance the performance of large language models (LLMs) on Text-to-SQL tasks. However, existing methods often rely on execution-based or LLM-based Bradley-Terry reward models. The former suffers from high execution latency caused by repeated database calls, whereas the latter imposes substantial GPU memory overhead, both of which significantly hinder the efficiency and scalability of RL pipelines. To this end, we propose a novel Text-to-SQL RL fine-tuning framework named Graph-Reward-SQL, which employs the GMNScore outcome reward model. We leverage SQL graph representations to provide accurate reward signals while significantly reducing inference time and GPU memory usage. Building on this foundation, we further introduce StepRTM, a stepwise reward model that provides intermediate supervision over Common Table Expression (CTE) subqueries. This encourages both functional correctness and structural clarity of SQL. Extensive comparative and ablation experiments on standard benchmarks, including Spider and BIRD, demonstrate that our method consistently outperforms existing reward models.

[LG-91] Early Prediction of In-Hospital ICU Mortality Using Innovative First-Day Data: A Review

链接: https://arxiv.org/abs/2505.12344
作者: Han Wang
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 23 pages, 1 table

点击查看摘要

Abstract:The intensive care unit (ICU) manages critically ill patients, many of whom face a high risk of mortality. Early and accurate prediction of in-hospital mortality within the first 24 hours of ICU admission is crucial for timely clinical interventions, resource optimization, and improved patient outcomes. Traditional scoring systems, while useful, often have limitations in predictive accuracy and adaptability. Objective: This review aims to systematically evaluate and benchmark innovative methodologies that leverage data available within the first day of ICU admission for predicting in-hospital mortality. We focus on advancements in machine learning, novel biomarker applications, and the integration of diverse data types.

[LG-92] OSS-Bench: Benchmark Generator for Coding LLM s

链接: https://arxiv.org/abs/2505.12331
作者: Yuancheng Jiang,Roland Yap,Zhenkai Liang
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 19 pages

点击查看摘要

Abstract:In light of the rapid adoption of AI coding assistants, LLM-assisted development has become increasingly prevalent, creating an urgent need for robust evaluation of generated code quality. Existing benchmarks often require extensive manual effort to create static datasets, rely on indirect or insufficiently challenging tasks, depend on non-scalable ground truth, or neglect critical low-level security evaluations, particularly memory-safety issues. In this work, we introduce OSS-Bench, a benchmark generator that automatically constructs large-scale, live evaluation tasks from real-world open-source software. OSS-Bench replaces functions with LLM-generated code and evaluates them using three natural metrics: compilability, functional correctness, and memory safety, leveraging robust signals like compilation failures, test-suite violations, and sanitizer alerts as ground truth. In our evaluation, the benchmark, instantiated as OSS-Bench(php) and OSS-Bench(sql), profiles 17 diverse LLMs, revealing insights such as intra-family behavioral patterns and inconsistencies between model size and performance. Our results demonstrate that OSS-Bench mitigates overfitting by leveraging the evolving complexity of OSS and highlights LLMs’ limited understanding of low-level code security via extended fuzzing experiments. Overall, OSS-Bench offers a practical and scalable framework for benchmarking the real-world coding capabilities of LLMs.

[LG-93] Neural Graduated Assignment for Maximum Common Edge Subgraphs

链接: https://arxiv.org/abs/2505.12325
作者: Chaolong Ying,Yingqi Ruan,Xuemin Chen,Yaomin Wang,Tianshu Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Maximum Common Edge Subgraph (MCES) problem is a crucial challenge with significant implications in domains such as biology and chemistry. Traditional approaches, which include transformations into max-clique and search-based algorithms, suffer from scalability issues when dealing with larger instances. This paper introduces ``Neural Graduated Assignment’’ (NGA), a simple, scalable, unsupervised-training-based method that addresses these limitations by drawing inspiration from the classical Graduated Assignment (GA) technique. Central to NGA is stacking of neural components that closely resemble the GA process, but with the reparameterization of learnable temperature into higher dimension. We further theoretically analyze the learning dynamics of NGA, showing its design leads to fast convergence, better exploration-exploitation tradeoff, and ability to escape local optima. Extensive experiments across MCES computation, graph similarity estimation, and graph retrieval tasks reveal that NGA not only significantly improves computation time and scalability on large instances but also enhances performance compared to existing methodologies. The introduction of NGA marks a significant advancement in the computation of MCES and offers insights into other assignment problems.

[LG-94] GraphFLEx: Structure Learning Framework for Large Expanding Graphs

链接: https://arxiv.org/abs/2505.12323
作者: Mohit Kataria,Nikita Malik,Sandeep Kumar,Jayadeva
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph structure learning is a core problem in graph-based machine learning, essential for uncovering latent relationships and ensuring model interpretability. However, most existing approaches are ill-suited for large-scale and dynamically evolving graphs, as they often require complete re-learning of the structure upon the arrival of new nodes and incur substantial computational and memory costs. In this work, we propose GraphFLEx: a unified and scalable framework for Graph Structure Learning in Large and Expanding Graphs. GraphFLEx mitigates the scalability bottlenecks by restricting edge formation to structurally relevant subsets of nodes identified through a combination of clustering and coarsening techniques. This dramatically reduces the search space and enables efficient, incremental graph updates. The framework supports 48 flexible configurations by integrating diverse choices of learning paradigms, coarsening strategies, and clustering methods, making it adaptable to a wide range of graph settings and learning objectives. Extensive experiments across 26 diverse datasets and Graph Neural Network architectures demonstrate that GraphFLEx achieves state-of-the-art performance with significantly improved scalability.

[LG-95] Efficient Federated Class-Incremental Learning of Pre-Trained Models via Task-agnostic Low-rank Residual Adaptation

链接: https://arxiv.org/abs/2505.12318
作者: Feng Yu,Jia Hu,Geyong Min
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Parameter-Efficient Fine-Tuning (FedPEFT) reduces communication and computation costs in federated fine-tuning of pre-trained models by updating only a small subset of model parameters. However, existing approaches assume static data distributions, failing to adequately address real-world scenarios where new classes continually emerge, particularly in Federated Class Incremental Learning (FCIL). FCIL faces two key challenges: catastrophic forgetting and performance degradation caused by non-IID data across clients. Unlike current methods that maintain separate task-specific components or suffer from aggregation noise during parameter aggregation, we propose Federated Task-agnostic Low-rank Residual Adaptation (Fed-TaLoRA), a novel parameter-efficient approach for fine-tuning in resource-constrained FCIL scenarios. Specifically, we fine-tune only shared task-agnostic LoRA parameters across sequential tasks, effectively mitigating catastrophic forgetting while enabling efficient knowledge transfer among clients. Based on a theoretical analysis of aggregation, we develop a novel residual weight update mechanism that ensures accurate knowledge consolidation with minimal overhead. Our methodological innovations are attributed to three key strategies: task-agnostic adaptation, post-aggregation model calibration, and strategic placement of LoRA modules. Extensive experiments on multiple benchmark datasets demonstrate that Fed-TaLoRA consistently outperforms state-of-the-art methods in diverse data heterogeneity scenarios while substantially reducing resource requirements.

[LG-96] SenseFlow: A Physics-Informed and Self-Ensembling Iterative Framework for Power Flow Estimation

链接: https://arxiv.org/abs/2505.12302
作者: Zhen Zhao,Wenqi Huang,Zicheng Wang,Jiaxuan Hou,Peng Li,Lei Bai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Power flow estimation plays a vital role in ensuring the stability and reliability of electrical power systems, particularly in the context of growing network complexities and renewable energy integration. However, existing studies often fail to adequately address the unique characteristics of power systems, such as the sparsity of network connections and the critical importance of the unique Slack node, which poses significant challenges in achieving high-accuracy estimations. In this paper, we present SenseFlow, a novel physics-informed and self-ensembling iterative framework that integrates two main designs, the Physics-Informed Power Flow Network (FlowNet) and Self-Ensembling Iterative Estimation (SeIter), to carefully address the unique properties of the power system and thereby enhance the power flow estimation. Specifically, SenseFlow enforces the FlowNet to gradually predict high-precision voltage magnitudes and phase angles through the iterative SeIter process. On the one hand, FlowNet employs the Virtual Node Attention and Slack-Gated Feed-Forward modules to facilitate efficient global-local communication in the face of network sparsity and amplify the influence of the Slack node on angle predictions, respectively. On the other hand, SeIter maintains an exponential moving average of FlowNet’s parameters to create a robust ensemble model that refines power state predictions throughout the iterative fitting process. Experimental results demonstrate that SenseFlow outperforms existing methods, providing a promising solution for high-accuracy power flow estimation across diverse grid configurations.

[LG-97] BOLT: Block-Orthonormal Lanczos for Trace estimation of matrix functions

链接: https://arxiv.org/abs/2505.12289
作者: Kingsley Yeon,Promit Ghosal,Mihai Anitescu
类目: Numerical Analysis (math.NA); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Efficient matrix trace estimation is essential for scalable computation of log-determinants, matrix norms, and distributional divergences. In many large-scale applications, the matrices involved are too large to store or access in full, making even a single matrix-vector (mat-vec) product infeasible. Instead, one often has access only to small subblocks of the matrix or localized matrix-vector products on restricted index sets. Hutch++ achieves optimal convergence rate but relies on randomized SVD and assumes full mat-vec access, making it difficult to apply in these constrained settings. We propose the Block-Orthonormal Stochastic Lanczos Quadrature (BOLT), which matches Hutch++ accuracy with a simpler implementation based on orthonormal block probes and Lanczos iterations. BOLT builds on the Stochastic Lanczos Quadrature (SLQ) framework, which combines random probing with Krylov subspace methods to efficiently approximate traces of matrix functions, and performs better than Hutch++ in near flat-spectrum regimes. To address memory limitations and partial access constraints, we introduce Subblock SLQ, a variant of BOLT that operates only on small principal submatrices. As a result, this framework yields a proxy KL divergence estimator and an efficient method for computing the Wasserstein-2 distance between Gaussians - both compatible with low-memory and partial-access regimes. We provide theoretical guarantees and demonstrate strong empirical performance across a range of high-dimensional settings.

[LG-98] SchoenbAt: Rethinking Attention with Polynomial basis

链接: https://arxiv.org/abs/2505.12252
作者: Yuhan Guo,Lizhong Ding,Yuwan Yang,Xuewei Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Kernelized attention extends the attention mechanism by modeling sequence correlations through kernel functions, making significant progresses in optimizing attention. Under the guarantee of harmonic analysis theory, kernel functions can be expanded with basis functions, inspiring random feature-based approaches to enhance the efficiency of kernelized attention while maintaining predictive performance. However, current random feature-based works are limited to the Fourier basis expansions under Bochner’s theorem. We propose Schoenberg’s theorem-based attention (SchoenbAt), which approximates dot-product kernelized attention with the polynomial basis under Schoenberg’s theorem via random Maclaurin features and applies a two-stage regularization to constrain the input space and restore the output scale, acting as a drop-in replacement of dot-product kernelized attention. Our theoretical proof of the unbiasedness and concentration error bound of SchoenbAt supports its efficiency and accuracy as a kernelized attention approximation, which is also empirically validated under various random feature dimensions. Evaluations on real-world datasets demonstrate that SchoenbAt significantly enhances computational speed while preserving competitive performance in terms of precision, outperforming several efficient attention methods.

[LG-99] ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates

链接: https://arxiv.org/abs/2505.12242
作者: Tingfeng Lan,Yusen Wu,Bin Ma,Zhaoyuan Su,Rui Yang,Tekin Bicer,Dong Li,Yue Cheng
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 13 pages, 16 figures

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) often exceeds GPU memory limits, prompting systems to offload model states to CPU memory. However, existing offloaded training frameworks like ZeRO-Offload treat all parameters equally and update the full model on the CPU, causing severe GPU stalls, where fast, expensive GPUs sit idle waiting for slow CPU updates and limited-bandwidth PCIe transfers. We present ZenFlow, a new offloading framework that prioritizes important parameters and decouples updates between GPU and CPU. ZenFlow performs in-place updates of important gradients on GPU, while asynchronously offloading and accumulating less important ones on CPU, fully overlapping CPU work with GPU computation. To scale across GPUs, ZenFlow introduces a lightweight gradient selection method that exploits a novel spatial and temporal locality property of important gradients, avoiding costly global synchronization. ZenFlow achieves up to 5x end-to-end speedup, 2x lower PCIe traffic, and reduces GPU stalls by over 85 percent, all while preserving accuracy. Comments: 13 pages, 16 figures Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) ACMclasses: C.1.4; D.4.7 Cite as: arXiv:2505.12242 [cs.DC] (or arXiv:2505.12242v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2505.12242 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-100] Machine Learning Applications Related to Suicide in Military and Veterans: A Scoping Literature Review

链接: https://arxiv.org/abs/2505.12220
作者: Yuhan Zhang,Yishu Wei,Yanshan Wang,Yunyu Xiao, COL (Ret.)Ronald K. Poropatich,Gretchen L. Haas,Yiye Zhang,Chunhua Weng,Jinze Liu,Lisa A. Brenner,James M. Bjork,Yifan Peng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Suicide remains one of the main preventable causes of death among active service members and veterans. Early detection and prediction are crucial in suicide prevention. Machine learning techniques have yielded promising results in this area recently. This study aims to assess and summarize current research and provides a comprehensive review regarding the application of machine learning techniques in assessing and predicting suicidal ideation, attempts, and mortality among members of military and veteran populations. A keyword search using PubMed, IEEE, ACM, and Google Scholar was conducted, and the PRISMA protocol was adopted for relevant study selection. Thirty-two articles met the inclusion criteria. These studies consistently identified risk factors relevant to mental health issues such as depression, post-traumatic stress disorder (PTSD), suicidal ideation, prior attempts, physical health problems, and demographic characteristics. Machine learning models applied in this area have demonstrated reasonable predictive accuracy. However, additional research gaps still exist. First, many studies have overlooked metrics that distinguish between false positives and negatives, such as positive predictive value and negative predictive value, which are crucial in the context of suicide prevention policies. Second, more dedicated approaches to handling survival and longitudinal data should be explored. Lastly, most studies focused on machine learning methods, with limited discussion of their connection to clinical rationales. In summary, machine learning analyses have identified a wide range of risk factors associated with suicide in military populations. The diversity and complexity of these factors also demonstrates that effective prevention strategies must be comprehensive and flexible. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2505.12220 [cs.LG] (or arXiv:2505.12220v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.12220 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yishu Wei [view email] [v1] Sun, 18 May 2025 03:38:33 UTC (382 KB)

[LG-101] Of Mice and Machines: A Comparison of Learning Between Real World Mice and RL Agents

链接: https://arxiv.org/abs/2505.12204
作者: Shuo Han,German Espinosa,Junda Huang,Daniel A. Dombeck,Malcolm A. MacIver,Bradly C. Stadie
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 19 pages

点击查看摘要

Abstract:Recent advances in reinforcement learning (RL) have demonstrated impressive capabilities in complex decision-making tasks. This progress raises a natural question: how do these artificial systems compare to biological agents, which have been shaped by millions of years of evolution? To help answer this question, we undertake a comparative study of biological mice and RL agents in a predator-avoidance maze environment. Through this analysis, we identify a striking disparity: RL agents consistently demonstrate a lack of self-preservation instinct, readily risking ``death’’ for marginal efficiency gains. These risk-taking strategies are in contrast to biological agents, which exhibit sophisticated risk-assessment and avoidance behaviors. Towards bridging this gap between the biological and artificial, we propose two novel mechanisms that encourage more naturalistic risk-avoidance behaviors in RL agents. Our approach leads to the emergence of naturalistic behaviors, including strategic environment assessment, cautious path planning, and predator avoidance patterns that closely mirror those observed in biological systems.

[LG-102] Near-Optimal Sample Complexities of Divergence-based S-rectangular Distributionally Robust Reinforcement Learning

链接: https://arxiv.org/abs/2505.12202
作者: Zhenghao Li,Shengbo Wang,Nian Si
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conservatism, and computational traceability, the literature has introduced DR-RL models with SA-rectangular and S-rectangular adversaries. While most existing statistical analyses focus on SA-rectangular models, owing to their algorithmic simplicity and the optimality of deterministic policies, S-rectangular models more accurately capture distributional discrepancies in many real-world applications and often yield more effective robust randomized policies. In this paper, we study the empirical value iteration algorithm for divergence-based S-rectangular DR-RL and establish near-optimal sample complexity bounds of \widetildeO(|\mathcalS||\mathcalA|(1-\gamma)^-4\varepsilon^-2) , where \varepsilon is the target accuracy, |\mathcalS| and |\mathcalA| denote the cardinalities of the state and action spaces, and \gamma is the discount factor. To the best of our knowledge, these are the first sample complexity results for divergence-based S-rectangular models that achieve optimal dependence on |\mathcalS| , |\mathcalA| , and \varepsilon simultaneously. We further validate this theoretical dependence through numerical experiments on a robust inventory control problem and a theoretical worst-case example, demonstrating the fast learning performance of our proposed algorithm.

[LG-103] BenSParX: A Robust Explainable Machine Learning Framework for Parkinsons Disease Detection from Bengali Conversational Speech

链接: https://arxiv.org/abs/2505.12192
作者: Riad Hossain,Muhammad Ashad Kabir,Arat Ibne Golam Mowla,Animesh Chandra Roy,Ranjit Kumar Ghosh
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 46 pages, 16 figures

点击查看摘要

Abstract:Parkinson’s disease (PD) poses a growing global health challenge, with Bangladesh experiencing a notable rise in PD-related mortality. Early detection of PD remains particularly challenging in resource-constrained settings, where voice-based analysis has emerged as a promising non-invasive and cost-effective alternative. However, existing studies predominantly focus on English or other major languages; notably, no voice dataset for PD exists for Bengali - posing a significant barrier to culturally inclusive and accessible healthcare solutions. Moreover, most prior studies employed only a narrow set of acoustic features, with limited or no hyperparameter tuning and feature selection strategies, and little attention to model explainability. This restricts the development of a robust and generalizable machine learning model. To address this gap, we present BenSparX, the first Bengali conversational speech dataset for PD detection, along with a robust and explainable machine learning framework tailored for early diagnosis. The proposed framework incorporates diverse acoustic feature categories, systematic feature selection methods, and state-of-the-art machine learning algorithms with extensive hyperparameter optimization. Furthermore, to enhance interpretability and trust in model predictions, the framework incorporates SHAP (SHapley Additive exPlanations) analysis to quantify the contribution of individual acoustic features toward PD detection. Our framework achieves state-of-the-art performance, yielding an accuracy of 95.77%, F1 score of 95.57%, and AUC-ROC of 0.982. We further externally validated our approach by applying the framework to existing PD datasets in other languages, where it consistently outperforms state-of-the-art approaches. To facilitate further research and reproducibility, the dataset has been made publicly available at this https URL.

[LG-104] Learning to Dissipate Energy in Oscillatory State-Space Models

链接: https://arxiv.org/abs/2505.12171
作者: Jared Boyer,T. Konstantin Rusch,Daniela Rus
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages, 2 figures

点击查看摘要

Abstract:State-space models (SSMs) are a class of networks for sequence learning that benefit from fixed state size and linear complexity with respect to sequence length, contrasting the quadratic scaling of typical attention mechanisms. Inspired from observations in neuroscience, Linear Oscillatory State-Space models (LinOSS) are a recently proposed class of SSMs constructed from layers of discretized forced harmonic oscillators. Although these models perform competitively, leveraging fast parallel scans over diagonal recurrent matrices and achieving state-of-the-art performance on tasks with sequence length up to 50k, LinOSS models rely on rigid energy dissipation (“forgetting”) mechanisms that are inherently coupled to the timescale of state evolution. As forgetting is a crucial mechanism for long-range reasoning, we demonstrate the representational limitations of these models and introduce Damped Linear Oscillatory State-Space models (D-LinOSS), a more general class of oscillatory SSMs that learn to dissipate latent state energy on multiple timescales. We analyze the spectral distribution of the model’s recurrent matrices and prove that the SSM layers exhibit stable dynamics under simple, flexible parameterizations. D-LinOSS consistently outperforms previous LinOSS methods on long-range learning tasks, without introducing additional complexity, and simultaneously reduces the hyperparameter search space by 50%.

[LG-105] FABLE: A Localized Targeted Adversarial Attack on Weather Forecasting Models

链接: https://arxiv.org/abs/2505.12167
作者: Yue Deng,Asadullah Hill Galib,Xin Lan,Pang-Ning Tan,Lifeng Luo
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Deep learning-based weather forecasting models have recently demonstrated significant performance improvements over gold-standard physics-based simulation tools. However, these models are vulnerable to adversarial attacks, which raises concerns about their trustworthiness. In this paper, we first investigate the feasibility of applying existing adversarial attack methods to weather forecasting models. We argue that a successful attack should (1) not modify significantly its original inputs, (2) be faithful, i.e., achieve the desired forecast at targeted locations with minimal changes to non-targeted locations, and (3) be geospatio-temporally realistic. However, balancing these criteria is a challenge as existing methods are not designed to preserve the geospatio-temporal dependencies of the original samples. To address this challenge, we propose a novel framework called FABLE (Forecast Alteration By Localized targeted advErsarial attack), which employs a 3D discrete wavelet decomposition to extract the varying components of the geospatio-temporal data. By regulating the magnitude of adversarial perturbations across different components, FABLE can generate adversarial inputs that maintain geospatio-temporal coherence while remaining faithful and closely aligned with the original inputs. Experimental results on multiple real-world datasets demonstrate the effectiveness of our framework over baseline methods across various metrics.

[LG-106] Improving Energy Natural Gradient Descent through Woodbury Momentum and Randomization

链接: https://arxiv.org/abs/2505.12149
作者: Andrés Guzmán-Cordero,Felix Dangel,Gil Goldshlager,Marius Zeinhofer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Natural gradient methods significantly accelerate the training of Physics-Informed Neural Networks (PINNs), but are often prohibitively costly. We introduce a suite of techniques to improve the accuracy and efficiency of energy natural gradient descent (ENGD) for PINNs. First, we leverage the Woodbury formula to dramatically reduce the computational complexity of ENGD. Second, we adapt the Subsampled Projected-Increment Natural Gradient Descent algorithm from the variational Monte Carlo literature to accelerate the convergence. Third, we explore the use of randomized algorithms to further reduce the computational cost in the case of large batch sizes. We find that randomization accelerates progress in the early stages of training for low-dimensional problems, and we identify key barriers to attaining acceleration in other scenarios. Our numerical experiments demonstrate that our methods outperform previous approaches, achieving the same L^2 error as the original ENGD up to 75\times faster.

[LG-107] Causal Machine Learning in IoT-based Engineering Problems: A Tool Comparison in the Case of Household Energy Consumption

链接: https://arxiv.org/abs/2505.12147
作者: Nikolaos-Lysias Kosioris,Sotirios Nikoletseas,Gavrilis Filios,Stefanos Panagiotou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid increase in computing power and the ability to store Big Data in the infrastructure has enabled predictions in a large variety of domains by Machine Learning. However, in many cases, existing Machine Learning tools are considered insufficient or incorrect since they exploit only probabilistic dependencies rather than inference logic. Causal Machine Learning methods seem to close this gap. In this paper, two prevalent tools based on Causal Machine Learning methods are compared, as well as their mathematical underpinning background. The operation of the tools is demonstrated by examining their response to 18 queries, based on the IDEAL Household Energy Dataset, published by the University of Edinburgh. First, it was important to evaluate the causal relations assumption that allowed the use of this approach; this was based on the preexisting scientific knowledge of the domain and was implemented by use of the in-built validation tools. Results were encouraging and may easily be extended to other domains.

[LG-108] ransformer learns the cross-task prior and regularization for in-context learning

链接: https://arxiv.org/abs/2505.12138
作者: Fei Lu,Yue Yu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Transformers have shown a remarkable ability for in-context learning (ICL), making predictions based on contextual examples. However, while theoretical analyses have explored this prediction capability, the nature of the inferred context and its utility for downstream predictions remain open questions. This paper aims to address these questions by examining ICL for inverse linear regression (ILR), where context inference can be characterized by unsupervised learning of underlying weight vectors. Focusing on the challenging scenario of rank-deficient inverse problems, where context length is smaller than the number of unknowns in the weight vectors and regularization is necessary, we introduce a linear transformer to learn the inverse mapping from contextual examples to the underlying weight vector. Our findings reveal that the transformer implicitly learns both a prior distribution and an effective regularization strategy, outperforming traditional ridge regression and regularization methods. A key insight is the necessity of low task dimensionality relative to the context length for successful learning. Furthermore, we numerically verify that the error of the transformer estimator scales linearly with the noise level, the ratio of task dimension to context length, and the condition number of the input data. These results not only demonstrate the potential of transformers for solving ill-posed inverse problems, but also provide a new perspective towards understanding the knowledge extraction mechanism within transformers.

[LG-109] Understanding the Capabilities of Molecular Graph Neural Networks in Materials Science Through Multimodal Learning and Physical Context Encoding CVPR2025

链接: https://arxiv.org/abs/2505.12137
作者: Can Polat,Hasan Kurban,Erchin Serpedin,Mustafa Kurban
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: Accepted Spotlight Paper at CVPR 2025 for MM4Mat

点击查看摘要

Abstract:Molecular graph neural networks (GNNs) often focus exclusively on XYZ-based geometric representations and thus overlook valuable chemical context available in public databases like PubChem. This work introduces a multimodal framework that integrates textual descriptors, such as IUPAC names, molecular formulas, physicochemical properties, and synonyms, alongside molecular graphs. A gated fusion mechanism balances geometric and textual features, allowing models to exploit complementary information. Experiments on benchmark datasets indicate that adding textual data yields notable improvements for certain electronic properties, while gains remain limited for others. Furthermore, the GNN architectures display similar performance patterns (improving and deteriorating on analogous targets), suggesting they learn comparable representations rather than distinctly different physical insights.

[LG-110] owards Sustainability in 6G Network Slicing with Energy-Saving and Optimization Methods

链接: https://arxiv.org/abs/2505.12132
作者: Rodrigo Moreira,Tereza C. M. Carvalho,Flávio de Oliveira Silva,Nazim Agoulmine,Joberto S. B. Martins
类目: Networking and Internet Architecture (cs.NI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, International Workshop on ADVANCEs in ICT Infrastructures and Services, 2025

点击查看摘要

Abstract:The 6G mobile network is the next evolutionary step after 5G, with a prediction of an explosive surge in mobile traffic. It provides ultra-low latency, higher data rates, high device density, and ubiquitous coverage, positively impacting services in various areas. Energy saving is a major concern for new systems in the telecommunications sector because all players are expected to reduce their carbon footprints to contribute to mitigating climate change. Network slicing is a fundamental enabler for 6G/5G mobile networks and various other new systems, such as the Internet of Things (IoT), Internet of Vehicles (IoV), and Industrial IoT (IIoT). However, energy-saving methods embedded in network slicing architectures are still a research gap. This paper discusses how to embed energy-saving methods in network-slicing architectures that are a fundamental enabler for nearly all new innovative systems being deployed worldwide. This paper’s main contribution is a proposal to save energy in network slicing. That is achieved by deploying ML-native agents in NS architectures to dynamically orchestrate and optimize resources based on user demands. The SFI2 network slicing reference architecture is the concrete use case scenario in which contrastive learning improves energy saving for resource allocation.

[LG-111] Metric Graph Kernels via the Tropical Torelli Map

链接: https://arxiv.org/abs/2505.12129
作者: Yueqi Cao,Anthea Monod
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:We propose new graph kernels grounded in the study of metric graphs via tropical algebraic geometry. In contrast to conventional graph kernels that are based on graph combinatorics such as nodes, edges, and subgraphs, our graph kernels are purely based on the geometry and topology of the underlying metric space. A key characterizing property of our construction is its invariance under edge subdivision, making the kernels intrinsically well-suited for comparing graphs that represent different underlying spaces. We develop efficient algorithms for computing these kernels and analyze their complexity, showing that it depends primarily on the genus of the input graphs. Empirically, our kernels outperform existing methods in label-free settings, as demonstrated on both synthetic and real-world benchmark datasets. We further highlight their practical utility through an urban road network classification task.

[LG-112] Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD

链接: https://arxiv.org/abs/2505.12128
作者: Nikita P. Kalinin,Ryan McKenna,Jalaj Upadhyay,Christoph H. Lampert
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Matrix factorization mechanisms for differentially private training have emerged as a promising approach to improve model utility under privacy constraints. In practical settings, models are typically trained over multiple epochs, requiring matrix factorizations that account for repeated participation. Existing theoretical upper and lower bounds on multi-epoch factorization error leave a significant gap. In this work, we introduce a new explicit factorization method, Banded Inverse Square Root (BISR), which imposes a banded structure on the inverse correlation matrix. This factorization enables us to derive an explicit and tight characterization of the multi-epoch error. We further prove that BISR achieves asymptotically optimal error by matching the upper and lower bounds. Empirically, BISR performs on par with state-of-the-art factorization methods, while being simpler to implement, computationally efficient, and easier to analyze.

[LG-113] Discovering Symbolic Differential Equations with Symmetry Invariants

链接: https://arxiv.org/abs/2505.12083
作者: Jianke Yang,Manu Bhat,Bryan Hu,Yadi Cao,Nima Dehmamy,Robin Walters,Rose Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discovering symbolic differential equations from data uncovers fundamental dynamical laws underlying complex systems. However, existing methods often struggle with the vast search space of equations and may produce equations that violate known physical laws. In this work, we address these problems by introducing the concept of \textitsymmetry invariants in equation discovery. We leverage the fact that differential equations admitting a symmetry group can be expressed in terms of differential invariants of symmetry transformations. Thus, we propose to use these invariants as atomic entities in equation discovery, ensuring the discovered equations satisfy the specified symmetry. Our approach integrates seamlessly with existing equation discovery methods such as sparse regression and genetic programming, improving their accuracy and efficiency. We validate the proposed method through applications to various physical systems, such as fluid and reaction-diffusion, demonstrating its ability to recover parsimonious and interpretable equations that respect the laws of physics.

[LG-114] Unsupervised Port Berth Identification from Automatic Identification System Data

链接: https://arxiv.org/abs/2505.12046
作者: Andreas Hadjipieris,Neofytos Dimitriou,Ognjen Arandjelović
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Port berthing sites are regions of high interest for monitoring and optimizing port operations. Data sourced from the Automatic Identification System (AIS) can be superimposed on berths enabling their real-time monitoring and revealing long-term utilization patterns. Ultimately, insights from multiple berths can uncover bottlenecks, and lead to the optimization of the underlying supply chain of the port and beyond. However, publicly available documentation of port berths, even when available, is frequently incomplete - e.g. there may be missing berths or inaccuracies such as incorrect boundary boxes - necessitating a more robust, data-driven approach to port berth localization. In this context, we propose an unsupervised spatial modeling method that leverages AIS data clustering and hyperparameter optimization to identify berthing sites. Trained on one month of freely available AIS data and evaluated across ports of varying sizes, our models significantly outperform competing methods, achieving a mean Bhattacharyya distance of 0.85 when comparing Gaussian Mixture Models (GMMs) trained on separate data splits, compared to 13.56 for the best existing method. Qualitative comparison with satellite images and existing berth labels further supports the superiority of our method, revealing more precise berth boundaries and improved spatial resolution across diverse port environments.

[LG-115] FlashBias: Fast Computation of Attention with Bias

链接: https://arxiv.org/abs/2505.12044
作者: Haixu Wu,Minghao Guo,Yuezhou Ma,Yuanxu Sun,Jianmin Wang,Wojciech Matusik,Mingsheng Long
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Attention mechanism has emerged as a foundation module of modern deep learning models and has also empowered many milestones in various domains. Moreover, FlashAttention with IO-aware speedup resolves the efficiency issue of standard attention, further promoting its practicality. Beyond canonical attention, attention with bias also widely exists, such as relative position bias in vision and language models and pair representation bias in AlphaFold. In these works, prior knowledge is introduced as an additive bias term of attention weights to guide the learning process, which has been proven essential for model performance. Surprisingly, despite the common usage of attention with bias, its targeted efficiency optimization is still absent, which seriously hinders its wide applications in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formalization. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5 \times speedup for AlphaFold, and over 2 \times speedup for attention with bias in vision and language models without loss of accuracy.

[LG-116] Improving regional weather forecasts with neural interpolation

链接: https://arxiv.org/abs/2505.12040
作者: James Jackaman,Oliver Sutton
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper we design a neural interpolation operator to improve the boundary data for regional weather models, which is a challenging problem as we are required to map multi-scale dynamics between grid resolutions. In particular, we expose a methodology for approaching the problem through the study of a simplified model, with a view to generalise the results in this work to the dynamical core of regional weather models. Our approach will exploit a combination of techniques from image super-resolution with convolutional neural networks (CNNs) and residual networks, in addition to building the flow of atmospheric dynamics into the neural network

[LG-117] Adaptive Resolving Methods for Reinforcement Learning with Function Approximations

链接: https://arxiv.org/abs/2505.12037
作者: Jiashuo Jiang,Yiming Zong,Yinyu Ye
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) problems are fundamental in online decision-making and have been instrumental in finding an optimal policy for Markov decision processes (MDPs). Function approximations are usually deployed to handle large or infinite state-action space. In our work, we consider the RL problems with function approximation and we develop a new algorithm to solve it efficiently. Our algorithm is based on the linear programming (LP) reformulation and it resolves the LP at each iteration improved with new data arrival. Such a resolving scheme enables our algorithm to achieve an instance-dependent sample complexity guarantee, more precisely, when we have N data, the output of our algorithm enjoys an instance-dependent \tildeO(1/N) suboptimality gap. In comparison to the O(1/\sqrtN) worst-case guarantee established in the previous literature, our instance-dependent guarantee is tighter when the underlying instance is favorable, and the numerical experiments also reveal the efficient empirical performances of our algorithms.

[LG-118] Relation-Aware Graph Foundation Model

链接: https://arxiv.org/abs/2505.12027
作者: Jianxiang Yu,Jiapeng Zhu,Hao Qian,Ziqi Liu,Zhiqiang Zhang,Xiang Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, large language models (LLMs) have demonstrated remarkable generalization capabilities across various natural language processing (NLP) tasks. Similarly, graph foundation models (GFMs) have emerged as a promising direction in graph learning, aiming to generalize across diverse datasets through large-scale pre-training. However, unlike language models that rely on explicit token representations, graphs lack a well-defined unit for generalization, making it challenging to design effective pre-training strategies. In this work, we propose REEF, a novel framework that leverages relation tokens as the basic units for GFMs. Inspired by the token vocabulary in LLMs, we construct a relation vocabulary of relation tokens to store relational information within graphs. To accommodate diverse relations, we introduce two hypernetworks that adaptively generate the parameters of aggregators and classifiers in graph neural networks based on relation tokens. In addition, we design another hypernetwork to construct dataset-specific projectors and incorporate a dataset-level feature bias into the initial node representations, enhancing flexibility across different datasets with the same relation. Further, we adopt graph data augmentation and a mixed-dataset pre-training strategy, allowing REEF to capture relational diversity more effectively and exhibit strong generalization capabilities. Extensive experiments show that REEF significantly outperforms existing methods on both pre-training and transfer learning tasks, underscoring its potential as a powerful foundation model for graph-based applications.

[LG-119] Spotlight Your Instructions: Instruction-following with Dynamic Attention Steering

链接: https://arxiv.org/abs/2505.12025
作者: Praveen Venkateswaran,Danish Contractor
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many real-world applications, users rely on natural language instructions to guide large language models (LLMs) across a wide range of tasks. These instructions are often complex, diverse, and subject to frequent change. However, LLMs do not always attend to these instructions reliably, and users lack simple mechanisms to emphasize their importance beyond modifying prompt wording or structure. To address this, we present an inference-time method that enables users to emphasize specific parts of their prompt by steering the model’s attention toward them, aligning the model’s perceived importance of different prompt tokens with user intent. Unlike prior approaches that are limited to static instructions, require significant offline profiling, or rely on fixed biases, we dynamically update the proportion of model attention given to the user-specified parts–ensuring improved instruction following without performance degradation. We demonstrate that our approach improves instruction following across a variety of tasks involving multiple instructions and generalizes across models of varying scales.

[LG-120] FL-PLAS: Federated Learning with Partial Layer Aggregation for Backdoor Defense Against High-Ratio Malicious Clients

链接: https://arxiv.org/abs/2505.12019
作者: Jianyi Zhang,Ziyin Zhou,Yilong Li,Qichao Jin
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 20pages

点击查看摘要

Abstract:Federated learning (FL) is gaining increasing attention as an emerging collaborative machine learning approach, particularly in the context of large-scale computing and data systems. However, the fundamental algorithm of FL, Federated Averaging (FedAvg), is susceptible to backdoor attacks. Although researchers have proposed numerous defense algorithms, two significant challenges remain. The attack is becoming more stealthy and harder to detect, and current defense methods are unable to handle 50% or more malicious users or assume an auxiliary server dataset. To address these challenges, we propose a novel defense algorithm, FL-PLAS, \textbfFederated \textbfLearning based on \textbfPartial\textbf Layer \textbfAggregation \textbfStrategy. In particular, we divide the local model into a feature extractor and a classifier. In each iteration, the clients only upload the parameters of a feature extractor after local training. The server then aggregates these local parameters and returns the results to the clients. Each client retains its own classifier layer, ensuring that the backdoor labels do not impact other clients. We assess the effectiveness of FL-PLAS against state-of-the-art (SOTA) backdoor attacks on three image datasets and compare our approach to six defense strategies. The results of the experiment demonstrate that our methods can effectively protect local models from backdoor attacks. Without requiring any auxiliary dataset for the server, our method achieves a high main-task accuracy with a lower backdoor accuracy even under the condition of 90% malicious users with the attacks of trigger, semantic and edge-case. Comments: 20pages Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2505.12019 [cs.CR] (or arXiv:2505.12019v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2505.12019 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-121] Incentivize Contribution and Learn Parameters Too: Federated Learning with Strategic Data Owners

链接: https://arxiv.org/abs/2505.12010
作者: Drashthi Doshi,Aditya Vema Reddy Kesari,Swaprava Nath,Avishek Ghosh,Suhas S Kowshik
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 19 pages, 12 figures, under review

点击查看摘要

Abstract:Classical federated learning (FL) assumes that the clients have a limited amount of noisy data with which they voluntarily participate and contribute towards learning a global, more accurate model in a principled manner. The learning happens in a distributed fashion without sharing the data with the center. However, these methods do not consider the incentive of an agent for participating and contributing to the process, given that data collection and running a distributed algorithm is costly for the clients. The question of rationality of contribution has been asked recently in the literature and some results exist that consider this problem. This paper addresses the question of simultaneous parameter learning and incentivizing contribution, which distinguishes it from the extant literature. Our first mechanism incentivizes each client to contribute to the FL process at a Nash equilibrium and simultaneously learn the model parameters. However, this equilibrium outcome can be away from the optimal, where clients contribute with their full data and the algorithm learns the optimal parameters. We propose a second mechanism with monetary transfers that is budget balanced and enables the full data contribution along with optimal parameter learning. Large scale experiments with real (federated) datasets (CIFAR-10, FeMNIST, and Twitter) show that these algorithms converge quite fast in practice, yield good welfare guarantees, and better model performance for all agents.

[LG-122] Approximation theory for 1-Lipschitz ResNets

链接: https://arxiv.org/abs/2505.12003
作者: Davide Murari,Takashi Furuya,Carola-Bibiane Schönlieb
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:1-Lipschitz neural networks are fundamental for generative modelling, inverse problems, and robust classifiers. In this paper, we focus on 1-Lipschitz residual networks (ResNets) based on explicit Euler steps of negative gradient flows and study their approximation capabilities. Leveraging the Restricted Stone-Weierstrass Theorem, we first show that these 1-Lipschitz ResNets are dense in the set of scalar 1-Lipschitz functions on any compact domain when width and depth are allowed to grow. We also show that these networks can exactly represent scalar piecewise affine 1-Lipschitz functions. We then prove a stronger statement: by inserting norm-constrained linear maps between the residual blocks, the same density holds when the hidden width is fixed. Because every layer obeys simple norm constraints, the resulting models can be trained with off-the-shelf optimisers. This paper provides the first universal approximation guarantees for 1-Lipschitz ResNets, laying a rigorous foundation for their practical use.

[LG-123] Variance-Optimal Arm Selection: Regret Minimization and Best Arm Identification

链接: https://arxiv.org/abs/2505.11985
作者: Sabrina Khurshid,Gourab Ghatak,Mohammad Shahid Abdulla
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper focuses on selecting the arm with the highest variance from a set of K independent arms. Specifically, we focus on two settings: (i) regret setting, that penalizes the number of pulls of suboptimal arms in terms of variance, and (ii) fixed-budget \acBAI setting, that evaluates the ability of an algorithm to determine the arm with the highest variance after a fixed number of pulls. We develop a novel online algorithm called \textttUCB-VV for the regret setting and show that its upper bound on regret for bounded rewards evolves as \mathcalO\left(\logn\right) where n is the horizon. By deriving the lower bound on the regret, we show that \textttUCB-VV is order optimal. For the fixed budget \acBAI setting and propose the \textttSHVV algorithm. We show that the upper bound of the error probability of \textttSHVV evolves as \exp\left(-\fracn\log(K) H\right) , where H represents the complexity of the problem, and this rate matches the corresponding lower bound. We extend the framework from bounded distributions to sub-Gaussian distributions using a novel concentration inequality on the sample variance. Leveraging the same, we derive a concentration inequality for the empirical Sharpe ratio (SR) for sub-Gaussian distributions, which was previously unknown in the literature. Empirical simulations show that \textttUCB-VV consistently outperforms \texttt \epsilon -greedy across different sub-optimality gaps though it is surpassed by \textttVTS, which exhibits the lowest regret, albeit lacking in theoretical guarantees. We also illustrate the superior performance of \textttSHVV, for a fixed budget setting under 6 different setups against uniform sampling. Finally, we conduct a case study to empirically evaluate the performance of the \textttUCB-VV and \textttSHVV in call option trading on 100 stocks generated using \acGBM.

[LG-124] FedHQ: Hybrid Runtime Quantization for Federated Learning

链接: https://arxiv.org/abs/2505.11982
作者: Zihao Zheng,Ziyao Wang,Xiuping Cui,Maoliang Li,Jiayu Chen, Yun (Eric)Liang,Ang Li,Xiang Chen
类目: Machine Learning (cs.LG)
*备注: 5 figures and 4 tables

点击查看摘要

Abstract:Federated Learning (FL) is a decentralized model training approach that preserves data privacy but struggles with low efficiency. Quantization, a powerful training optimization technique, has been widely explored for integration into FL. However, many studies fail to consider the distinct performance attribution between particular quantization strategies, such as post-training quantization (PTQ) or quantization-aware training (QAT). As a result, existing FL quantization methods rely solely on either PTQ or QAT, optimizing for speed or accuracy while compromising the other. To efficiently accelerate FL and maintain distributed convergence accuracy across various FL settings, this paper proposes a hybrid quantitation approach combining PTQ and QAT for FL systems. We conduct case studies to validate the effectiveness of using hybrid quantization in FL. To solve the difficulty of modeling speed and accuracy caused by device and data heterogeneity, we propose a hardware-related analysis and data-distribution-related analysis to help identify the trade-off boundaries for strategy selection. Based on these, we proposed a novel framework named FedHQ to automatically adopt optimal hybrid strategy allocation for FL systems. Specifically, FedHQ develops a coarse-grained global initialization and fine-grained ML-based adjustment to ensure efficiency and robustness. Experiments show that FedHQ achieves up to 2.47x times training acceleration and up to 11.15% accuracy improvement and negligible extra overhead.

[LG-125] Accelerating Neural Network Training Along Sharp and Flat Directions

链接: https://arxiv.org/abs/2505.11972
作者: Daniyar Zakarin,Sidak Pal Singh
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent work has highlighted a surprising alignment between gradients and the top eigenspace of the Hessian – termed the Dominant subspace – during neural network training. Concurrently, there has been growing interest in the distinct roles of sharp and flat directions in the Hessian spectrum. In this work, we study Bulk-SGD, a variant of SGD that restricts updates to the orthogonal complement of the Dominant subspace. Through ablation studies, we characterize the stability properties of Bulk-SGD and identify critical hyperparameters that govern its behavior. We show that updates along the Bulk subspace, corresponding to flatter directions in the loss landscape, can accelerate convergence but may compromise stability. To balance these effects, we introduce interpolated gradient methods that unify SGD, Dom-SGD, and Bulk-SGD. Finally, we empirically connect this subspace decomposition to the Generalized Gauss-Newton and Functional Hessian terms, showing that curvature energy is largely concentrated in the Dominant subspace. Our findings suggest a principled approach to designing curvature-aware optimizers.

[LG-126] PyScrew: A Comprehensive Dataset Collection from Industrial Screw Driving Experiments

链接: https://arxiv.org/abs/2505.11925
作者: Nikolai West,Jochen Deuse
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 18 pages, 2 figures, 7 tables

点击查看摘要

Abstract:This paper presents a comprehensive collection of industrial screw driving datasets designed to advance research in manufacturing process monitoring and quality control. The collection comprises six distinct datasets with over 34,000 individual screw driving operations conducted under controlled experimental conditions, capturing the multifaceted nature of screw driving processes in plastic components. Each dataset systematically investigates specific aspects: natural thread degradation patterns through repeated use (s01), variations in surface friction conditions including contamination and surface treatments (s02), diverse assembly faults with up to 27 error types (s03-s04), and fabrication parameter variations in both upper and lower workpieces through modified injection molding settings (s05-s06). We detail the standardized experimental setup used across all datasets, including hardware specifications, process phases, and data acquisition methods. The hierarchical data model preserves the temporal and operational structure of screw driving processes, facilitating both exploratory analysis and the development of machine learning models. To maximize accessibility, we provide dual access pathways: raw data through Zenodo with a persistent DOI, and a purpose-built Python library (PyScrew) that offers consistent interfaces for data loading, preprocessing, and integration with common analysis workflows. These datasets serve diverse research applications including anomaly detection, predictive maintenance, quality control system development, feature extraction methodology evaluation, and classification of specific error conditions. By addressing the scarcity of standardized, comprehensive datasets in industrial manufacturing, this collection enables reproducible research and fair comparison of analytical approaches in an area of growing importance for industrial automation.

[LG-127] ransformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures

链接: https://arxiv.org/abs/2505.11918
作者: Zhiheng Chen,Ruofan Wu,Guanhua Fang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Code available at this https URL

点击查看摘要

Abstract:The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the under standing of pre-trained large language models. However, most recent works have been focusing on studying supervised learning topics such as in-context learning, leaving the field of unsupervised learning largely unexplored. This paper investigates the capabilities of transformers in solving Gaussian Mixture Models (GMMs), a fundamental unsupervised learning problem through the lens of statistical estimation. We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks using a shared transformer backbone. The learned models are empirically demonstrated to effectively mitigate the limitations of classical methods such as Expectation-Maximization (EM) or spectral algorithms, at the same time exhibit reasonable robustness to distribution shifts. Theoretically, we prove that transformers can approximate both the EM algorithm and a core component of spectral methods (cubic tensor power iterations). These results bridge the gap between practical success and theoretical understanding, positioning transformers as versatile tools for unsupervised learning.

[LG-128] Dynamic Perturbed Adaptive Method for Infinite Task-Conflicting Time Series

链接: https://arxiv.org/abs/2505.11902
作者: Jiang You,Xiaozhen Wang,Arben Cela
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We formulate time series tasks as input-output mappings under varying objectives, where the same input may yield different outputs. This challenges a model’s generalization and adaptability. To study this, we construct a synthetic dataset with numerous conflicting subtasks to evaluate adaptation under frequent task shifts. Existing static models consistently fail in such settings. We propose a dynamic perturbed adaptive method based on a trunk-branch architecture, where the trunk evolves slowly to capture long-term structure, and branch modules are re-initialized and updated for each task. This enables continual test-time adaptation and cross-task transfer without relying on explicit task labels. Theoretically, we show that this architecture has strictly higher functional expressivity than static models and LoRA. We also establish exponential convergence of branch adaptation under the Polyak-Lojasiewicz condition. Experiments demonstrate that our method significantly outperforms competitive baselines in complex and conflicting task environments, exhibiting fast adaptation and progressive learning capabilities.

[LG-129] Fast RoPE Attention: Combining the Polynomial Method and Fast Fourier Transform

链接: https://arxiv.org/abs/2505.11892
作者: Josh Alman,Zhao Song
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:The transformer architecture has been widely applied to many machine learning tasks. A main bottleneck in the time to perform transformer computations is a task called attention computation. [Alman and Song, NeurIPS 2023] have shown that in the bounded entry regime, there is an almost linear time algorithm to approximate the attention computation. They also proved that the bounded entry assumption is necessary for a fast algorithm assuming the popular Strong Exponential Time Hypothesis. A new version of transformer which uses position embeddings has recently been very successful. At a high level, position embedding enables the model to capture the correlations between tokens while taking into account their position in the sequence. Perhaps the most popular and effective version is Rotary Position Embedding (RoPE), which was proposed by [Su, Lu, Pan, Murtadha, Wen, and Liu, Neurocomputing 2024]. A main downside of RoPE is that it complicates the attention computation problem, so that previous techniques for designing almost linear time algorithms no longer seem to work. In this paper, we show how to overcome this issue, and give a new algorithm to compute the RoPE attention in almost linear time in the bounded entry regime. (Again, known lower bounds imply that bounded entries are necessary.) Our new algorithm combines two techniques in a novel way: the polynomial method, which was used in prior fast attention algorithms, and the Fast Fourier Transform. Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2505.11892 [cs.LG] (or arXiv:2505.11892v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2505.11892 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-130] Bridging the Reality Gap in Digital Twins with Context-Aware Physics-Guided Deep Learning

链接: https://arxiv.org/abs/2505.11847
作者: Sizhe Ma,Katherine A. Flanigan,Mario Bergés
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Submitted to ASCE Journal of Computing in Civil Engineering

点击查看摘要

Abstract:Digital twins (DTs) enable powerful predictive analytics, but persistent discrepancies between simulations and real systems–known as the reality gap–undermine their reliability. Coined in robotics, the term now applies to DTs, where discrepancies stem from context mismatches, cross-domain interactions, and multi-scale dynamics. Among these, context mismatch is pressing and underexplored, as DT accuracy depends on capturing operational context, often only partially observable. However, DTs have a key advantage: simulators can systematically vary contextual factors and explore scenarios difficult or impossible to observe empirically, informing inference and model alignment. While sim-to-real transfer like domain adaptation shows promise in robotics, their application to DTs poses two key challenges. First, unlike one-time policy transfers, DTs require continuous calibration across an asset’s lifecycle–demanding structured information flow, timely detection of out-of-sync states, and integration of historical and new data. Second, DTs often perform inverse modeling, inferring latent states or faults from observations that may reflect multiple evolving contexts. These needs strain purely data-driven models and risk violating physical consistency. Though some approaches preserve validity via reduced-order model, most domain adaptation techniques still lack such constraints. To address this, we propose a Reality Gap Analysis (RGA) module for DTs that continuously integrates new sensor data, detects misalignments, and recalibrates DTs via a query-response framework. Our approach fuses domain-adversarial deep learning with reduced-order simulator guidance to improve context inference and preserve physical consistency. We illustrate the RGA module in a structural health monitoring case study on a steel truss bridge in Pittsburgh, PA, showing faster calibration and better real-world alignment.

[LG-131] Learning on a Razors Edge: the Singularity Bias of Polynomial Neural Networks

链接: https://arxiv.org/abs/2505.11846
作者: Vahid Shahverdi,Giovanni Luca Marchetti,Kathlén Kohn
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注:

点击查看摘要

Abstract:Deep neural networks often infer sparse representations, converging to a subnetwork during the learning process. In this work, we theoretically analyze subnetworks and their bias through the lens of algebraic geometry. We consider fully-connected networks with polynomial activation functions, and focus on the geometry of the function space they parametrize, often referred to as neuromanifold. First, we compute the dimension of the subspace of the neuromanifold parametrized by subnetworks. Second, we show that this subspace is singular. Third, we argue that such singularities often correspond to critical points of the training dynamics. Lastly, we discuss convolutional networks, for which subnetworks and singularities are similarly related, but the bias does not arise.

[LG-132] On the O(fracsqrtdK1/4) Convergence Rate of AdamW Measured by ell_1 Norm

链接: https://arxiv.org/abs/2505.11840
作者: Huan Li,Yiming Dong,Zhouchen Lin
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate \frac1K\sum_k=1^KE\left[|\nabla f(x^k)|_1\right]\leq O(\frac\sqrtdCK^1/4) for AdamW measured by \ell_1 norm, where K represents the iteration number, d denotes the model dimension, and C matches the constant in the optimal convergence rate of SGD. Theoretically, we have E\left[|\nabla f(x)|_1\right]\geq\sqrt\frac2d\piE\left[|\nabla f(x)|_2\right] when each element of \nabla f(x) is generated from Gaussian distribution \mathcal N(0,1) . Empirically, our experimental results on real-world deep learning tasks reveal |\nabla f(x)|_1=\varTheta(\sqrtd)|\nabla f(x)|_2 . Both support that our convergence rate can be considered to be analogous to the optimal \frac1K\sum_k=1^KE\left[|\nabla f(x^k)|_2\right]\leq O(\fracCK^1/4) convergence rate of SGD.

[LG-133] Variational Regularized Unbalanced Optimal Transport: Single Network Least Action

链接: https://arxiv.org/abs/2505.11823
作者: Yuhao Sun,Zhenyi Zhang,Zihan Wang,Tiejun Li,Peijie Zhou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Recovering the dynamics from a few snapshots of a high-dimensional system is a challenging task in statistical physics and machine learning, with important applications in computational biology. Many algorithms have been developed to tackle this problem, based on frameworks such as optimal transport and the Schrödinger bridge. A notable recent framework is Regularized Unbalanced Optimal Transport (RUOT), which integrates both stochastic dynamics and unnormalized distributions. However, since many existing methods do not explicitly enforce optimality conditions, their solutions often struggle to satisfy the principle of least action and meet challenges to converge in a stable and reliable way. To address these issues, we propose Variational RUOT (Var-RUOT), a new framework to solve the RUOT problem. By incorporating the optimal necessary conditions for the RUOT problem into both the parameterization of the search space and the loss function design, Var-RUOT only needs to learn a scalar field to solve the RUOT problem and can search for solutions with lower action. We also examined the challenge of selecting a growth penalty function in the widely used Wasserstein-Fisher-Rao metric and proposed a solution that better aligns with biological priors in Var-RUOT. We validated the effectiveness of Var-RUOT on both simulated data and real single-cell datasets. Compared with existing algorithms, Var-RUOT can find solutions with lower action while exhibiting faster convergence and improved training stability.

[LG-134] Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment

链接: https://arxiv.org/abs/2505.11821
作者: Siliang Zeng,Quan Wei,William Brown,Oana Frunza,Yuriy Nevmyvaka,Mingyi Hong
类目: Machine Learning (cs.LG)
*备注: work in progress

点击查看摘要

Abstract:This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios, which can be naturally modeled as Markov Decision Processes (MDPs). While existing approaches often train multi-turn LLM agents with trajectory-level advantage estimation in bandit settings, they struggle with turn-level credit assignment across multiple decision steps, limiting their performance on multi-turn reasoning tasks. To address this, we introduce a fine-grained turn-level advantage estimation strategy to enable more precise credit assignment in multi-turn agent interactions. The strategy is general and can be incorporated into various RL algorithms such as Group Relative Preference Optimization (GRPO). Our experimental evaluation on multi-turn reasoning and search-based tool-use tasks with GRPO implementations highlights the effectiveness of the MDP framework and the turn-level credit assignment in advancing the multi-turn reasoning capabilities of LLM agents in complex decision-making settings. Our method achieves 100% success in tool execution and 50% accuracy in exact answer matching, significantly outperforming baselines, which fail to invoke tools and achieve only 20-30% exact match accuracy.

[LG-135] JULI: Jailbreak Large Language Models by Self-Introspection

链接: https://arxiv.org/abs/2505.11790
作者: Jesson Wang,Zhanhao Hu,David Wagner
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as necessitating access to the model weights or the generation process. Since proprietary models through API-calling do not grant users such permissions, these attacks find it challenging to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities, using a tiny plug-in block, BiasNet. JULI relies solely on the knowledge of the target LLM’s predicted token log probabilities. It can effectively jailbreak API-calling LLMs under a black-box setting and knowing only top- 5 token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.

[LG-136] Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

链接: https://arxiv.org/abs/2505.11788
作者: Seungeun Oh,Jinhyuk Kim,Jihong Park,Seung-Woo Ko,Jinho Choi,Tony Q. S. Quek,Seong-Lyun Kim
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: 14 pages, 10 figures, 2 tables; This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between SLM’s uncertainty and LLM’s rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206 \times higher token throughput by skipping 74.8% transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.

[LG-137] Multi-Order Wavelet Derivative Transform for Deep Time Series Forecasting

链接: https://arxiv.org/abs/2505.11781
作者: Ziyu Zhou,Jiaxi Hu,Qingsong Wen,James T. Kwok,Yuxuan Liang
类目: Machine Learning (cs.LG)
*备注: Preprint. Work in progress

点击查看摘要

Abstract:In deep time series forecasting, the Fourier Transform (FT) is extensively employed for frequency representation learning. However, it often struggles in capturing multi-scale, time-sensitive patterns. Although the Wavelet Transform (WT) can capture these patterns through frequency decomposition, its coefficients are insensitive to change points in time series, leading to suboptimal modeling. To mitigate these limitations, we introduce the multi-order Wavelet Derivative Transform (WDT) grounded in the WT, enabling the extraction of time-aware patterns spanning both the overall trend and subtle fluctuations. Compared with the standard FT and WT, which model the raw series, the WDT operates on the derivative of the series, selectively magnifying rate-of-change cues and exposing abrupt regime shifts that are particularly informative for time series modeling. Practically, we embed the WDT into a multi-branch framework named WaveTS, which decomposes the input series into multi-scale time-frequency coefficients, refines them via linear layers, and reconstructs them into the time domain via the inverse WDT. Extensive experiments on ten benchmark datasets demonstrate that WaveTS achieves state-of-the-art forecasting accuracy while retaining high computational efficiency.

[LG-138] LAMP: Extracting Locally Linear Decision Surfaces from LLM World Models

链接: https://arxiv.org/abs/2505.11772
作者: Ryan Chen,Youngmin Ko,Zeyu Zhang,Catherine Cho,Sunny Chung,Mauro Giuffré,Dennis L. Shung,Bradly C. Stadie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce \textbfLAMP (\textbfLinear \textbfAttribution \textbfMapping \textbfProbe), a method that shines light onto a black-box language model’s decision surface and studies how reliably a model maps its stated reasons to its predictions through a locally linear model approximating the decision surface. LAMP treats the model’s own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those weights to the model’s output. By doing so, it reveals which stated factors steer the model’s decisions, and by how much. We apply LAMP to three tasks: \textitsentiment analysis, \textitcontroversial-topic detection, and \textitsafety-prompt auditing. Across these tasks, LAMP reveals that many LLMs exhibit locally linear decision landscapes. In addition, these surfaces correlate with human judgments on explanation quality and, on a clinical case-file data set, aligns with expert assessments. Since LAMP operates without requiring access to model gradients, logits, or internal activations, it serves as a practical and lightweight framework for auditing proprietary language models, and enabling assessment of whether a model behaves consistently with the explanations it provides.

[LG-139] Permutation Randomization on Nonsmooth Nonconvex Optimization: A Theoretical and Experimental Study

链接: https://arxiv.org/abs/2505.11752
作者: Wei Zhang,Arif Hassan Zidan,Afrar Jahin,Yu Bao,Tianming Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While gradient-based optimizers that incorporate randomization often showcase superior performance on complex optimization, the theoretical foundations underlying this superiority remain insufficiently understood. A particularly pressing question has emerged: What is the role of randomization in dimension-free nonsmooth nonconvex optimization? To address this gap, we investigate the theoretical and empirical impact of permutation randomization within gradient-based optimization frameworks, using it as a representative case to explore broader implications. From a theoretical perspective, our analyses reveal that permutation randomization disrupts the shrinkage behavior of gradient-based optimizers, facilitating continuous convergence toward the global optimum given a sufficiently large number of iterations. Additionally, we prove that permutation randomization can preserve the convergence rate of the underlying optimizer. On the empirical side, we conduct extensive numerical experiments comparing permutation-randomized optimizer against three baseline methods. These experiments span tasks such as training deep neural networks with stacked architectures and optimizing noisy objective functions. The results not only corroborate our theoretical insights but also highlight the practical benefits of permutation randomization. In summary, this work delivers both rigorous theoretical justification and compelling empirical evidence for the effectiveness of permutation randomization. Our findings and evidence lay a foundation for extending analytics to encompass a wide array of randomization.

[LG-140] HOME-3: High-Order Momentum Estimator with Third-Power Gradient for Convex and Smooth Nonconvex Optimization

链接: https://arxiv.org/abs/2505.11748
作者: Wei Zhang,Arif Hassan Zidan,Afrar Jahin,Yu Bao,Tianming Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Momentum-based gradients are essential for optimizing advanced machine learning models, as they not only accelerate convergence but also advance optimizers to escape stationary points. While most state-of-the-art momentum techniques utilize lower-order gradients, such as the squared first-order gradient, there has been limited exploration of higher-order gradients, particularly those raised to powers greater than two. In this work, we introduce the concept of high-order momentum, where momentum is constructed using higher-power gradients, with a focus on the third-power of the first-order gradient as a representative case. Our research offers both theoretical and empirical support for this approach. Theoretically, we demonstrate that incorporating third-power gradients can improve the convergence bounds of gradient-based optimizers for both convex and smooth nonconvex problems. Empirically, we validate these findings through extensive experiments across convex, smooth nonconvex, and nonsmooth nonconvex optimization tasks. Across all cases, high-order momentum consistently outperforms conventional low-order momentum methods, showcasing superior performance in various optimization problems.

[LG-141] Neural Importance Sampling of Many Lights SIGGRAPH SIGGRAPH2025

链接: https://arxiv.org/abs/2505.11729
作者: Pedro Figueiredo,Qihao He,Steve Bako,Nima Khademi Kalantari
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 11 pages, 11 figures. Accepted for publication in SIGGRAPH Conference Papers '25; to be presented at SIGGRAPH 2025

点击查看摘要

Abstract:We propose a neural approach for estimating spatially varying light selection distributions to improve importance sampling in Monte Carlo rendering, particularly for complex scenes with many light sources. Our method uses a neural network to predict the light selection distribution at each shading point based on local information, trained by minimizing the KL-divergence between the learned and target distributions in an online manner. To efficiently manage hundreds or thousands of lights, we integrate our neural approach with light hierarchy techniques, where the network predicts cluster-level distributions and existing methods sample lights within clusters. Additionally, we introduce a residual learning strategy that leverages initial distributions from existing techniques, accelerating convergence during training. Our method achieves superior performance across diverse and challenging scenes.

[LG-142] Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

链接: https://arxiv.org/abs/2505.11711
作者: Sagnik Mukherjee,Lifan Yuan,Dilek Hakkani-Tur,Hao Peng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) yields substantial improvements in large language models (LLMs) downstream task performance and alignment with human values. Surprisingly, such large gains result from updating only a small subnetwork comprising just 5 percent to 30 percent of the parameters, with the rest effectively unchanged. We refer to this phenomenon as parameter update sparsity induced by RL. It is observed across all 7 widely used RL algorithms (e.g., PPO, GRPO, DPO) and all 10 LLMs from different families in our experiments. This sparsity is intrinsic and occurs without any explicit sparsity promoting regularizations or architectural constraints. Finetuning the subnetwork alone recovers the test accuracy, and, remarkably, produces a model nearly identical to the one obtained via full finetuning. The subnetworks from different random seeds, training data, and even RL algorithms show substantially greater overlap than expected by chance. Our analysis suggests that this sparsity is not due to updating only a subset of layers, instead, nearly all parameter matrices receive similarly sparse updates. Moreover, the updates to almost all parameter matrices are nearly full-rank, suggesting RL updates a small subset of parameters that nevertheless span almost the full subspaces that the parameter matrices can represent. We conjecture that the this update sparsity can be primarily attributed to training on data that is near the policy distribution, techniques that encourage the policy to remain close to the pretrained model, such as the KL regularization and gradient clipping, have limited impact.

[LG-143] Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

链接: https://arxiv.org/abs/2505.11708
作者: Diksha Goel,Kristen Moore,Jeff Wang,Minjune Kim,Thanh Thi Nguyen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies are formed and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (MDP-level) and tactical (policy-level) reasoning. At the MDP level, we model cyberattacks as a Partially Observable Markov Decision Processes (POMDPs) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving action preferences. Evaluated across CyberBattleSim environments of increasing complexity, our framework offers interpretable insights into agent behaviour at scale. Unlike previous explainable RL methods, which are often post-hoc, domain-specific, or limited in depth, our approach is both agent- and environment-agnostic, supporting use cases ranging from red-team simulation to RL policy debugging. By transforming black-box learning into actionable behavioural intelligence, our framework enables both defenders and developers to better anticipate, analyse, and respond to autonomous cyber threats.

[LG-144] Invariant Representations via Wasserstein Correlation Maximization

链接: https://arxiv.org/abs/2505.11702
作者: Keenan Eikenberry,Lizuo Liu,Yoonsang Lee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This work investigates the use of Wasserstein correlation – a normalized measure of statistical dependence based on the Wasserstein distance between a joint distribution and the product of its marginals – for unsupervised representation learning. Unlike, for example, contrastive methods, which naturally cluster classes in the latent space, we find that an (auto)encoder trained to maximize Wasserstein correlation between the input and encoded distributions instead acts as a compressor, reducing dimensionality while approximately preserving the topological and geometric properties of the input distribution. More strikingly, we show that Wasserstein correlation maximization can be used to arrive at an (auto)encoder – either trained from scratch, or else one that extends a frozen, pretrained model – that is approximately invariant to a chosen augmentation, or collection of augmentations, and that still approximately preserves the structural properties of the non-augmented input distribution. To do this, we first define the notion of an augmented encoder using the machinery of Markov-Wasserstein kernels. When the maximization objective is then applied to the augmented encoder, as opposed to the underlying, deterministic encoder, the resulting model exhibits the desired invariance properties. Finally, besides our experimental results, which show that even simple feedforward networks can be imbued with invariants or can, alternatively, be used to impart invariants to pretrained models under this training process, we additionally establish various theoretical results for optimal transport-based dependence measures. Code is available at this https URL .

[LG-145] Mollifier Layers: Enabling Efficient High-Order Derivatives in Inverse PDE Learning

链接: https://arxiv.org/abs/2505.11682
作者: Ananyae Kumar Bhartari,Vinayak Vinayak,Vivek B Shenoy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter estimation in inverse problems involving partial differential equations (PDEs) underpins modeling across scientific disciplines, especially when parameters vary in space or time. Physics-informed Machine Learning (PhiML) integrates PDE constraints into deep learning, but prevailing approaches depend on recursive automatic differentiation (autodiff), which produces inaccurate high-order derivatives, inflates memory usage, and underperforms in noisy settings. We propose Mollifier Layers, a lightweight, architecture-agnostic module that replaces autodiff with convolutional operations using analytically defined mollifiers. This reframing of derivative computation as smoothing integration enables efficient, noise-robust estimation of high-order derivatives directly from network outputs. Mollifier Layers attach at the output layer and require no architectural modifications. We compare them with three distinct architectures and benchmark performance across first-, second-, and fourth-order PDEs – including Langevin dynamics, heat diffusion, and reaction-diffusion systems – observing significant improvements in memory efficiency, training time and accuracy for parameter recovery across tasks. To demonstrate practical relevance, we apply Mollifier Layers to infer spatially varying epigenetic reaction rates from super-resolution chromatin imaging data – a real-world inverse problem with biomedical significance. Our results establish Mollifier Layers as an efficient and scalable tool for physics-constrained learning.

[LG-146] Enhancing Code Quality with Generative AI: Boosting Developer Warning Compliance

链接: https://arxiv.org/abs/2505.11677
作者: Hansen Chang,Christian DeLozier
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Programmers have long ignored warnings, especially those generated by static analysis tools, due to the potential for false-positives. In some cases, warnings may be indicative of larger issues, but programmers may not understand how a seemingly unimportant warning can grow into a vulnerability. Because these messages tend to be long and confusing, programmers tend to ignore them if they do not cause readily identifiable issues. Large language models can simplify these warnings, explain the gravity of important warnings, and suggest potential fixes to increase developer compliance with fixing warnings.

[LG-147] A Local Polyak-Lojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models

链接: https://arxiv.org/abs/2505.11664
作者: Ziqing Xu,Hancheng Min,Salma Tarmoun,Enrique Mallada,Rene Vidal
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Most prior work on the convergence of gradient descent (GD) for overparameterized neural networks relies on strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (large, spectral, balanced). Recent efforts to relax these assumptions focus on two-layer linear networks trained with the squared loss. In this work, we derive a linear convergence rate for training two-layer linear neural networks with GD for general losses and under relaxed assumptions on the step size, width, and initialization. A key challenge in deriving this result is that classical ingredients for deriving convergence rates for nonconvex problems, such as the Polyak-Łojasiewicz (PL) condition and Descent Lemma, do not hold globally for overparameterized neural networks. Here, we prove that these two conditions hold locally with local constants that depend on the weights. Then, we provide bounds on these local constants, which depend on the initialization of the weights, the current loss, and the global PL and smoothness constants of the non-overparameterized model. Based on these bounds, we derive a linear convergence rate for GD. Our convergence analysis not only improves upon prior results but also suggests a better choice for the step size, as verified through our numerical experiments.

[LG-148] UrbanMind: Urban Dynamics Prediction with Multifaceted Spatial-Temporal Large Language Models KDD2025

链接: https://arxiv.org/abs/2505.11654
作者: Yuhang Liu,Yingxue Zhang,Xin Zhang,Ling Tian,Xu Zheng,Yanhua Li,Jun Luo
类目: Machine Learning (cs.LG)
*备注: KDD 2025

点击查看摘要

Abstract:Understanding and predicting urban dynamics is crucial for managing transportation systems, optimizing urban planning, and enhancing public services. While neural network-based approaches have achieved success, they often rely on task-specific architectures and large volumes of data, limiting their ability to generalize across diverse urban scenarios. Meanwhile, Large Language Models (LLMs) offer strong reasoning and generalization capabilities, yet their application to spatial-temporal urban dynamics remains underexplored. Existing LLM-based methods struggle to effectively integrate multifaceted spatial-temporal data and fail to address distributional shifts between training and testing data, limiting their predictive reliability in real-world applications. To bridge this gap, we propose UrbanMind, a novel spatial-temporal LLM framework for multifaceted urban dynamics prediction that ensures both accurate forecasting and robust generalization. At its core, UrbanMind introduces Muffin-MAE, a multifaceted fusion masked autoencoder with specialized masking strategies that capture intricate spatial-temporal dependencies and intercorrelations among multifaceted urban dynamics. Additionally, we design a semantic-aware prompting and fine-tuning strategy that encodes spatial-temporal contextual details into prompts, enhancing LLMs’ ability to reason over spatial-temporal patterns. To further improve generalization, we introduce a test time adaptation mechanism with a test data reconstructor, enabling UrbanMind to dynamically adjust to unseen test data by reconstructing LLM-generated embeddings. Extensive experiments on real-world urban datasets across multiple cities demonstrate that UrbanMind consistently outperforms state-of-the-art baselines, achieving high accuracy and robust generalization, even in zero-shot settings.

[LG-149] Joint Graph Estimation and Signal Restoration for Robust Federated Learning

链接: https://arxiv.org/abs/2505.11648
作者: Tsutahiro Fukuhara,Junya Hara,Hiroshi Higashi,Yuichi Tanaka
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Preprint submitted to the 2025 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Istanbul, Turkey, Aug. 2025. 8 pages, 2 figures

点击查看摘要

Abstract:We propose a robust aggregation method for model parameters in federated learning (FL) under noisy communications. FL is a distributed machine learning paradigm in which a central server aggregates local model parameters from multiple clients. These parameters are often noisy and/or have missing values during data collection, training, and communication between the clients and server. This may cause a considerable drop in model accuracy. To address this issue, we learn a graph that represents pairwise relationships between model parameters of the clients during aggregation. We realize it with a joint problem of graph learning and signal (i.e., model parameters) restoration. The problem is formulated as a difference-of-convex (DC) optimization, which is efficiently solved via a proximal DC algorithm. Experimental results on MNIST and CIFAR-10 datasets show that the proposed method outperforms existing approaches by up to 2 – 5% in classification accuracy under biased data distributions and noisy conditions.

[LG-150] Accelerating Natural Gradient Descent for PINNs with Randomized Numerical Linear Algebra

链接: https://arxiv.org/abs/2505.11638
作者: Ivan Bioli,Carlo Marcati,Giancarlo Sangalli
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Natural Gradient Descent (NGD) has emerged as a promising optimization algorithm for training neural network-based solvers for partial differential equations (PDEs), such as Physics-Informed Neural Networks (PINNs). However, its practical use is often limited by the high computational cost of solving linear systems involving the Gramian matrix. While matrix-free NGD methods based on the conjugate gradient (CG) method avoid explicit matrix inversion, the ill-conditioning of the Gramian significantly slows the convergence of CG. In this work, we extend matrix-free NGD to broader classes of problems than previously considered and propose the use of Randomized Nyström preconditioning to accelerate convergence of the inner CG solver. The resulting algorithm demonstrates substantial performance improvements over existing NGD-based methods on a range of PDE problems discretized using neural networks.

[LG-151] Generalization Guarantees for Learning Branch-and-Cut Policies in Integer Programming

链接: https://arxiv.org/abs/2505.11636
作者: Hongyu Cheng,Amitabh Basu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Mixed-integer programming (MIP) provides a powerful framework for optimization problems, with Branch-and-Cut (BC) being the predominant algorithm in state-of-the-art solvers. The efficiency of BC critically depends on heuristic policies for making sequential decisions, including node selection, cut selection, and branching variable selection. While traditional solvers often employ heuristics with manually tuned parameters, recent approaches increasingly leverage machine learning, especially neural networks, to learn these policies directly from data. A key challenge is to understand the theoretical underpinnings of these learned policies, particularly their generalization performance from finite data. This paper establishes rigorous sample complexity bounds for learning BC policies where the scoring functions guiding each decision step (node, cut, branch) have a certain piecewise polynomial structure. This structure generalizes the linear models that form the most commonly deployed policies in practice and investigated recently in a foundational series of theoretical works by Balcan et al. Such piecewise polynomial policies also cover the neural network architectures (e.g., using ReLU activations) that have been the focal point of contemporary practical studies. Consequently, our theoretical framework closely reflects the models utilized by practitioners investigating machine learning within BC, offering a unifying perspective relevant to both established theory and modern empirical research in this area. Furthermore, our theory applies to quite general sequential decision making problems beyond BC.

[LG-152] he Gaussian-Multinoulli Restricted Boltzmann Machine: A Potts Model Extension of the GRBM

链接: https://arxiv.org/abs/2505.11635
作者: Nikhil Kapasi,William Whitehead,Luke Theogarajan
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures (1 figure has 2 subfigures), conference

点击查看摘要

Abstract:Many real-world tasks, from associative memory to symbolic reasoning, demand discrete, structured representations that standard continuous latent models struggle to express naturally. We introduce the Gaussian-Multinoulli Restricted Boltzmann Machine (GM-RBM), a generative energy-based model that extends the Gaussian-Bernoulli RBM (GB-RBM) by replacing binary hidden units with q -state Potts variables. This modification enables a combinatorially richer latent space and supports learning over multivalued, interpretable latent concepts. We formally derive GM-RBM’s energy function, learning dynamics, and conditional distributions, showing that it preserves tractable inference and training through contrastive divergence. Empirically, we demonstrate that GM-RBMs model complex multimodal distributions more effectively than binary RBMs, outperforming them on tasks involving analogical recall and structured memory. Our results highlight GM-RBMs as a scalable framework for discrete latent inference with enhanced expressiveness and interoperability.

[LG-153] Enhancing Network Anomaly Detection with Quantum GANs and Successive Data Injection for Multivariate Time Series

链接: https://arxiv.org/abs/2505.11631
作者: Wajdi Hammami,Soumaya Cherkaoui,Shengrui Wang
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Quantum computing may offer new approaches for advancing machine learning, including in complex tasks such as anomaly detection in network traffic. In this paper, we introduce a quantum generative adversarial network (QGAN) architecture for multivariate time-series anomaly detection that leverages variational quantum circuits (VQCs) in combination with a time-window shifting technique, data re-uploading, and successive data injection (SuDaI). The method encodes multivariate time series data as rotation angles. By integrating both data re-uploading and SuDaI, the approach maps classical data into quantum states efficiently, helping to address hardware limitations such as the restricted number of available qubits. In addition, the approach employs an anomaly scoring technique that utilizes both the generator and the discriminator output to enhance the accuracy of anomaly detection. The QGAN was trained using the parameter shift rule and benchmarked against a classical GAN. Experimental results indicate that the quantum model achieves a accuracy high along with high recall and F1-scores in anomaly detection, and attains a lower MSE compared to the classical model. Notably, the QGAN accomplishes this performance with only 80 parameters, demonstrating competitive results with a compact architecture. Tests using a noisy simulator suggest that the approach remains effective under realistic noise-prone conditions.

[LG-154] Adaptive Robust Optimization with Data-Driven Uncertainty for Enhancing Distribution System Resilience

链接: https://arxiv.org/abs/2505.11627
作者: Shuyi Chen,Shixiang Zhu,Ramteen Sioshansi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Extreme weather events are placing growing strain on electric power systems, exposing the limitations of purely reactive responses and prompting the need for proactive resilience planning. However, existing approaches often rely on simplified uncertainty models and decouple proactive and reactive decisions, overlooking their critical interdependence. This paper proposes a novel tri-level optimization framework that integrates proactive infrastructure investment, adversarial modeling of spatio-temporal disruptions, and adaptive reactive response. We construct high-probability, distribution-free uncertainty sets using conformal prediction to capture complex and data-scarce outage patterns. To solve the resulting nested decision problem, we derive a bi-level reformulation via strong duality and develop a scalable Benders decomposition algorithm. Experiments on both real and synthetic data demonstrate that our approach consistently outperforms conventional robust and two-stage methods, achieving lower worst-case losses and more efficient resource allocation, especially under tight operational constraints and large-scale uncertainty.

[LG-155] Regularity and Stability Properties of Selective SSMs with Discontinuous Gating

链接: https://arxiv.org/abs/2505.11602
作者: Nikola Zubić,Davide Scaramuzza
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 21 page, 6 theorems

点击查看摘要

Abstract:Deep Selective State-Space Models (SSMs), characterized by input-dependent, time-varying parameters, offer significant expressive power but pose challenges for stability analysis, especially with discontinuous gating signals. In this paper, we investigate the stability and regularity properties of continuous-time selective SSMs through the lens of passivity and Input-to-State Stability (ISS). We establish that intrinsic energy dissipation guarantees exponential forgetting of past states. Crucially, we prove that the unforced system dynamics possess an underlying minimal quadratic energy function whose defining matrix exhibits robust \textAUC_\textloc regularity, accommodating discontinuous gating. Furthermore, assuming a universal quadratic storage function ensures passivity across all inputs, we derive parametric LMI conditions and kernel constraints that limit gating mechanisms, formalizing “irreversible forgetting” of recurrent models. Finally, we provide sufficient conditions for global ISS, linking uniform local dissipativity to overall system robustness. Our findings offer a rigorous framework for understanding and designing stable and reliable deep selective SSMs.

[LG-156] A Training Framework for Optimal and Stable Training of Polynomial Neural Networks

链接: https://arxiv.org/abs/2505.11589
作者: Forsad Al Hossain,Tauhidur Rahman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:By replacing standard non-linearities with polynomial activations, Polynomial Neural Networks (PNNs) are pivotal for applications such as privacy-preserving inference via Homomorphic Encryption (HE). However, training PNNs effectively presents a significant challenge: low-degree polynomials can limit model expressivity, while higher-degree polynomials, crucial for capturing complex functions, often suffer from numerical instability and gradient explosion. We introduce a robust and versatile training framework featuring two synergistic innovations: 1) a novel Boundary Loss that exponentially penalizes activation inputs outside a predefined stable range, and 2) Selective Gradient Clipping that effectively tames gradient magnitudes while preserving essential Batch Normalization statistics. We demonstrate our framework’s broad efficacy by training PNNs within deep architectures composed of HE-compatible layers (e.g., linear layers, average pooling, batch normalization, as used in ResNet variants) across diverse image, audio, and human activity recognition datasets. These models consistently achieve high accuracy with low-degree polynomial activations (such as degree 2) and, critically, exhibit stable training and strong performance with polynomial degrees up to 22, where standard methods typically fail or suffer severe degradation. Furthermore, the performance of these PNNs achieves a remarkable parity, closely approaching that of their original ReLU-based counterparts. Extensive ablation studies validate the contributions of our techniques and guide hyperparameter selection. We confirm the HE-compatibility of the trained models, advancing the practical deployment of accurate, stable, and secure deep learning inference.

[LG-157] HessFormer: Hessians at Foundation Scale

链接: https://arxiv.org/abs/2505.11564
作者: Diego Granziol
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 9 pages

点击查看摘要

Abstract:Whilst there have been major advancements in the field of first order optimisation of deep learning models, where state of the art open source mixture of expert models go into the hundreds of billions of parameters, methods that rely on Hessian vector products, are still limited to run on a single GPU and thus cannot even work for models in the billion parameter range. We release a software package \textbfHessFormer, which integrates nicely with the well known Transformers package and allows for distributed hessian vector computation across a single node with multiple GPUs. Underpinning our implementation is a distributed stochastic lanczos quadrature algorithm, which we release for public consumption. Using this package we investigate the Hessian spectral density of the recent Deepseek 70 bn parameter model.

[LG-158] Policy Gradient with Second Order Momentum

链接: https://arxiv.org/abs/2505.11561
作者: Tianyu Sun
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We develop Policy Gradient with Second-Order Momentum (PG-SOM), a lightweight second-order optimisation scheme for reinforcement-learning policies. PG-SOM augments the classical REINFORCE update with two exponentially weighted statistics: a first-order gradient average and a diagonal approximation of the Hessian. By preconditioning the gradient with this curvature estimate, the method adaptively rescales each parameter, yielding faster and more stable ascent of the expected return. We provide a concise derivation, establish that the diagonal Hessian estimator is unbiased and positive-definite under mild regularity assumptions, and prove that the resulting update is a descent direction in expectation. Numerical experiments on standard control benchmarks show up to a 2.1x increase in sample efficiency and a substantial reduction in variance compared to first-order and Fisher-matrix baselines. These results indicate that even coarse second-order information can deliver significant practical gains while incurring only D memory overhead for a D-parameter policy. All code and reproducibility scripts will be made publicly available.

[LG-159] A Survey of Learning-Based Intrusion Detection Systems for In-Vehicle Network

链接: https://arxiv.org/abs/2505.11551
作者: Muzun Althunayyan,Amir Javed,Omer Rana
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Connected and Autonomous Vehicles (CAVs) enhance mobility but face cybersecurity threats, particularly through the insecure Controller Area Network (CAN) bus. Cyberattacks can have devastating consequences in connected vehicles, including the loss of control over critical systems, necessitating robust security solutions. In-vehicle Intrusion Detection Systems (IDSs) offer a promising approach by detecting malicious activities in real time. This survey provides a comprehensive review of state-of-the-art research on learning-based in-vehicle IDSs, focusing on Machine Learning (ML), Deep Learning (DL), and Federated Learning (FL) approaches. Based on the reviewed studies, we critically examine existing IDS approaches, categorising them by the types of attacks they detect - known, unknown, and combined known-unknown attacks - while identifying their limitations. We also review the evaluation metrics used in research, emphasising the need to consider multiple criteria to meet the requirements of safety-critical systems. Additionally, we analyse FL-based IDSs and highlight their limitations. By doing so, this survey helps identify effective security measures, address existing limitations, and guide future research toward more resilient and adaptive protection mechanisms, ensuring the safety and reliability of CAVs.

[LG-160] Cybersecurity threat detection based on a UEBA framework using Deep Autoencoders

链接: https://arxiv.org/abs/2505.11542
作者: Jose Fuentes,Ines Ortega-Fernandez,Nora M. Villanueva,Marta Sestelo
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:User and Entity Behaviour Analytics (UEBA) is a broad branch of data analytics that attempts to build a normal behavioural profile in order to detect anomalous events. Among the techniques used to detect anomalies, Deep Autoencoders constitute one of the most promising deep learning models on UEBA tasks, allowing explainable detection of security incidents that could lead to the leak of personal data, hijacking of systems, or access to sensitive business information. In this study, we introduce the first implementation of an explainable UEBA-based anomaly detection framework that leverages Deep Autoencoders in combination with Doc2Vec to process both numerical and textual features. Additionally, based on the theoretical foundations of neural networks, we offer a novel proof demonstrating the equivalence of two widely used definitions for fully-connected neural networks. The experimental results demonstrate the proposed framework capability to detect real and synthetic anomalies effectively generated from real attack data, showing that the models provide not only correct identification of anomalies but also explainable results that enable the reconstruction of the possible origin of the anomaly. Our findings suggest that the proposed UEBA framework can be seamlessly integrated into enterprise environments, complementing existing security systems for explainable threat detection.

[LG-161] Empirical Performance Evaluation of Lane Keeping Assist on Modern Production Vehicles

链接: https://arxiv.org/abs/2505.11534
作者: Yuhang Wang,Abdulaziz Alhuraish,Shuyi Wang,Hao Zhou
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Leveraging a newly released open dataset of Lane Keeping Assist (LKA) systems from production vehicles, this paper presents the first comprehensive empirical analysis of real-world LKA performance. Our study yields three key findings: (i) LKA failures can be systematically categorized into perception, planning, and control errors. We present representative examples of each failure mode through in-depth analysis of LKA-related CAN signals, enabling both justification of the failure mechanisms and diagnosis of when and where each module begins to degrade; (ii) LKA systems tend to follow a fixed lane-centering strategy, often resulting in outward drift that increases linearly with road curvature, whereas human drivers proactively steer slightly inward on similar curved segments; (iii) We provide the first statistical summary and distribution analysis of environmental and road conditions under LKA failures, identifying with statistical significance that faded lane markings, low pavement laneline contrast, and sharp curvature are the most dominant individual factors, along with critical combinations that substantially increase failure likelihood. Building on these insights, we propose a theoretical model that integrates road geometry, speed limits, and LKA steering capability to inform infrastructure design. Additionally, we develop a machine learning-based model to assess roadway readiness for LKA deployment, offering practical tools for safer infrastructure planning, especially in rural areas. This work highlights key limitations of current LKA systems and supports the advancement of safer and more reliable autonomous driving technologies.

[LG-162] DynamicDTA: Drug-Target Binding Affinity Prediction Using Dynamic Descriptors and Graph Representation

链接: https://arxiv.org/abs/2505.11529
作者: Dan Luo,Jinyu Zhou,Le Xu,Sisi Yuan,Xuan Lin
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted for publication at Interdisciplinary Sciences: Computational Life Sciences, 2025

点击查看摘要

Abstract:Predicting drug-target binding affinity (DTA) is essential for identifying potential therapeutic candidates in drug discovery. However, most existing models rely heavily on static protein structures, often overlooking the dynamic nature of proteins, which is crucial for capturing conformational flexibility that will be beneficial for protein binding interactions. We introduce DynamicDTA, an innovative deep learning framework that incorporates static and dynamic protein features to enhance DTA prediction. The proposed DynamicDTA takes three types of inputs, including drug sequence, protein sequence, and dynamic descriptors. A molecular graph representation of the drug sequence is generated and subsequently processed through graph convolutional network, while the protein sequence is encoded using dilated convolutions. Dynamic descriptors, such as root mean square fluctuation, are processed through a multi-layer perceptron. These embedding features are fused with static protein features using cross-attention, and a tensor fusion network integrates all three modalities for DTA prediction. Extensive experiments on three datasets demonstrate that DynamicDTA achieves by at least 3.4% improvement in RMSE score with comparison to seven state-of-the-art baseline methods. Additionally, predicting novel drugs for Human Immunodeficiency Virus Type 1 and visualizing the docking complexes further demonstrates the reliability and biological relevance of DynamicDTA.

[LG-163] PRIME: Physics-Related Intelligent Mixture of Experts for Transistor Characteristics Prediction

链接: https://arxiv.org/abs/2505.11523
作者: Zhenxing Dou,Yijiao Wang,Tao Zou,Zhiwei Chen,Fei Liu,Peng Wang,Weisheng Zhao
类目: Machine Learning (cs.LG)
*备注: 8 pages, 6figures

点击查看摘要

Abstract:In recent years, machine learning has been extensively applied to data prediction during process ramp-up, with a particular focus on transistor characteristics for circuit design and manufacture. However, capturing the nonlinear current response across multiple operating regions remains a challenge for neural networks. To address such challenge, a novel machine learning framework, PRIME (Physics-Related Intelligent Mixture of Experts), is proposed to capture and integrate complex regional characteristics. In essence, our framework incorporates physics-based knowledge with data-driven intelligence. By leveraging a dynamic weighting mechanism in its gating network, PRIME adaptively activates the suitable expert model based on distinct input data features. Extensive evaluations are conducted on various gate-all-around (GAA) structures to examine the effectiveness of PRIME and considerable improvements (60%-84%) in prediction accuracy are shown over state-of-the-art models.

[LG-164] Machine learning the first stage in 2SLS: Practical guidance from bias decomposition and simulation

链接: https://arxiv.org/abs/2505.13422
作者: Connor Lennon,Edward Rubin,Glen Waddell
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine learning (ML) primarily evolved to solve “prediction problems.” The first stage of two-stage least squares (2SLS) is a prediction problem, suggesting potential gains from ML first-stage assistance. However, little guidance exists on when ML helps 2SLS \unicodex2014 or when it hurts. We investigate the implications of inserting ML into 2SLS, decomposing the bias into three informative components. Mechanically, ML-in-2SLS procedures face issues common to prediction and causal-inference settings \unicodex2014 and their interaction. Through simulation, we show linear ML methods (e.g., post-Lasso) work well, while nonlinear methods (e.g., random forests, neural nets) generate substantial bias in second-stage estimates \unicodex2014 potentially exceeding the bias of endogenous OLS.

[LG-165] Minimum-Excess-Work Guidance

链接: https://arxiv.org/abs/2505.13375
作者: Christopher Kolloff,Tobias Höppe,Emmanouil Angelis,Mathias Jacob Schreiner,Stefan Bauer,Andrea Dittadi,Simon Olsson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 30 pages, 18 figures

点击查看摘要

Abstract:We propose a regularization framework inspired by thermodynamic work for guiding pre-trained probability flow generative models (e.g., continuous normalizing flows or diffusion models) by minimizing excess work, a concept rooted in statistical mechanics and with strong conceptual connections to optimal transport. Our approach enables efficient guidance in sparse-data regimes common to scientific applications, where only limited target samples or partial density constraints are available. We introduce two strategies: Path Guidance for sampling rare transition states by concentrating probability mass on user-defined subsets, and Observable Guidance for aligning generated distributions with experimental observables while preserving entropy. We demonstrate the framework’s versatility on a coarse-grained protein model, guiding it to sample transition configurations between folded/unfolded states and correct systematic biases using experimental data. The method bridges thermodynamic principles with modern generative architectures, offering a principled, efficient, and physics-inspired alternative to standard fine-tuning in data-scarce domains. Empirical results highlight improved sample efficiency and bias reduction, underscoring its applicability to molecular simulations and beyond.

[LG-166] Smoothed SGD for quantiles: Bahadur representation and Gaussian approximation

链接: https://arxiv.org/abs/2505.13299
作者: Likai Chen,Georg Keilbar,Wei Biao Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This paper considers the estimation of quantiles via a smoothed version of the stochastic gradient descent (SGD) algorithm. By smoothing the score function in the conventional SGD quantile algorithm, we achieve monotonicity in the quantile level in that the estimated quantile curves do not cross. We derive non-asymptotic tail probability bounds for the smoothed SGD quantile estimate both for the case with and without Polyak-Ruppert averaging. For the latter, we also provide a uniform Bahadur representation and a resulting Gaussian approximation result. Numerical studies show good finite sample behavior for our theoretical results.

[LG-167] Conformalized Decision Risk Assessment

链接: https://arxiv.org/abs/2505.13243
作者: Wenbin Zhou,Agni Orfanoudaki,Shixiang Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 36 pages, 17 figures

点击查看摘要

Abstract:High-stakes decisions in domains such as healthcare, energy, and public policy are often made by human experts using domain knowledge and heuristics, yet are increasingly supported by predictive and optimization-based tools. A dominant approach in operations research is the predict-then-optimize paradigm, where a predictive model estimates uncertain inputs, and an optimization model recommends a decision. However, this approach often lacks interpretability and can fail under distributional uncertainty – particularly when the outcome distribution is multi-modal or complex – leading to brittle or misleading decisions. In this paper, we introduce CREDO, a novel framework that quantifies, for any candidate decision, a distribution-free upper bound on the probability that the decision is suboptimal. By combining inverse optimization geometry with conformal prediction and generative modeling, CREDO produces risk certificates that are both statistically rigorous and practically interpretable. This framework enables human decision-makers to audit and validate their own decisions under uncertainty, bridging the gap between algorithmic tools and real-world judgment.

[LG-168] Diffusion Models with Double Guidance: Generate with aggregated datasets

链接: https://arxiv.org/abs/2505.13213
作者: Yanfeng Yang,Kenji Fukumizu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Creating large-scale datasets for training high-performance generative models is often prohibitively expensive, especially when associated attributes or annotations must be provided. As a result, merging existing datasets has become a common strategy. However, the sets of attributes across datasets are often inconsistent, and their naive concatenation typically leads to block-wise missing conditions. This presents a significant challenge for conditional generative modeling when the multiple attributes are used jointly as conditions, thereby limiting the model’s controllability and applicability. To address this issue, we propose a novel generative approach, Diffusion Model with Double Guidance, which enables precise conditional generation even when no training samples contain all conditions simultaneously. Our method maintains rigorous control over multiple conditions without requiring joint annotations. We demonstrate its effectiveness in molecular and image generation tasks, where it outperforms existing baselines both in alignment with target conditional distributions and in controllability under missing condition settings.

[LG-169] A Malliavin-Gamma calculus approach to Score Based Diffusion Generative models for random fields

链接: https://arxiv.org/abs/2505.13189
作者: Giacomo Greco
类目: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages

点击查看摘要

Abstract:We adopt a Gamma and Malliavin Calculi point of view in order to generalize Score-based diffusion Generative Models (SGMs) to an infinite-dimensional abstract Hilbertian setting. Particularly, we define the forward noising process using Dirichlet forms associated to the Cameron-Martin space of Gaussian measures and Wiener chaoses; whereas by relying on an abstract time-reversal formula, we show that the score function is a Malliavin derivative and it corresponds to a conditional expectation. This allows us to generalize SGMs to the infinite-dimensional setting. Moreover, we extend existing finite-dimensional entropic convergence bounds to this Hilbertian setting by highlighting the role played by the Cameron-Martin norm in the Fisher information of the data distribution. Lastly, we specify our discussion for spherical random fields, considering as source of noise a Whittle-Matérn random spherical field.

[LG-170] Attention-based clustering

链接: https://arxiv.org/abs/2505.13112
作者: Rodrigo Maulen-Soto(SU),Claire Boyer,Pierre Marion(EPFL)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers have emerged as a powerful neural network architecture capable of tackling a wide range of learning tasks. In this work, we provide a theoretical analysis of their ability to automatically extract structure from data in an unsupervised setting. In particular, we demonstrate their suitability for clustering when the input data is generated from a Gaussian mixture model. To this end, we study a simplified two-head attention layer and define a population risk whose minimization with unlabeled data drives the head parameters to align with the true mixture centroids.

[LG-171] Universal Semantic Disentangled Privacy-preserving Speech Representation Learning INTERSPEECH2025

链接: https://arxiv.org/abs/2505.13085
作者: Biel Tura Vecino,Subhadeep Maji,Aravind Varier,Antonio Bonafonte,Ivan Valles,Michael Owen,Leif Radel,Grant Strimmel,Seyi Feyisetan,Roberto Barra Chicote,Ariya Rastrow,Constantinos Papayiannis,Volker Leutnant,Trevor Wood
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Accepted at Interspeech 2025

点击查看摘要

Abstract:The use of audio recordings of human speech to train LLMs poses privacy concerns due to these models’ potential to generate outputs that closely resemble artifacts in the training data. In this study, we propose a speaker privacy-preserving representation learning method through the Universal Speech Codec (USC), a computationally efficient encoder-decoder model that disentangles speech into: \textit(i) privacy-preserving semantically rich representations, capturing content and speech paralinguistics, and \textit(ii) residual acoustic and speaker representations that enables high-fidelity reconstruction. Extensive evaluations presented show that USC’s semantic representation preserves content, prosody, and sentiment, while removing potentially identifiable speaker attributes. Combining both representations, USC achieves state-of-the-art speech reconstruction. Additionally, we introduce an evaluation methodology for measuring privacy-preserving properties, aligning with perceptual tests. We compare USC against other codecs in the literature and demonstrate its effectiveness on privacy-preserving representation learning, illustrating the trade-offs of speaker anonymization, paralinguistics retention and content preservation in the learned semantic representations. Audio samples are shared in \hrefthis https URLthis https URL .

[LG-172] Simplicity is Key: An Unsupervised Pretraining Approach for Sparse Radio Channels

链接: https://arxiv.org/abs/2505.13055
作者: Jonathan Ott,Maximilian Stahlke,Tobias Feigl,Bjoern M. Eskofier,Christopher Mutschler
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:We introduce the Sparse pretrained Radio Transformer (SpaRTran), an unsupervised representation learning approach based on the concept of compressed sensing for radio channels. Our approach learns embeddings that focus on the physical properties of radio propagation, to create the optimal basis for fine-tuning on radio-based downstream tasks. SpaRTran uses a sparse gated autoencoder that induces a simplicity bias to the learned representations, resembling the sparse nature of radio propagation. For signal reconstruction, it learns a dictionary that holds atomic features, which increases flexibility across signal waveforms and spatiotemporal signal patterns. Our experiments show that SpaRTran reduces errors by up to 85 % compared to state-of-the-art methods when fine-tuned on radio fingerprinting, a challenging downstream task. In addition, our method requires less pretraining effort and offers greater flexibility, as we train it solely on individual radio signals. SpaRTran serves as an excellent base model that can be fine-tuned for various radio-based downstream tasks, effectively reducing the cost for labeling. In addition, it is significantly more versatile than existing methods and demonstrates superior generalization.

[LG-173] Model Selection for Gaussian-gated Gaussian Mixture of Experts Using Dendrograms of Mixing Measures

链接: https://arxiv.org/abs/2505.13052
作者: Tuan Thai,TrungTin Nguyen,Dat Do,Nhat Ho,Christopher Drovandi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Mixture of Experts (MoE) models constitute a widely utilized class of ensemble learning approaches in statistics and machine learning, known for their flexibility and computational efficiency. They have become integral components in numerous state-of-the-art deep neural network architectures, particularly for analyzing heterogeneous data across diverse domains. Despite their practical success, the theoretical understanding of model selection, especially concerning the optimal number of mixture components or experts, remains limited and poses significant challenges. These challenges primarily stem from the inclusion of covariates in both the Gaussian gating functions and expert networks, which introduces intrinsic interactions governed by partial differential equations with respect to their parameters. In this paper, we revisit the concept of dendrograms of mixing measures and introduce a novel extension to Gaussian-gated Gaussian MoE models that enables consistent estimation of the true number of mixture components and achieves the pointwise optimal convergence rate for parameter estimation in overfitted scenarios. Notably, this approach circumvents the need to train and compare a range of models with varying numbers of components, thereby alleviating the computational burden, particularly in high-dimensional or deep neural network settings. Experimental results on synthetic data demonstrate the effectiveness of the proposed method in accurately recovering the number of experts. It outperforms common criteria such as the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood, while achieving optimal convergence rates for parameter estimation and accurately approximating the regression function.

[LG-174] he role of data partitioning on the performance of EEG-based deep learning models in supervised cross-subject analysis: a preliminary study

链接: https://arxiv.org/abs/2505.13021
作者: Federico Del Pup,Andrea Zanola,Louis Fabrice Tshimanga,Alessandra Bertoldo,Livio Finos,Manfredo Atzori
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted for possible publication. GitHub repository: see this https URL

点击查看摘要

Abstract:Deep learning is significantly advancing the analysis of electroencephalography (EEG) data by effectively discovering highly nonlinear patterns within the signals. Data partitioning and cross-validation are crucial for assessing model performance and ensuring study comparability, as they can produce varied results and data leakage due to specific signal properties (e.g., biometric). Such variability leads to incomparable studies and, increasingly, overestimated performance claims, which are detrimental to the field. Nevertheless, no comprehensive guidelines for proper data partitioning and cross-validation exist in the domain, nor is there a quantitative evaluation of their impact on model accuracy, reliability, and generalizability. To assist researchers in identifying optimal experimental strategies, this paper thoroughly investigates the role of data partitioning and cross-validation in evaluating EEG deep learning models. Five cross-validation settings are compared across three supervised cross-subject classification tasks (BCI, Parkinson’s, and Alzheimer’s disease detection) and four established architectures of increasing complexity (ShallowConvNet, EEGNet, DeepConvNet, and Temporal-based ResNet). The comparison of over 100,000 trained models underscores, first, the importance of using subject-based cross-validation strategies for evaluating EEG deep learning models, except when within-subject analyses are acceptable (e.g., BCI). Second, it highlights the greater reliability of nested approaches (N-LNSO) compared to non-nested counterparts, which are prone to data leakage and favor larger models overfitting to validation data. In conclusion, this work provides EEG deep learning researchers with an analysis of data partitioning and cross-validation and offers guidelines to avoid data leakage, currently undermining the domain with potentially overestimated performance claims.

[LG-175] Asymptotic Performance of Time-Varying Bayesian Optimization

链接: https://arxiv.org/abs/2505.13012
作者: Anthony Bardou,Patrick Thiran
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time-Varying Bayesian Optimization (TVBO) is the go-to framework for optimizing a time-varying black-box objective function that may be noisy and expensive to evaluate. Is it possible for the instantaneous regret of a TVBO algorithm to vanish asymptotically, and if so, when? We answer this question of great theoretical importance by providing algorithm-independent lower regret bounds and upper regret bounds for TVBO algorithms, from which we derive sufficient conditions for a TVBO algorithm to have the no-regret property. Our analysis covers all major classes of stationary kernel functions.

[LG-176] Spline Dimensional Decomposition with Interpolation-based Optimal Knot Selection for Stochastic Dynamic Analysis

链接: https://arxiv.org/abs/2505.12879
作者: Yeonsu Kim,Junhan Lee,John T. Hwang,Bingran Wang,Dongjin Lee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 28 pages, 15 figures

点击查看摘要

Abstract:Forward uncertainty quantification in dynamic systems is challenging due to non-smooth or locally oscillating nonlinear behaviors. Spline dimensional decomposition (SDD) effectively addresses such nonlinearity by partitioning input coordinates via knot placement, yet its accuracy is highly sensitive to the location of internal knots. Optimizing knots through sequential quadratic programming can be effective, yet the optimization process becomes computationally intense. We propose a computationally efficient, interpolation-based method for optimal knot selection in SDD. The method involves three steps: (1) interpolating input-output profiles, (2) defining subinterval-based reference regions, and (3) selecting optimal knot locations at maximum gradient points within each region. The resulting knot vector is then applied to SDD for accurate approximation of non-smooth and locally oscillating responses. A modal analysis of a lower control arm demonstrates that SDD with the proposed knot selection achieves higher accuracy than SDD with uniformly or randomly spaced knots, and also a Gaussian process surrogate model. The proposed SDD exhibits the lowest relative variance error (2.89%), compared to SDD with uniformly spaced knots (12.310%), randomly spaced knots (15.274%), and Gaussian process (5.319%) in the first natural frequency distribution. All surrogate models are constructed using the same 401 simulation datasets, and the relative errors are evaluated against a 2000-sample Monte Carlo simulation. The scalability and applicability of proposed method are demonstrated through stochastic and reliability analyses of mathematical functions (N=1, 3) and a lower control arm system (N=10). The results confirm that both second-moment statistics and reliability estimates can be accurately achieved with only a few hundred function evaluations or finite element simulations.

[LG-177] Causality-Inspired Robustness for Nonlinear Models via Representation Learning

链接: https://arxiv.org/abs/2505.12868
作者: Marin Šola,Peter Bühlmann,Xinwei Shen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Distributional robustness is a central goal of prediction algorithms due to the prevalent distribution shifts in real-world data. The prediction model aims to minimize the worst-case risk among a class of distributions, a.k.a., an uncertainty set. Causality provides a modeling framework with a rigorous robustness guarantee in the above sense, where the uncertainty set is data-driven rather than pre-specified as in traditional distributional robustness optimization. However, current causality-inspired robustness methods possess finite-radius robustness guarantees only in the linear settings, where the causal relationships among the covariates and the response are linear. In this work, we propose a nonlinear method under a causal framework by incorporating recent developments in identifiable representation learning and establish a distributional robustness guarantee. To our best knowledge, this is the first causality-inspired robustness method with such a finite-radius robustness guarantee in nonlinear settings. Empirical validation of the theoretical findings is conducted on both synthetic data and real-world single-cell data, also illustrating that finite-radius robustness is crucial.

[LG-178] A Comprehensive Benchmarking Platform for Deep Generative Models in Molecular Design

链接: https://arxiv.org/abs/2505.12848
作者: Adarsh Singh
类目: Atomic Physics (physics.atom-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The development of novel pharmaceuticals represents a significant challenge in modern science, with substantial costs and time investments. Deep generative models have emerged as promising tools for accelerating drug discovery by efficiently exploring the vast chemical space. However, this rapidly evolving field lacks standardized evaluation protocols, impeding fair comparison between approaches. This research presents an extensive analysis of the Molecular Sets (MOSES) platform, a comprehensive benchmarking framework designed to standardize evaluation of deep generative models in molecular design. Through rigorous assessment of multiple generative architectures, including recurrent neural networks, variational autoencoders, and generative adversarial networks, we examine their capabilities in generating valid, unique, and novel molecular structures while maintaining specific chemical properties. Our findings reveal that different architectures exhibit complementary strengths across various metrics, highlighting the complex trade-offs between exploration and exploitation in chemical space. This study provides detailed insights into the current state of the art in molecular generation and establishes a foundation for future advancements in AI-driven drug discovery.

[LG-179] sting Identifiability and Transportability with Observational and Experimental Data

链接: https://arxiv.org/abs/2505.12801
作者: Konstantina Lelova,Gregory F. Cooper,Sofia Triantafillou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transporting causal information learned from experiments in one population to another is a critical challenge in clinical research and decision-making. Causal transportability uses causal graphs to model differences between the source and target populations and identifies conditions under which causal effects learned from experiments can be reused in a different population. Similarly, causal identifiability identifies conditions under which causal effects can be estimated from observational data. However, these approaches rely on knowing the causal graph, which is often unavailable in real-world settings. In this work, we propose a Bayesian method for assessing whether Z-specific (conditional) causal effects are both identifiable and transportable, without knowing the causal graph. Our method combines experimental data from the source population with observational data from the target population to compute the probability that a causal effect is both identifiable from observational data and transportable. When this holds, we leverage both observational data from the target domain and experimental data from the source domain to obtain an unbiased, efficient estimator of the causal effect in the target population. Using simulations, we demonstrate that our method correctly identifies transportable causal effects and improves causal effect estimation.

[LG-180] Accelerated Markov Chain Monte Carlo Algorithms on Discrete States

链接: https://arxiv.org/abs/2505.12599
作者: Bohan Zhou,Shu Liu,Xinzhe Zuo,Wuchen Li
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We propose a class of discrete state sampling algorithms based on Nesterov’s accelerated gradient method, which extends the classical Metropolis-Hastings (MH) algorithm. The evolution of the discrete states probability distribution governed by MH can be interpreted as a gradient descent direction of the Kullback–Leibler (KL) divergence, via a mobility function and a score function. Specifically, this gradient is defined on a probability simplex equipped with a discrete Wasserstein-2 metric with a mobility function. This motivates us to study a momentum-based acceleration framework using damped Hamiltonian flows on the simplex set, whose stationary distribution matches the discrete target distribution. Furthermore, we design an interacting particle system to approximate the proposed accelerated sampling dynamics. The extension of the algorithm with a general choice of potentials and mobilities is also discussed. In particular, we choose the accelerated gradient flow of the relative Fisher information, demonstrating the advantages of the algorithm in estimating discrete score functions without requiring the normalizing constant and keeping positive probabilities. Numerical examples, including sampling on a Gaussian mixture supported on lattices or a distribution on a hypercube, demonstrate the effectiveness of the proposed discrete-state sampling algorithm.

[LG-181] Stacked conformal prediction

链接: https://arxiv.org/abs/2505.12578
作者: Paulo C. Marques F
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:We consider the conformalization of a stacked ensemble of predictive models, showing that the potentially simple form of the meta-learner at the top of the stack enables a procedure with manageable computational cost that achieves approximate marginal validity without requiring the use of a separate calibration sample. Empirical results indicate that the method compares favorably to a standard inductive alternative.

[LG-182] Hamiltonian Descent Algorithms for Optimization: Accelerated Rates via Randomized Integration Time

链接: https://arxiv.org/abs/2505.12553
作者: Qiang Fu,Andre Wibisono
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the Hamiltonian flow for optimization (HF-opt), which simulates the Hamiltonian dynamics for some integration time and resets the velocity to 0 to decrease the objective function; this is the optimization analogue of the Hamiltonian Monte Carlo algorithm for sampling. For short integration time, HF-opt has the same convergence rates as gradient descent for minimizing strongly and weakly convex functions. We show that by randomizing the integration time in HF-opt, the resulting randomized Hamiltonian flow (RHF) achieves accelerated convergence rates in continuous time, similar to the rates for the accelerated gradient flow. We study a discrete-time implementation of RHF as the randomized Hamiltonian gradient descent (RHGD) algorithm. We prove that RHGD achieves the same accelerated convergence rates as Nesterov’s accelerated gradient descent (AGD) for minimizing smooth strongly and weakly convex functions. We provide numerical experiments to demonstrate that RHGD is competitive with classical accelerated methods such as AGD across all settings and outperforms them in certain regimes.

[LG-183] Nonlinear Laplacians: Tunable principal component analysis under directional prior information

链接: https://arxiv.org/abs/2505.12528
作者: Yuxin Ma,Dmitriy Kunisky
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 54 pages, 5 figures

点击查看摘要

Abstract:We introduce a new family of algorithms for detecting and estimating a rank-one signal from a noisy observation under prior information about that signal’s direction, focusing on examples where the signal is known to have entries biased to be positive. Given a matrix observation \mathbfY , our algorithms construct a nonlinear Laplacian, another matrix of the form \mathbfY + \mathrmdiag(\sigma(\mathbfY\mathbf1)) for a nonlinear \sigma: \mathbbR \to \mathbbR , and examine the top eigenvalue and eigenvector of this matrix. When \mathbfY is the (suitably normalized) adjacency matrix of a graph, our approach gives a class of algorithms that search for unusually dense subgraphs by computing a spectrum of the graph “deformed” by the degree profile \mathbfY\mathbf1 . We study the performance of such algorithms compared to direct spectral algorithms (the case \sigma = 0 ) on models of sparse principal component analysis with biased signals, including the Gaussian planted submatrix problem. For such models, we rigorously characterize the critical threshold strength of rank-one signal, as a function of the nonlinearity \sigma , at which an outlier eigenvalue appears in the spectrum of a nonlinear Laplacian. While identifying the \sigma that minimizes this critical signal strength in closed form seems intractable, we explore three approaches to design \sigma numerically: exhaustively searching over simple classes of \sigma , learning \sigma from datasets of problem instances, and tuning \sigma using black-box optimization of the critical signal strength. We find both theoretically and empirically that, if \sigma is chosen appropriately, then nonlinear Laplacian spectral algorithms substantially outperform direct spectral algorithms, while avoiding the complexity of broader classes of algorithms like approximate message passing or general first order methods.

[LG-184] Efficient Implementation of Gaussian Process Regression Accelerated Saddle Point Searches with Application to Molecular Reactions

链接: https://arxiv.org/abs/2505.12519
作者: Rohit Goswami(1),Maxim Masterov(2),Satish Kamath(2),Alejandro Peña-Torres(1),Hannes Jónsson(1) ((1) Science Institute and Faculty of Physical Sciences, University of Iceland, Reykjavík, Iceland, (2) SURF, Amsterdam, The Netherlands)
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:The task of locating first order saddle points on high-dimensional surfaces describing the variation of energy as a function of atomic coordinates is an essential step for identifying the mechanism and estimating the rate of thermally activated events within the harmonic approximation of transition state theory. When combined directly with electronic structure calculations, the number of energy and atomic force evaluations needed for convergence is a primary issue. Here, we describe an efficient implementation of Gaussian process regression (GPR) acceleration of the minimum mode following method where a dimer is used to estimate the lowest eigenmode of the Hessian. A surrogate energy surface is constructed and updated after each electronic structure calculation. The method is applied to a test set of 500 molecular reactions previously generated by Hermez and coworkers [J. Chem. Theory Comput. 18, 6974 (2022)]. An order of magnitude reduction in the number of electronic structure calculations needed to reach the saddle point configurations is obtained by using the GPR compared to the dimer method. Despite the wide range in stiffness of the molecular degrees of freedom, the calculations are carried out using Cartesian coordinates and are found to require similar number of electronic structure calculations as an elaborate internal coordinate method implemented in the Sella software package. The present implementation of the GPR surrogate model in C++ is efficient enough for the wall time of the saddle point searches to be reduced in 3 out of 4 cases even though the calculations are carried out at a low Hartree-Fock level.

[LG-185] Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables

链接: https://arxiv.org/abs/2505.12473
作者: Yu Gui,Cong Ma,Zongming Ma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Multi-modal contrastive learning as a self-supervised representation learning technique has achieved great success in foundation model training, such as CLIP~\citepradford2021learning. In this paper, we study the theoretical properties of the learned representations from multi-modal contrastive learning beyond linear representations and specific data distributions. Our analysis reveals that, enabled by temperature optimization, multi-modal contrastive learning not only maximizes mutual information between modalities but also adapts to intrinsic dimensions of data, which can be much lower than user-specified dimensions for representation vectors. Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional and informative representations, bridging theoretical insights and practical performance.

[LG-186] Wasserstein Barycenter Gaussian Process based Bayesian Optimization

链接: https://arxiv.org/abs/2505.12471
作者: Antonio Candelieri,Andrea Ponti,Francesco Archetti
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Gaussian Process based Bayesian Optimization is a widely applied algorithm to learn and optimize under uncertainty, well-known for its sample efficiency. However, recently – and more frequently – research studies have empirically demonstrated that the Gaussian Process fitting procedure at its core could be its most relevant weakness. Fitting a Gaussian Process means tuning its kernel’s hyperparameters to a set of observations, but the common Maximum Likelihood Estimation technique, usually appropriate for learning tasks, has shown different criticalities in Bayesian Optimization, making theoretical analysis of this algorithm an open challenge. Exploiting the analogy between Gaussian Processes and Gaussian Distributions, we present a new approach which uses a prefixed set of hyperparameters values to fit as many Gaussian Processes and then combines them into a unique model as a Wasserstein Barycenter of Gaussian Processes. We considered both “easy” test problems and others known to undermine the \textitvanilla Bayesian Optimization algorithm. The new method, namely Wasserstein Barycenter Gausssian Process based Bayesian Optimization (WBGP-BO), resulted promising and able to converge to the optimum, contrary to vanilla Bayesian Optimization, also on the most “tricky” test problems.

[LG-187] High-Dimensional Dynamic Covariance Models with Random Forests

链接: https://arxiv.org/abs/2505.12444
作者: Shuguang Yu,Fan Zhou,Yingjie Zhang,Ziqi Chen,Hongtu Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel nonparametric method for estimating high-dimensional dynamic covariance matrices with multiple conditioning covariates, leveraging random forests and supported by robust theoretical guarantees. Unlike traditional static methods, our dynamic nonparametric covariance models effectively capture distributional heterogeneity. Furthermore, unlike kernel-smoothing methods, which are restricted to a single conditioning covariate, our approach accommodates multiple covariates in a fully nonparametric framework. To the best of our knowledge, this is the first method to use random forests for estimating high-dimensional dynamic covariance matrices. In high-dimensional settings, we establish uniform consistency theory, providing nonasymptotic error rates and model selection properties, even when the response dimension grows sub-exponentially with the sample size. These results hold uniformly across a range of conditioning variables. The method’s effectiveness is demonstrated through simulations and a stock dataset analysis, highlighting its ability to model complex dynamics in high-dimensional scenarios.

[LG-188] raining Latent Diffusion Models with Interacting Particle Algorithms

链接: https://arxiv.org/abs/2505.12412
作者: Tim Y. J. Wang,Juan Kuntz,O. Deniz Akyildiz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel particle-based algorithm for end-to-end training of latent diffusion models. We reformulate the training task as minimizing a free energy functional and obtain a gradient flow that does so. By approximating the latter with a system of interacting particles, we obtain the algorithm, which we underpin it theoretically by providing error guarantees. The novel algorithm compares favorably in experiments with previous particle-based methods and variational inference analogues.

[LG-189] Efficient Optimization with Orthogonality Constraint: a Randomized Riemannian Submanifold Method ICML2025

链接: https://arxiv.org/abs/2505.12378
作者: Andi Han,Pierre-Louis Poirion,Akiko Takeda
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICML 2025

点击查看摘要

Abstract:Optimization with orthogonality constraints frequently arises in various fields such as machine learning. Riemannian optimization offers a powerful framework for solving these problems by equipping the constraint set with a Riemannian manifold structure and performing optimization intrinsically on the manifold. This approach typically involves computing a search direction in the tangent space and updating variables via a retraction operation. However, as the size of the variables increases, the computational cost of the retraction can become prohibitively high, limiting the applicability of Riemannian optimization to large-scale problems. To address this challenge and enhance scalability, we propose a novel approach that restricts each update on a random submanifold, thereby significantly reducing the per-iteration complexity. We introduce two sampling strategies for selecting the random submanifolds and theoretically analyze the convergence of the proposed methods. We provide convergence results for general nonconvex functions and functions that satisfy Riemannian Polyak-Lojasiewicz condition as well as for stochastic optimization settings. Additionally, we demonstrate how our approach can be generalized to quotient manifolds derived from the orthogonal manifold. Extensive experiments verify the benefits of the proposed method, across a wide variety of problems.

[LG-190] rustworthy Image Super-Resolution via Generative Pseudoinverse

链接: https://arxiv.org/abs/2505.12375
作者: Andreas Floros,Seyed-Mohsen Moosavi-Dezfooli,Pier Luigi Dragotti
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the problem of trustworthy image restoration, taking the form of a constrained optimization over the prior density. To this end, we develop generative models for the task of image super-resolution that respect the degradation process and that can be made asymptotically consistent with the low-resolution measurements, outperforming existing methods by a large margin in that respect.

[LG-191] LaPON: A Lagranges-mean-value-theorem-inspired operator network for solving PDEs and its application on NSE

链接: https://arxiv.org/abs/2505.12360
作者: Siwen Zhang,Xizeng Zhao,Zhengzhi Deng,Zhaoyuan Huang,Gang Tao,Nuo Xu,Zhouteng Ye
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accelerating the solution of nonlinear partial differential equations (PDEs) while maintaining accuracy at coarse spatiotemporal resolution remains a key challenge in scientific computing. Physics-informed machine learning (ML) methods such as Physics-Informed Neural Networks (PINNs) introduce prior knowledge through loss functions to ensure physical consistency, but their “soft constraints” are usually not strictly satisfied. Here, we propose LaPON, an operator network inspired by the Lagrange’s mean value theorem, which embeds prior knowledge directly into the neural network architecture instead of the loss function, making the neural network naturally satisfy the given constraints. This is a hybrid framework that combines neural operators with traditional numerical methods, where neural operators are used to compensate for the effect of discretization errors on the analytical scale in under-resolution simulations. As evaluated on turbulence problem modeled by the Navier-Stokes equations (NSE), the multiple time step extrapolation accuracy and stability of LaPON exceed the direct numerical simulation baseline at 8x coarser grids and 8x larger time steps, while achieving a vorticity correlation of more than 0.98 with the ground truth. It is worth noting that the model can be well generalized to unseen flow states, such as turbulence with different forcing, without retraining. In addition, with the same training data, LaPON’s comprehensive metrics on the out-of-distribution test set are at least approximately twice as good as two popular ML baseline methods. By combining numerical computing with machine learning, LaPON provides a scalable and reliable solution for high-fidelity fluid dynamics simulation, showing the potential for wide application in fields such as weather forecasting and engineering design.

[LG-192] WaLRUS: Wavelets for Long-range Representation Using SSMs NEURIPS2025

链接: https://arxiv.org/abs/2505.12161
作者: Hossein Babaei,Mel White,Sina Alemohammad,Richard G. Baraniuk
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: 15 pages, 8 figures. Submitted to Neurips 2025

点击查看摘要

Abstract:State-Space Models (SSMs) have proven to be powerful tools for modeling long-range dependencies in sequential data. While the recent method known as HiPPO has demonstrated strong performance, and formed the basis for machine learning models S4 and Mamba, it remains limited by its reliance on closed-form solutions for a few specific, well-behaved bases. The SaFARi framework generalized this approach, enabling the construction of SSMs from arbitrary frames, including non-orthogonal and redundant ones, thus allowing an infinite diversity of possible “species” within the SSM family. In this paper, we introduce WaLRUS (Wavelets for Long-range Representation Using SSMs), a new implementation of SaFARi built from Daubechies wavelets.

[LG-193] -Rex: Fitting a Robust Factor Model via Expectation-Maximization

链接: https://arxiv.org/abs/2505.12117
作者: Daniel Cederberg
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注: Currently under review

点击查看摘要

Abstract:Over the past decades, there has been a surge of interest in studying low-dimensional structures within high-dimensional data. Statistical factor models - i.e., low-rank plus diagonal covariance structures - offer a powerful framework for modeling such structures. However, traditional methods for fitting statistical factor models, such as principal component analysis (PCA) or maximum likelihood estimation assuming the data is Gaussian, are highly sensitive to heavy tails and outliers in the observed data. In this paper, we propose a novel expectation-maximization (EM) algorithm for robustly fitting statistical factor models. Our approach is based on Tyler’s M-estimator of the scatter matrix for an elliptical distribution, and consists of solving Tyler’s maximum likelihood estimation problem while imposing a structural constraint that enforces the low-rank plus diagonal covariance structure. We present numerical experiments on both synthetic and real examples, demonstrating the robustness of our method for direction-of-arrival estimation in nonuniform noise and subspace recovery.

[LG-194] hompson Sampling-like Algorithms for Stochastic Rising Bandits

链接: https://arxiv.org/abs/2505.12092
作者: Marco Fiandri,Alberto Maria Metelli,Francesco Trovò
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 57 pages

点击查看摘要

Abstract:Stochastic rising rested bandit (SRRB) is a setting where the arms’ expected rewards increase as they are pulled. It models scenarios in which the performances of the different options grow as an effect of an underlying learning process (e.g., online model selection). Even if the bandit literature provides specifically crafted algorithms based on upper-confidence bounds for such a setting, no study about Thompson sampling TS-like algorithms has been performed so far. The strong regularity of the expected rewards in the SRRB setting suggests that specific instances may be tackled effectively using adapted and sliding-window TS approaches. This work provides novel regret analyses for such algorithms in SRRBs, highlighting the challenges and providing new technical tools of independent interest. Our results allow us to identify under which assumptions TS-like algorithms succeed in achieving sublinear regret and which properties of the environment govern the complexity of the regret minimization problem when approached with TS. Furthermore, we provide a regret lower bound based on a complexity index we introduce. Finally, we conduct numerical simulations comparing TS-like algorithms with state-of-the-art approaches for SRRBs in synthetic and real-world settings.

[LG-195] Multi-Attribute Graph Estimation with Sparse-Group Non-Convex Penalties

链接: https://arxiv.org/abs/2505.11984
作者: Jitendra K Tugnait
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 16 pages, 1 figure, 1 table, published in IEEE Access, vol. 13, pp. 80174-80190, 2025. arXiv admin note: text overlap with arXiv:2505.09748

点击查看摘要

Abstract:We consider the problem of inferring the conditional independence graph (CIG) of high-dimensional Gaussian vectors from multi-attribute data. Most existing methods for graph estimation are based on single-attribute models where one associates a scalar random variable with each node. In multi-attribute graphical models, each node represents a random vector. In this paper we provide a unified theoretical analysis of multi-attribute graph learning using a penalized log-likelihood objective function. We consider both convex (sparse-group lasso) and sparse-group non-convex (log-sum and smoothly clipped absolute deviation (SCAD) penalties) penalty/regularization functions. An alternating direction method of multipliers (ADMM) approach coupled with local linear approximation to non-convex penalties is presented for optimization of the objective function. For non-convex penalties, theoretical analysis establishing local consistency in support recovery, local convexity and precision matrix estimation in high-dimensional settings is provided under two sets of sufficient conditions: with and without some irrepresentability conditions. We illustrate our approaches using both synthetic and real-data numerical examples. In the synthetic data examples the sparse-group log-sum penalized objective function significantly outperformed the lasso penalized as well as SCAD penalized objective functions with F_1 -score and Hamming distance as performance metrics.

[LG-196] Improving the discovery of near-Earth objects with machine-learning methods

链接: https://arxiv.org/abs/2505.11910
作者: Peter Vereš,Richard Cloete,Matthew J. Payne,Abraham Loeb
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注: 13 pages, 16 figures, 11 tables

点击查看摘要

Abstract:We present a comprehensive analysis of the digest2 parameters for candidates of the Near-Earth Object Confirmation Page (NEOCP) that were reported between 2019 and 2024. Our study proposes methods for significantly reducing the inclusion of non-NEO objects on the NEOCP. Despite the substantial increase in near-Earth object (NEO) discoveries in recent years, only about half of the NEOCP candidates are ultimately confirmed as NEOs. Therefore, much observing time is spent following up on non-NEOs. Furthermore, approximately 11% of the candidates remain unconfirmed because the follow-up observations are insufficient. These are nearly 600 cases per year. To reduce false positives and minimize wasted resources on non-NEOs, we refine the posting criteria for NEOCP based on a detailed analysis of all digest2 scores. We investigated 30 distinct digest2 parameter categories for candidates that were confirmed as NEOs and non-NEOs. From this analysis, we derived a filtering mechanism based on selected digest2 parameters that were able to exclude 20% of the non-NEOs from the NEOCP while maintaining a minimal loss of true NEOs. We also investigated the application of four machine-learning (ML) techniques, that is, the gradient-boosting machine (GBM), the random forest (RF) classifier, the stochastic gradient descent (SGD) classifier, and neural networks (NN) to classify NEOCP candidates as NEOs or non-NEOs. Based on digest2 parameters as input, our ML models achieved a precision of approximately 95% in distinguishing between NEOs and non-NEOs. Results. Combining the digest2 parameter filter with an ML-based classification model, we demonstrate a significant reduction in non-NEOs on the NEOCP that exceeds 80%, while limiting the loss of NEO discovery tracklets to 5.5%. Importantly, we show that most follow-up tracklets of initially misclassified NEOs are later correctly identified as NEOs.

[LG-197] Measurement Score-Based Diffusion Model

链接: https://arxiv.org/abs/2505.11853
作者: Chicago Y. Park,Shirin Shoushtari,Hongyu An,Ulugbek S. Kamilov
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models are widely used in applications ranging from image generation to inverse problems. However, training diffusion models typically requires clean ground-truth images, which are unavailable in many applications. We introduce the Measurement Score-based diffusion Model (MSM), a novel framework that learns partial measurement scores using only noisy and subsampled measurements. MSM models the distribution of full measurements as an expectation over partial scores induced by randomized subsampling. To make the MSM representation computationally efficient, we also develop a stochastic sampling algorithm that generates full images by using a randomly selected subset of partial scores at each step. We additionally propose a new posterior sampling method for solving inverse problems that reconstructs images using these partial scores. We provide a theoretical analysis that bounds the Kullback-Leibler divergence between the distributions induced by full and stochastic sampling, establishing the accuracy of the proposed algorithm. We demonstrate the effectiveness of MSM on natural images and multi-coil MRI, showing that it can generate high-quality images and solve inverse problems – all without access to clean training data. Code is available at this https URL.

[LG-198] S-Crescendo: A Nested Transformer Weaving Framework for Scalable Nonlinear System in S-Domain Representation

链接: https://arxiv.org/abs/2505.11843
作者: Junlang Huang,Hao Chen,Li Luo,Yong Cai,Lexin Zhang,Tianhao Ma,Yitian Zhang,Zhong Guan
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulation of high-order nonlinear system requires extensive computational resources, especially in modern VLSI backend design where bifurcation-induced instability and chaos-like transient behaviors pose challenges. We present S-Crescendo - a nested transformer weaving framework that synergizes S-domain with neural operators for scalable time-domain prediction in high-order nonlinear networks, alleviating the computational bottlenecks of conventional solvers via Newton-Raphson method. By leveraging the partial-fraction decomposition of an n-th order transfer function into first-order modal terms with repeated poles and residues, our method bypasses the conventional Jacobian matrix-based iterations and efficiently reduces computational complexity from cubic O(n^3) to linear O(n) .The proposed architecture seamlessly integrates an S-domain encoder with an attention-based correction operator to simultaneously isolate dominant response and adaptively capture higher-order non-linearities. Validated on order-1 to order-10 networks, our method achieves up to 0.99 test-set ( R^2 ) accuracy against HSPICE golden waveforms and accelerates simulation by up to 18(X), providing a scalable, physics-aware framework for high-dimensional nonlinear modeling.

[LG-199] AnalyticKWS: Towards Exemplar-Free Analytic Class Incremental Learning for Small-footprint Keyword Spotting ACL2025

链接: https://arxiv.org/abs/2505.11817
作者: Yang Xiao,Tianyi Peng,Rohan Kumar Das,Yuchen Hu,Huiping Zhuang
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted by ACL 2025

点击查看摘要

Abstract:Keyword spotting (KWS) offers a vital mechanism to identify spoken commands in voice-enabled systems, where user demands often shift, requiring models to learn new keywords continually over time. However, a major problem is catastrophic forgetting, where models lose their ability to recognize earlier keywords. Although several continual learning methods have proven their usefulness for reducing forgetting, most existing approaches depend on storing and revisiting old data to combat catastrophic forgetting. Though effective, these methods face two practical challenges: 1) privacy risks from keeping user data and 2) large memory and time consumption that limit deployment on small devices. To address these issues, we propose an exemplar-free Analytic Continual Learning (AnalyticKWS) method that updates model parameters without revisiting earlier data. Inspired by efficient learning principles, AnalyticKWS computes a closed-form analytical solution for model updates and requires only a single epoch of adaptation for incoming keywords. AnalyticKWS demands fewer computational resources by avoiding gradient-based updates and does not store old data. By eliminating the need for back-propagation during incremental learning, the model remains lightweight and efficient. As a result, AnalyticKWS meets the challenges mentioned earlier and suits resource-limited settings well. Extensive experiments on various datasets and settings show that AnalyticKWS consistently outperforms existing continual learning methods.

[LG-200] Missing Data Imputation by Reducing Mutual Information with Rectified Flows

链接: https://arxiv.org/abs/2505.11749
作者: Jiahao Yu,Qizhen Ying,Leyang Wang,Ziyue Jiang,Song Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel iterative method for missing data imputation that sequentially reduces the mutual information between data and their corresponding missing mask. Inspired by GAN-based approaches, which train generators to decrease the predictability of missingness patterns, our method explicitly targets the reduction of mutual information. Specifically, our algorithm iteratively minimizes the KL divergence between the joint distribution of the imputed data and missing mask, and the product of their marginals from the previous iteration. We show that the optimal imputation under this framework corresponds to solving an ODE, whose velocity field minimizes a rectified flow training objective. We further illustrate that some existing imputation techniques can be interpreted as approximate special cases of our mutual-information-reducing framework. Comprehensive experiments on synthetic and real-world datasets validate the efficacy of our proposed approach, demonstrating superior imputation performance.

[LG-201] Explainable Machine Learning for Oxygen Diffusion in Perovskites and Pyrochlores

链接: https://arxiv.org/abs/2505.11722
作者: Grace M. Lu,Dallas R. Trinkle(Department of Materials Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA)
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 32 pages, 11 figures

点击查看摘要

Abstract:Explainable machine learning can help to discover new physical relationships for material properties. To understand the material properties that govern the activation energy for oxygen diffusion in perovskites and pyrochlores, we build a database of experimental activation energies and apply a grouping algorithm to the material property features. These features are then used to fit seven different machine learning models. An ensemble consensus determines that the most important features for predicting the activation energy are the ionicity of the A-site bond and the partial pressure of oxygen for perovskites. For pyrochlores, the two most important features are the A-site s valence electron count and the B-site electronegativity. The most important features are all constructed using the weighted averages of elemental metal properties, despite weighted averages of the constituent binary oxides being included in our feature set. This is surprising because the material properties of the constituent oxides are more similar to the experimentally measured properties of perovskites and pyrochlores than the features of the metals that are chosen. The easy-to-measure features identified in this work enable rapid screening for new materials with fast oxide-ion diffusivity.

[LG-202] Humble your Overconfident Networks: Unlearning Overfitting via Sequential Monte Carlo Tempered Deep Ensembles

链接: https://arxiv.org/abs/2505.11671
作者: Andrew Millard,Zheng Zhao,Joshua Murphy,Simon Maskell
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Sequential Monte Carlo (SMC) methods offer a principled approach to Bayesian uncertainty quantification but are traditionally limited by the need for full-batch gradient evaluations. We introduce a scalable variant by incorporating Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) proposals into SMC, enabling efficient mini-batch based sampling. Our resulting SMCSGHMC algorithm outperforms standard stochastic gradient descent (SGD) and deep ensembles across image classification, out-of-distribution (OOD) detection, and transfer learning tasks. We further show that SMCSGHMC mitigates overfitting and improves calibration, providing a flexible, scalable pathway for converting pretrained neural networks into well-calibrated Bayesian models.

[LG-203] he Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations

链接: https://arxiv.org/abs/2505.11622
作者: Michael L. Wells,Kamel Lahouel,Bruno Jedynak
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 10 pages, 1 figure, and 2 tables in main part of text. 30 pages, 2 figures, and 3 tables in full submission including technical appendices

点击查看摘要

Abstract:We present a novel kernel-based method for learning multivariate stochastic differential equations (SDEs). The method follows a two-step procedure: we first estimate the drift term function, then the (matrix-valued) diffusion function given the drift. Occupation kernels are integral functionals on a reproducing kernel Hilbert space (RKHS) that aggregate information over a trajectory. Our approach leverages vector-valued occupation kernels for estimating the drift component of the stochastic process. For diffusion estimation, we extend this framework by introducing operator-valued occupation kernels, enabling the estimation of an auxiliary matrix-valued function as a positive semi-definite operator, from which we readily derive the diffusion estimate. This enables us to avoid common challenges in SDE learning, such as intractable likelihoods, by optimizing a reconstruction-error-based objective. We propose a simple learning procedure that retains strong predictive accuracy while using Fenchel duality to promote efficiency. We validate the method on simulated benchmarks and a real-world dataset of Amyloid imaging in healthy and Alzheimer’s disease (AD) subjects.

信息检索

[IR-0] Optimizing Retrieval Augmented Generation for Object Constraint Language

链接: https://arxiv.org/abs/2505.13129
作者: Kevin Chenhao Li,Vahid Zolfaghari,Nenad Petrovic,Fengjunjie Pan,Alois Knoll
类目: Information Retrieval (cs.IR); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:The Object Constraint Language (OCL) is essential for defining precise constraints within Model-Based Systems Engineering (MBSE). However, manually writing OCL rules is complex and time-consuming. This study explores the optimization of Retrieval-Augmented Generation (RAG) for automating OCL rule generation, focusing on the impact of different retrieval strategies. We evaluate three retrieval approaches \unicodex2013 BM25 (lexical-based), BERT-based (semantic retrieval), and SPLADE (sparse-vector retrieval) \unicodex2013 analyzing their effectiveness in providing relevant context for a large language model. To further assess our approach, we compare and benchmark our retrieval-optimized generation results against PathOCL, a state-of-the-art graph-based method. We directly compare BM25, BERT, and SPLADE retrieval methods with PathOCL to understand how different retrieval methods perform for a unified evaluation framework. Our experimental results, focusing on retrieval-augmented generation, indicate that while retrieval can enhance generation accuracy, its effectiveness depends on the retrieval method and the number of retrieved chunks (k). BM25 underperforms the baseline, whereas semantic approaches (BERT and SPLADE) achieve better results, with SPLADE performing best at lower k values. However, excessive retrieval with high k parameter can lead to retrieving irrelevant chunks which degrades model performance. Our findings highlight the importance of optimizing retrieval configurations to balance context relevance and output consistency. This research provides insights into improving OCL rule generation using RAG and underscores the need for tailoring retrieval. Subjects: Information Retrieval (cs.IR); Software Engineering (cs.SE) Cite as: arXiv:2505.13129 [cs.IR] (or arXiv:2505.13129v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2505.13129 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-1] owards A Generalist Code Embedding Model Based On Massive Data Synthesis

链接: https://arxiv.org/abs/2505.12697
作者: Chaofan Li,Jianlyu Chen,Yingxia Shao,Defu Lian,Zheng Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Code embedding models attract increasing attention due to the widespread popularity of retrieval-augmented generation (RAG) in software development. These models are expected to capture the rich semantic relationships inherent to code, which differ significantly from those found in text. However, existing models remain severely limited due to the scarcity of high-quality training data. In this work, we introduce \textbfCodeR (\underlineCode \underlineRetrieval), a state-of-the-art embedding model for general-purpose code retrieval. The superior performance of CodeR is built upon CodeR-Pile, a large-scale synthetic dataset constructed under the DRU (Diversity, Reliability, Usability) principle via a novel data synthesis pipeline. To optimize training effectiveness, we propose Annealing, a curriculum learning strategy that enables effective knowledge transfer across heterogeneous sources of data. We evaluate CodeR based on 16 diverse code retrieval tasks, where it significantly outperforms existing baselines and exhibits strong out-of-domain generalization performance. We have publicly released our code and the well-trained model to facilitate further research in this critical area. this https URL.

[IR-2] LLM -based Query Expansion Fails for Unfamiliar and Ambiguous Queries SIGIR2025

链接: https://arxiv.org/abs/2505.12694
作者: Kenya Abe,Kunihiro Takeoka,Makoto P. Kato,Masafumi Oyamada
类目: Information Retrieval (cs.IR)
*备注: Accepted at SIGIR 2025 short paper track

点击查看摘要

Abstract:Query expansion (QE) enhances retrieval by incorporating relevant terms, with large language models (LLMs) offering an effective alternative to traditional rule-based and statistical methods. However, LLM-based QE suffers from a fundamental limitation: it often fails to generate relevant knowledge, degrading search performance. Prior studies have focused on hallucination, yet its underlying cause–LLM knowledge deficiencies–remains underexplored. This paper systematically examines two failure cases in LLM-based QE: (1) when the LLM lacks query knowledge, leading to incorrect expansions, and (2) when the query is ambiguous, causing biased refinements that narrow search coverage. We conduct controlled experiments across multiple datasets, evaluating the effects of knowledge and query ambiguity on retrieval performance using sparse and dense retrieval models. Our results reveal that LLM-based QE can significantly degrade the retrieval effectiveness when knowledge in the LLM is insufficient or query ambiguity is high. We introduce a framework for evaluating QE under these conditions, providing insights into the limitations of LLM-based retrieval augmentation.

[IR-3] PoisonArena: Uncovering Competing Poisoning Attacks in Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2505.12574
作者: Liuji Chen,Xiaofang Yang,Yuanzhuo Lu,Jinghao Zhang,Xin Sun,Qiang Liu,Shu Wu,Jing Dong,Liang Wang
类目: Information Retrieval (cs.IR)
*备注: 29 pages

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems, widely used to improve the factual grounding of large language models (LLMs), are increasingly vulnerable to poisoning attacks, where adversaries inject manipulated content into the retriever’s corpus. While prior research has predominantly focused on single-attacker settings, real-world scenarios often involve multiple, competing attackers with conflicting objectives. In this work, we introduce PoisonArena, the first benchmark to systematically study and evaluate competing poisoning attacks in RAG. We formalize the multi-attacker threat model, where attackers vie to control the answer to the same query using mutually exclusive misinformation. PoisonArena leverages the Bradley-Terry model to quantify each method’s competitive effectiveness in such adversarial environments. Through extensive experiments on the Natural Questions and MS MARCO datasets, we demonstrate that many attack strategies successful in isolation fail under competitive pressure. Our findings highlight the limitations of conventional evaluation metrics like Attack Success Rate (ASR) and F1 score and underscore the need for competitive evaluation to assess real-world attack robustness. PoisonArena provides a standardized framework to benchmark and develop future attack and defense strategies under more realistic, multi-adversary conditions. Project page: this https URL.

[IR-4] Batched Self-Consistency Improves LLM Relevance Assessment and Ranking

链接: https://arxiv.org/abs/2505.12570
作者: Anton Korikov,Pan Du,Scott Sanner,Navid Rekabsaz
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Given some information need, Large Language Models (LLMs) are increasingly used for candidate text relevance assessment, typically using a one-by-one pointwise (PW) strategy where each LLM call evaluates one candidate at a time. Meanwhile, it has been shown that LLM performance can be improved through self-consistency: prompting the LLM to do the same task multiple times (possibly in perturbed ways) and then aggregating the responses. To take advantage of self-consistency, we hypothesize that batched PW strategies, where multiple passages are judged in one LLM call, are better suited than one-by-one PW methods since a larger input context can induce more diverse LLM sampling across self-consistency calls. We first propose several candidate batching strategies to create prompt diversity across self-consistency calls through subset reselection and permutation. We then test our batched PW methods on relevance assessment and ranking tasks against one-by-one PW and listwise LLM ranking baselines with and without self-consistency, using three passage retrieval datasets and GPT-4o, Claude Sonnet 3, and Amazon Nova Pro. We find that batched PW methods outperform all baselines, and show that batching can greatly amplify the positive effects of self-consistency. For instance, on our legal search dataset, GPT-4o one-by-one PW ranking NDCG@10 improves only from 44.9% to 46.8% without self-consistency vs. with 15 self consistency calls, while batched PW ranking improves from 43.8% to 51.3%, respectively.

[IR-5] LLM -CoT Enhanced Graph Neural Recommendation with Harmonized Group Policy Optimization

链接: https://arxiv.org/abs/2505.12396
作者: Hailong Luo,Bin Wu,Hongyong Jia,Qingqing Zhu,Lianlei Shan
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have advanced recommender systems by modeling interaction relationships. However, existing graph-based recommenders rely on sparse ID features and do not fully exploit textual information, resulting in low information density within representations. Furthermore, graph contrastive learning faces challenges. Random negative sampling can introduce false negative samples, while fixed temperature coefficients cannot adapt to the heterogeneity of different nodes. In addition, current efforts to enhance recommendations with large language models (LLMs) have not fully utilized their Chain-of-Thought (CoT) reasoning capabilities to guide representation learning. To address these limitations, we introduces LGHRec (LLM-CoT Enhanced Graph Neural Recommendation with Harmonized Group Policy Optimization). This framework leverages the CoT reasoning ability of LLMs to generate semantic IDs, enriching reasoning processes and improving information density and semantic quality of representations. Moreover, we design a reinforcement learning algorithm, Harmonized Group Policy Optimization (HGPO), to optimize negative sampling strategies and temperature coefficients in contrastive learning. This approach enhances long-tail recommendation performance and ensures optimization consistency across different groups. Experimental results on three datasets demonstrate that LGHRec improves representation quality through semantic IDs generated by LLM’s CoT reasoning and effectively boosts contrastive learning with HGPO. Our method outperforms several baseline models. The code is available at: this https URL.

[IR-6] Addressing Missing Data Issue for Diffusion-based Recommendation

链接: https://arxiv.org/abs/2505.12283
作者: Wenyu Mao,Zhengyi Yang,Jiancan Wu,Haozhe Liu,Yancheng Yuan,Xiang Wang,Xiangnan He
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Diffusion models have shown significant potential in generating oracle items that best match user preference with guidance from user historical interaction sequences. However, the quality of guidance is often compromised by unpredictable missing data in observed sequence, leading to suboptimal item generation. Since missing data is uncertain in both occurrence and content, recovering it is impractical and may introduce additional errors. To tackle this challenge, we propose a novel dual-side Thompson sampling-based Diffusion Model (TDM), which simulates extra missing data in the guidance signals and allows diffusion models to handle existing missing data through extrapolation. To preserve user preference evolution in sequences despite extra missing data, we introduce Dual-side Thompson Sampling to implement simulation with two probability models, sampling by exploiting user preference from both item continuity and sequence stability. TDM strategically removes items from sequences based on dual-side Thompson sampling and treats these edited sequences as guidance for diffusion models, enhancing models’ robustness to missing data through consistency regularization. Additionally, to enhance the generation efficiency, TDM is implemented under the denoising diffusion implicit models to accelerate the reverse process. Extensive experiments and theoretical analysis validate the effectiveness of TDM in addressing missing data in sequential recommendations.

[IR-7] A Survey on Side Information-driven Session-based Recommendation: From a Data-centric Perspective

链接: https://arxiv.org/abs/2505.12279
作者: Xiaokun Zhang,Bo Xu,Chenliang Li,Bowei He,Hongfei Lin,Chen Ma,Fenglong Ma
类目: Information Retrieval (cs.IR)
*备注: This work has been accepted by IEEE TKDE as a survey paper

点击查看摘要

Abstract:Session-based recommendation is gaining increasing attention due to its practical value in predicting the intents of anonymous users based on limited behaviors. Emerging efforts incorporate various side information to alleviate inherent data scarcity issues in this task, leading to impressive performance improvements. The core of side information-driven session-based recommendation is the discovery and utilization of diverse data. In this survey, we provide a comprehensive review of this task from a data-centric perspective. Specifically, this survey commences with a clear formulation of the task. This is followed by a detailed exploration of various benchmarks rich in side information that are pivotal for advancing research in this field. Afterwards, we delve into how different types of side information enhance the task, underscoring data characteristics and utility. Moreover, we discuss the usage of various side information, including data encoding, data injection, and involved techniques. A systematic review of research progress is then presented, with the taxonomy by the types of side information. Finally, we summarize the current limitations and present the future prospects of this vibrant topic.

[IR-8] Scalable Time-Tagged Data Acquisition for Entanglement Distribution in Quantum Networks

链接: https://arxiv.org/abs/2505.12102
作者: Abderrahim Amlou,Thomas Gerrits,Anouar Rahmouni,Amar Abane,Mheni Merzouki,Ya-Shian Li-Baboud,Ahmed Lbath,Abdella Battou,Oliver Slattery
类目: oftware Engineering (cs.SE); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:In distributed quantum applications such as entanglement distribution, precise time synchronization and efficient time-tagged data handling are essential. Traditional systems often suffer from overflow, synchronization drift, and storage inefficiencies. We propose a modular Time Tagging (TT) agent that uses a 1 pulse per second (PPS) signal from White Rabbit (WR) devices to achieve network-wide synchronization, while applying real-time calibration, overflow mitigation, and compression. A live two-lab entanglement distribution experiment validated the system’s performance, achieving synchronized coincidence detection at 25,000 counts/sec.

[IR-9] Basic model for ranking microfinance institutions

链接: https://arxiv.org/abs/2505.11944
作者: Dmitry Dudukalov,Evgeny Prokopenko
类目: Information Retrieval (cs.IR); Probability (math.PR); Other Statistics (stat.OT)
*备注:

点击查看摘要

Abstract:This paper discusses the challenges encountered in building a ranking model for aggregator site products, using the example of ranking microfinance institutions (MFIs) based on post-click conversion. We suggest which features of MFIs should be considered, and using an algorithm based on Markov chains, we demonstrate the ``usefulness’’ of these features on real data. The ideas developed in this work can be applied to aggregator websites in microinsurance, especially when personal data is unavailable. Since we did not find similar datasets in the public domain, we are publishing our dataset with a detailed description of its attributes.

[IR-10] co-oRAG : Optimizing Retrieval-augmented Generation for Telecom Queries via Hybrid Retrieval and Neural Routing

链接: https://arxiv.org/abs/2505.11856
作者: Andrei-Laurentiu Bornea,Fadhel Ayed,Antonio De Domenico,Nicola Piovesan,Tareq Si Salem,Ali Maatouk
类目: Information Retrieval (cs.IR)
*备注: 12 pages, 10 figures, 4 tables

点击查看摘要

Abstract:Artificial intelligence will be one of the key pillars of the next generation of mobile networks (6G), as it is expected to provide novel added-value services and improve network performance. In this context, large language models have the potential to revolutionize the telecom landscape through intent comprehension, intelligent knowledge retrieval, coding proficiency, and cross-domain orchestration capabilities. This paper presents Telco-oRAG, an open-source Retrieval-Augmented Generation (RAG) framework optimized for answering technical questions in the telecommunications domain, with a particular focus on 3GPP standards. Telco-oRAG introduces a hybrid retrieval strategy that combines 3GPP domain-specific retrieval with web search, supported by glossary-enhanced query refinement and a neural router for memory-efficient retrieval. Our results show that Telco-oRAG improves the accuracy in answering 3GPP-related questions by up to 17.6% and achieves a 10.6% improvement in lexicon queries compared to baselines. Furthermore, Telco-oRAG reduces memory usage by 45% through targeted retrieval of relevant 3GPP series compared to baseline RAG, and enables open-source LLMs to reach GPT-4-level accuracy on telecom benchmarks.

[IR-11] he Effects of Demographic Instructions on LLM Personas SIGIR’25

链接: https://arxiv.org/abs/2505.11795
作者: Angel Felipe Magnossão de Paula,J. Shane Culpepper,Alistair Moffat,Sachin Pathiyan Cherumanal,Falk Scholer,Johanne Trippas
类目: Information Retrieval (cs.IR)
*备注: Accepted at SIGIR’25, Padua, Italy

点击查看摘要

Abstract:Social media platforms must filter sexist content in compliance with governmental regulations. Current machine learning approaches can reliably detect sexism based on standardized definitions, but often neglect the subjective nature of sexist language and fail to consider individual users’ perspectives. To address this gap, we adopt a perspectivist approach, retaining diverse annotations rather than enforcing gold-standard labels or their aggregations, allowing models to account for personal or group-specific views of sexism. Using demographic data from Twitter, we employ large language models (LLMs) to personalize the identification of sexism.

[IR-12] rminators: Terms of Service Parsing and Auditing Agents

链接: https://arxiv.org/abs/2505.11672
作者: Maruf Ahmed Mridul,Inwon Kang,Oshani Seneviratne
类目: Information Retrieval (cs.IR)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Terms of Service (ToS) documents are often lengthy and written in complex legal language, making them difficult for users to read and understand. To address this challenge, we propose Terminators, a modular agentic framework that leverages large language models (LLMs) to parse and audit ToS documents. Rather than treating ToS understanding as a black-box summarization problem, Terminators breaks the task down to three interpretable steps: term extraction, verification, and accountability planning. We demonstrate the effectiveness of our method on the OpenAI ToS using GPT-4o, highlighting strategies to minimize hallucinations and maximize auditability. Our results suggest that structured, agent-based LLM workflows can enhance both the usability and enforceability of complex legal documents. By translating opaque terms into actionable, verifiable components, Terminators promotes ethical use of web content by enabling greater transparency, empowering users to understand their digital rights, and supporting automated policy audits for regulatory or civic oversight.

附件下载

点击下载今日全部论文列表