This blog post contains the latest paper listing retrieved from Arxiv.org on 2025-02-06. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 every day.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Contents

Overview (2025-02-06)

A total of 445 papers were updated today, including:

  • Natural Language Processing: 64 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 118 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 65 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 186 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Do Large Language Model Benchmarks Test Reliability?

【Quick Read】: This paper addresses the insufficient evaluation of the reliability of large language models (LLMs) at deployment time: current benchmarks mainly track model capability while neglecting reliability. The key contribution is the concept of "platinum benchmarks", benchmarks carefully curated to minimize label errors and ambiguity so that model reliability can be quantified more accurately. By revising examples from fifteen existing popular benchmarks, the authors show that frontier LLMs still fail consistently on simple tasks and uncover previously unidentified model weaknesses.

Link: https://arxiv.org/abs/2502.03461
Authors: Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry
Affiliation: MIT
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs’ growing capabilities, however there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at this https URL

[NLP-1] Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training

【Quick Read】: This paper addresses the tension between the demand for small language models (SLMs) on edge devices and the cost of obtaining high-performing ones. Conventional approaches either pre-train models from scratch, which requires enormous compute, or compress/prune existing large language models (LLMs), which degrades performance and falls short of pre-training. The key contribution is Adapt-Pruner, a layer-wise adaptive pruning method that combines structured pruning with model training. It is highly effective on LLMs; with further training it can rival pre-training from scratch, and its incremental pruning, removing only a small portion of neurons at a time, brings non-trivial performance gains. Experiments show that Adapt-Pruner outperforms conventional pruning methods by an average of 1%-7% accuracy on commonsense benchmarks, restores the performance of MobileLLM-125M to the 600M level on the MMLU benchmark, and discovers a new 1B model that surpasses LLaMA-3.2-1B.

Link: https://arxiv.org/abs/2502.03460
Authors: Boyao Wang, Rui Pan, Shizhe Diao, Xingyuan Pan, Jipeng Zhang, Renjie Pi, Tong Zhang
Affiliations: University of Illinois Urbana-Champaign; HKUST
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We found 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-training from scratch, 3) incremental pruning brings non-trivial performance gain by interleaving pruning with training and only removing a small portion of neurons (~5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200× fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks.
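
For intuition, here is a toy, runnable schematic of the interleaved prune-then-train loop the abstract describes, using row L2 norm as a stand-in importance score; Adapt-Pruner's layer-wise adaptive criterion and the actual recovery training are the paper's own and are not reproduced here.

```python
import numpy as np

def prune_rows(weight: np.ndarray, ratio: float) -> np.ndarray:
    """Remove the lowest-norm `ratio` fraction of rows (neurons)."""
    norms = np.linalg.norm(weight, axis=1)
    keep = np.sort(norms.argsort()[int(len(norms) * ratio):])  # keep row order
    return weight[keep]

layers = [np.random.randn(256, 128) for _ in range(4)]  # stand-in "model"
for _ in range(10):                         # interleave: prune a little...
    layers = [prune_rows(w, 0.05) for w in layers]
    # ...then train briefly to recover (omitted in this sketch)
print([w.shape[0] for w in layers])         # neuron counts after 10 rounds
```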

[NLP-2] On Fairness of Unified Multimodal Large Language Model for Image Generation

【Quick Read】: This paper addresses the significant demographic biases, such as gender and race bias, exhibited in the outputs of unified multimodal large language models (U-MLLMs). The key to the solution is a locate-then-fix strategy: an audit of individual model components shows that the bias originates primarily from the language model. The authors further propose a novel balanced preference model that uses synthetic data to balance the demographic distribution, reducing bias while preserving semantic fidelity.

Link: https://arxiv.org/abs/2502.03429
Authors: Ming Liu, Hao Chen, Jindong Wang, Liwen Wang, Bhiksha Raj Ramakrishnan, Wensheng Zhang
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in visual understanding and generation in an end-to-end pipeline. Compared with generation-only models (e.g., Stable Diffusion), U-MLLMs may raise new questions about bias in their outputs, which can be affected by their unified capabilities. This gap is particularly concerning given the under-explored risk of propagating harmful stereotypes. In this paper, we benchmark the latest U-MLLMs and find that most exhibit significant demographic biases, such as gender and race bias. To better understand and mitigate this issue, we propose a locate-then-fix strategy, where we audit and show how the individual model component is affected by bias. Our analysis shows that bias originates primarily from the language model. More interestingly, we observe a “partial alignment” phenomenon in U-MLLMs, where understanding bias appears minimal, but generation bias remains substantial. Thus, we propose a novel balanced preference model to balance the demographic distribution with synthetic data. Experiments demonstrate that our approach reduces demographic bias while preserving semantic fidelity. We hope our findings underscore the need for more holistic interpretation and debiasing strategies of U-MLLMs in the future.

[NLP-3] Think or Step-by-Step? UnZIPping the Black Box in Zero-Shot Prompts

【Quick Read】: This paper addresses the lack of a clear understanding of why zero-shot prompting techniques are so effective for large language models (LLMs). The key contribution is the ZIP score (Zero-shot Importance of Perturbation score), a versatile metric based on systematic input word perturbations that applies to both open- and closed-source models. Experiments reveal that which words matter most varies across models and tasks, and controlled experiments validate the method.

Link: https://arxiv.org/abs/2502.03418
Authors: Nikta Gohari Sadr, Sangmitra Madhusudan, Ali Emami
Affiliation: Brock University, St. Catharines, Canada
Subjects: Computation and Language (cs.CL)
Comments: 8 pages (excluding references)

Abstract:Zero-shot prompting techniques have significantly improved the performance of Large Language Models (LLMs). However, we lack a clear understanding of why zero-shot prompts are so effective. For example, in the prompt “Let’s think step-by-step,” is “think” or “step-by-step” more crucial to its success? Existing interpretability methods, such as gradient-based and attention-based approaches, are computationally intensive and restricted to open-source models. We introduce the ZIP score (Zero-shot Importance of Perturbation score), a versatile metric applicable to both open and closed-source models, based on systematic input word perturbations. Our experiments across four recent LLMs, seven widely-used prompts, and several tasks, reveal interesting patterns in word importance. For instance, while both ‘step-by-step’ and ‘think’ show high ZIP scores, which one is more influential depends on the model and task. We validate our method using controlled experiments and compare our results with human judgments, finding that proprietary models align more closely with human intuition regarding word significance. These findings enhance our understanding of LLM behavior and contribute to developing more effective zero-shot prompts and improved model analysis.
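
A minimal sketch of the underlying perturbation idea, measuring how much task performance drops when each prompt word is removed; the evaluator below is a hypothetical stand-in, and the paper's actual perturbation and scoring scheme differs in detail.

```python
def word_importance(prompt: str, evaluate):
    """Score each word by the performance drop when it is deleted."""
    baseline = evaluate(prompt)
    words = prompt.split()
    scores = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])    # drop word i
        scores.append((words[i], baseline - evaluate(perturbed)))
    return scores

# Stub evaluator standing in for running an LLM over a task set:
def evaluate(p: str) -> float:
    return 0.8 if "step-by-step" in p else 0.6

print(word_importance("Let's think step-by-step", evaluate))
```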

[NLP-4] SPRI: Aligning Large Language Models with Context-Situated Principles

【Quick Read】: This paper addresses the difficulty of aligning large language models (LLMs) with human values on complex tasks, especially those demanding intricate human oversight. Such alignment typically relies on time- and resource-intensive human expertise for context-specific guidance. Prior work steered model behavior with predefined rules or principles (Bai et al., 2022; Sun et al., 2023), but these tend to be too generic to adapt to each individual input query or context. The key contribution is Situated-PRInciples (SPRI), a framework that automatically generates guiding principles in real time for each input query with minimal or no human effort and uses them to align model responses with human values. Evaluation on three tasks shows that SPRI derives principles on par with expert-crafted ones, produces instance-specific rubrics, and yields substantial improvements in truthfulness.

Link: https://arxiv.org/abs/2502.03397
Authors: Hongli Zhan, Muneeza Azmat, Raya Horesh, Junyi Jessy Li, Mikhail Yurochkin
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at this https URL.

[NLP-5] LIMO: Less is More for Reasoning

【Quick Read】: This paper challenges the conventional wisdom that complex reasoning tasks require extensive training data (100,000 examples). The key contribution is the Less-Is-More Reasoning (LIMO) Hypothesis: in foundation models whose domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can be elicited with a small number of carefully designed demonstrations. With only 817 curated training samples, the LIMO model achieves 57.1% accuracy on AIME and 94.8% on MATH, substantially outperforming approaches that use 100x more data. This suggests that a small amount of high-quality training data suffices for strong generalization rather than mere memorization.

Link: https://arxiv.org/abs/2502.03387
Authors: Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu
Affiliation: SJTU
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages

Abstract:We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models’ 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model’s encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as “cognitive templates” that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at this https URL.

[NLP-6] High-Fidelity Simultaneous Speech-To-Speech Translation

【Quick Read】: This paper addresses the fundamental challenge of simultaneous interpretation: accumulating just enough context, chunk by chunk, to produce a correct translation in real time without waiting for the end of the source utterance. The key to the solution is a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal per-word delays and create aligned synthetic data. After supervised training, the Hibiki model performs adaptive simultaneous speech translation, and its simple inference process makes it compatible with batched translation and real-time on-device deployment.

Link: https://arxiv.org/abs/2502.03382
Authors: Tom Labiausse, Laurent Mazaré, Edouard Grave, Patrick Pérez, Alexandre Défossez, Neil Zeghidour
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which unlike its consecutive counterpart, where one waits for the end of the source utterance to start translating, adapts its flow to accumulate just enough context to produce a correct translation in real-time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples as well as models and inference code.

[NLP-7] Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality

【Quick Read】: This paper investigates the impact of automatic speech recognition (ASR) technology in remote healthcare interpreting. The study uses a within-subjects experiment design with four randomised conditions and scripted medical consultations to simulate dialogue interpreting tasks, collecting data from four trainee interpreters with a Chinese-English language combination. The key finding concerns the type of ASR output provided: access to full ASR transcripts and to ChatGPT-generated summaries based on ASR effectively improved interpreting quality and changed the distribution of interpreting error types. Preliminary data suggest participants preferred full ASR transcripts.

Link: https://arxiv.org/abs/2502.03381
Authors: Shiyi Tan, Constantin Orăsan, Sabine Braun
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: to appear in the Proceedings of Translation and the Computer (TC46)

Abstract:This paper reports on the results from a pilot study investigating the impact of automatic speech recognition (ASR) technology on interpreting quality in remote healthcare interpreting settings. Employing a within-subjects experiment design with four randomised conditions, this study utilises scripted medical consultations to simulate dialogue interpreting tasks. It involves four trainee interpreters with a language combination of Chinese and English. It also gathers participants’ experience and perceptions of ASR support through cued retrospective reports and semi-structured interviews. Preliminary data suggest that the availability of ASR, specifically the access to full ASR transcripts and to ChatGPT-generated summaries based on ASR, effectively improved interpreting quality. Varying types of ASR output had different impacts on the distribution of interpreting error types. Participants reported similar interactive experiences with the technology, expressing their preference for full ASR transcripts. This pilot study shows encouraging results of applying ASR to dialogue-based healthcare interpreting and offers insights into the optimal ways to present ASR output to enhance interpreter experience and performance. However, it should be emphasised that the main purpose of this study was to validate the methodology and that further research with a larger sample size is necessary to confirm these findings.

[NLP-8] Demystifying Long Chain-of-Thought Reasoning in LLMs

【Quick Read】: This paper investigates how long chains-of-thought (CoTs) emerge in large language models (LLMs) and which factors shape them. Through systematic supervised fine-tuning (SFT) and reinforcement learning (RL) experiments, it identifies the key factors that enable models to generate long CoT trajectories: (1) SFT is not strictly necessary but simplifies training and improves efficiency; (2) more training compute benefits reasoning, but reward shaping is needed to stabilize CoT length growth; (3) scaling verifiable reward signals is critical for RL, and noisy web-extracted solutions with filtering mechanisms show strong potential, especially for out-of-distribution (OOD) tasks such as STEM reasoning; (4) core abilities like error correction are already present in base models, but effectively incentivizing them for complex tasks demands significant compute, and measuring their emergence requires careful methodology. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs.

Link: https://arxiv.org/abs/2502.03373
Authors: Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint, under review

Abstract:Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: this https URL.

[NLP-9] Minerva: A Programmable Memory Test Benchmark for Language Models

【Quick Read】: This paper addresses the limitations of traditional data benchmarks for evaluating how well large language models (LLMs) perform tasks based on their memory (context): such benchmarks are static, prone to overfitting, hard to interpret, and lack actionable insights, failing to pinpoint which specific capability a model lacks when it fails a test. The key contribution is an automated framework for generating a comprehensive set of tests that evaluate a model's ability to use its memory effectively, extending capability testing beyond search to atomic tasks such as recalling, editing, matching, and comparing information in context memory, plus composite tests that probe a model's ability to maintain state while operating on memory.

Link: https://arxiv.org/abs/2502.03358
Authors: Menglin Xia, Victor Ruehle, Saravan Rajmohan, Reza Shokri
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights–failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models’ abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, and performing basic operations when inputs are structured into distinct blocks, simulating real-world data. Additionally, we design composite tests to investigate the models’ ability to maintain state while operating on memory. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.
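
Tests like these can be generated programmatically; a minimal sketch of one atomic task (key-value recall) is below. Minerva's framework covers far more, including editing, matching, comparing, and composite state-tracking tests, whose generators would follow the same pattern.

```python
import random
import string

def make_kv_recall_test(n_pairs: int = 50):
    """Generate a key-value context and one recall probe with its answer."""
    keys = ["".join(random.choices(string.ascii_lowercase, k=6)) for _ in range(n_pairs)]
    vals = [str(random.randint(0, 9999)) for _ in range(n_pairs)]
    context = "\n".join(f"{k}: {v}" for k, v in zip(keys, vals))
    probe = random.randrange(n_pairs)
    prompt = f"{context}\n\nWhat is the value of {keys[probe]}?"
    return prompt, vals[probe]   # (model input, gold answer)
```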

[NLP-10] ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model

【Quick Read】: This paper addresses the inability of existing explanatory frameworks to jointly account for the combined influence of In-Context Learning (ICL) and Chain-of-Thought (CoT). The key contribution is the Electronic Circuit Model (ECM), which treats model behavior as an electronic circuit: ICL acts as a semantic magnetic field that provides additional voltage (following Faraday's law), while CoT is modeled as series resistors that constrain output performance (following Ohm's law). ECM effectively predicts and explains LLM performance across a variety of prompting strategies, and applying it to reasoning-strategy optimization on tasks such as the International Olympiad in Informatics (IOI) and the International Mathematical Olympiad (IMO) achieves performance surpassing nearly 80% of top human competitors.

Link: https://arxiv.org/abs/2502.03325
Authors: Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiaqi Wang, Mengkang Hu, Zhi Chen, Wanxiang Che, Ting Liu
Affiliations: Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology; School of Computer Science and Engineering, Central South University; The Chinese University of Hong Kong; The University of Hong Kong; ByteDance Seed (China)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Manuscript

Abstract:Recent advancements in large language models (LLMs) have led to significant successes across various applications, where the most noticeable is to a series of emerging capabilities, particularly in the areas of In-Context Learning (ICL) and Chain-of-Thought (CoT). To better understand and control model performance, many studies have begun investigating the underlying causes of these phenomena and their impact on task outcomes. However, existing explanatory frameworks predominantly focus on isolating and explaining ICL and CoT independently, leading to an incomplete understanding of their combined influence on model performance. To address this gap, we propose the Electronic Circuit Model (ECM), which provides a foundation for developing scalable, learnable policies and improving the management of AI-generated content. Specifically, ECM conceptualizes model behavior as an electronic circuit: ICL is represented as semantic magnetic field to providing an additional voltage following Faraday’s Law, while CoT is modeled as series resistors to constrain the model output performance following Ohm’s Law. Experimental results demonstrate that the ECM effectively predicts and explains LLM performance across a variety of prompting strategies. Furthermore, we apply ECM to advanced reasoning strategy optimization on a series of tasks, such as the International Olympiad in Informatics (IOI) and the International Mathematical Olympiad (IMO), achieving competitive performance that surpasses nearly 80% of top human competitors.
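
As a rough illustration of the analogy stated in the abstract (not the paper's actual equations), one can read ICL as added voltage and CoT steps as series resistances:

```python
# Illustrative only: the functional forms here are assumptions, not ECM's.
def ecm_performance(base_voltage: float, icl_voltage: float,
                    cot_resistances: list[float]) -> float:
    total_v = base_voltage + icl_voltage   # ICL as induced voltage (Faraday)
    total_r = sum(cot_resistances)         # CoT steps as series resistors (Ohm)
    return total_v / total_r               # "current" as a proxy for performance

print(ecm_performance(1.0, 0.5, [0.2, 0.3, 0.1]))
```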

[NLP-11] Out-of-Distribution Detection using Synthetic Data Generation

【Quick Read】: This paper addresses the problem of distinguishing in-distribution (InD) from out-of-distribution (OOD) inputs when deploying classification systems. Because OOD data is typically unavailable or difficult to collect, accurate OOD detection is a significant challenge. The key to the solution is harnessing the generative capabilities of large language models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. Experiments show the method dramatically lowers false positive rates while maintaining high in-distribution accuracy, outperforming baseline methods.

Link: https://arxiv.org/abs/2502.03323
Authors: Momin Abbas, Muneeza Azmat, Raya Horesh, Mikhail Yurochkin
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a perfect zero in some cases) while maintaining high accuracy on in-distribution tasks, outperforming baseline methods by a significant margin.
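
A minimal sketch of the pipeline: synthesize OOD-proxy texts with an LLM, then train a binary InD-vs-OOD detector on top. `generate_ood` is a hypothetical stand-in for any chat-completion call; the paper's detector and prompting strategy are more elaborate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_ood_detector(ind_texts, generate_ood, n_synthetic=200):
    ood_texts = [generate_ood() for _ in range(n_synthetic)]  # LLM proxies
    texts = list(ind_texts) + ood_texts
    labels = [0] * len(ind_texts) + [1] * len(ood_texts)      # 1 = OOD
    vec = TfidfVectorizer().fit(texts)
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
    return lambda t: clf.predict_proba(vec.transform([t]))[0, 1]  # P(OOD)
```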

[NLP-12] Harmony in Divergence: Towards Fast Accurate and Memory-efficient Zeroth-order LLM Fine-tuning

【Quick Read】: This paper addresses the considerable memory demands of standard first-order (FO) fine-tuning of large language models (LLMs), which severely limit real-world deployment. The key contribution is a novel layer-wise divergence analysis revealing that FO and zeroth-order (ZO) optimization follow distinct update patterns, and, building on it, Divergence-driven Zeroth-Order (DiZO) optimization. DiZO incorporates projections into ZO updates to produce diverse-magnitude updates precisely scaled to each layer's optimization needs, markedly improving the convergence speed and accuracy of ZO optimization while maintaining throughput and cutting training GPU hours by up to 48%.

Link: https://arxiv.org/abs/2502.03304
Authors: Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Jin Lu, Geng Yuan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO method lags far behind FO method in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Aiming to resemble the learning capacity of FO method from the findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning.
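
For context, here is a minimal sketch of the forward-only, two-point zeroth-order gradient estimate that this line of work builds on (MeZO-style); DiZO's divergence-driven, layer-wise projections are the paper's contribution and are not shown.

```python
import torch

def zo_step(params, loss_fn, eps=1e-3, lr=1e-4, seed=0):
    """One two-point ZO update: two forward passes, no backprop."""
    torch.manual_seed(seed)                         # replay the same noise z
    for p in params:
        p.data.add_(eps * torch.randn_like(p))      # theta + eps*z
    loss_plus = loss_fn()
    torch.manual_seed(seed)
    for p in params:
        p.data.sub_(2 * eps * torch.randn_like(p))  # theta - eps*z
    loss_minus = loss_fn()
    g = (loss_plus - loss_minus) / (2 * eps)        # projected gradient estimate
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.data.add_(eps * z)                        # restore theta
        p.data.sub_(lr * g * z)                     # SGD step along z
    return float(loss_plus)
```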

[NLP-13] MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters

【Quick Read】: This paper addresses how varying health literacy and complex medical terminology limit the benefit of giving patients access to medical documents. The key to the solution is MeDiSumQA, a dataset built through an automated pipeline that combines LLM-based question-answer generation with manual quality checks, used to evaluate various LLMs on patient-oriented question answering. The findings show that general-purpose LLMs frequently surpass biomedical-adapted models and that automated metrics correlate with human judgment. By releasing MeDiSumQA, the authors aim to advance LLMs that enhance patient understanding and ultimately improve care outcomes.

Link: https://arxiv.org/abs/2502.03298
Authors: Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
Affiliations: NVIDIA; Institute for AI in Medicine (IKIM), University Hospital Essen, Germany; Cancer Research Center Cologne Essen (CCCE), University Medicine Essen, Germany; German Cancer Consortium (DKTK, Partner site Essen), Germany; Department of Physics, TU Dortmund, Germany
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:While increasing patients’ access to medical documents improves medical care, this benefit is limited by varying health literacy levels and complex medical terminology. Large language models (LLMs) offer solutions by simplifying medical information. However, evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline combining LLM-based question-answer generation with manual quality checks. We use this dataset to evaluate various LLMs on patient-oriented question-answering. Our findings reveal that general-purpose LLMs frequently surpass biomedical-adapted models, while automated metrics correlate with human judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the development of LLMs to enhance patient understanding and ultimately improve care outcomes.

[NLP-14] ALPET: Active Few-shot Learning for Citation Worthiness Detection in Low-Resource Wikipedia Languages

【Quick Read】: This paper addresses the scarcity of data for Citation Worthiness Detection (CWD) in low-resource language settings. The key contribution is ALPET, a framework combining Active Learning (AL) and Pattern-Exploiting Training (PET) that substantially reduces the amount of labeled data required. On Catalan, Basque, and Albanian Wikipedia datasets it outperforms the existing CCW baseline, in some cases with over 80% less labeled data. ALPET's performance plateaus after 300 labeled samples, indicating its suitability for low-resource scenarios.

Link: https://arxiv.org/abs/2502.03292
Authors: Aida Halitaj, Arkaitz Zubiaga
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 8 figures, 4 tables

Abstract:Citation Worthiness Detection (CWD) consists in determining which sentences, within an article or collection, should be backed up with a citation to validate the information it provides. This study, introduces ALPET, a framework combining Active Learning (AL) and Pattern-Exploiting Training (PET), to enhance CWD for languages with limited data resources. Applied to Catalan, Basque, and Albanian Wikipedia datasets, ALPET outperforms the existing CCW baseline while reducing the amount of labeled data in some cases above 80%. ALPET’s performance plateaus after 300 labeled samples, showing its suitability for low-resource scenarios where large, labeled datasets are not common. While specific active learning query strategies, like those employing K-Means clustering, can offer advantages, their effectiveness is not universal and often yields marginal gains over random sampling, particularly with smaller datasets. This suggests that random sampling, despite its simplicity, remains a strong baseline for CWD in resource-constrained environments. Overall, ALPET’s ability to achieve high performance with fewer labeled samples makes it a promising tool for enhancing the verifiability of online content in low-resource language settings.

[NLP-15] SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs

【Quick Read】: This paper addresses the tendency of large language models (LLMs) to hallucinate on complex reasoning problems, producing erroneous results. Researchers have incorporated knowledge graphs (KGs) to improve LLM reasoning, but existing methods have two limitations: they assume all answers are contained in the KG, neglecting KG incompleteness, and they treat the KG as a static repository, overlooking its implicit logical reasoning structures. The key contribution is SymAgent, an innovative neural-symbolic agent framework that achieves collaborative augmentation between KGs and LLMs. It conceptualizes KGs as dynamic environments and transforms complex reasoning into a multi-step interactive process so that the KG participates deeply in the reasoning. SymAgent consists of two modules: the Agent-Planner leverages the LLM's inductive reasoning to extract symbolic rules from the KG and guide efficient question decomposition, while the Agent-Executor autonomously invokes predefined action tools to integrate information from the KG and external documents, addressing KG incompleteness. A self-learning framework with online exploration and offline iterative policy-updating phases lets the agent automatically synthesize reasoning trajectories and improve. Experiments show that SymAgent with weak LLM backbones (the 7B series) matches or beats various strong baselines, and further analysis shows the agent can identify missing triples, enabling automatic KG updates.

Link: https://arxiv.org/abs/2502.03283
Authors: Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
Affiliations: DAMO Academy, Alibaba Group; School of Computer Science, Wuhan University; DAMO Academy, Alibaba Group US
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements have highlighted that Large Language Models (LLMs) are prone to hallucinations when solving complex reasoning problems, leading to erroneous results. To tackle this issue, researchers incorporate Knowledge Graphs (KGs) to improve the reasoning ability of LLMs. However, existing methods face two limitations: 1) they typically assume that all answers to the questions are contained in KGs, neglecting the incompleteness issue of KGs, and 2) they treat the KG as a static repository and overlook the implicit logical reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an innovative neural-symbolic agent framework that achieves collaborative augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments and transform complex reasoning tasks into a multi-step interactive process, enabling KGs to participate deeply in the reasoning process. SymAgent consists of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages LLM’s inductive reasoning capability to extract symbolic rules from KGs, guiding efficient question decomposition. The Agent-Executor autonomously invokes predefined action tools to integrate information from KGs and external documents, addressing the issues of KG incompleteness. Furthermore, we design a self-learning framework comprising online exploration and offline iterative policy updating phases, enabling the agent to automatically synthesize reasoning trajectories and improve performance. Experimental results demonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields better or comparable performance compared to various strong baselines. Further analysis reveals that our agent can identify missing triples, facilitating automatic KG updates.

[NLP-16] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning

【Quick Read】: This paper addresses the long inputs and heavy compute that chain-of-thought (CoT) data imposes when LLMs reason and plan. The key contribution is a hybrid representation that partially abstracts away the initial reasoning steps with latent discrete tokens generated by a VQ-VAE, significantly shortening reasoning traces. The approach is tested by training a model from scratch on the Keys-Finding Maze problem and by fine-tuning LLMs on logical and mathematical reasoning; a simple training procedure that randomly mixes latent and text tokens enables fast adaptation to the new latent tokens, and the approach consistently outperforms baselines across benchmarks.

Link: https://arxiv.org/abs/2502.03275
Authors: DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, Qinqing Zheng
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data, where the step-by-step thought process is explicitly outlined by text tokens. However, this results in lengthy inputs where many words support textual coherence rather than core reasoning information, and processing these inputs consumes substantial computation resources. In this work, we propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens generated by VQ-VAE, significantly reducing the length of reasoning traces. We explore the use of latent trace abstractions in two scenarios: 1) training the model from scratch for the Keys-Finding Maze problem, 2) fine-tuning LLMs on this hybrid data with an extended vocabulary including unseen latent tokens, for both logical and mathematical reasoning problems. To facilitate effective learning, we introduce a simple training procedure that randomly mixes latent and text tokens, which enables fast adaptation to new latent tokens. Our approach consistently outperforms the baseline methods in various benchmarks.
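
The random mixing procedure admits a very small sketch: replace a random-length prefix of the chain-of-thought with its latent codes. Here `latent_codes` stands in for VQ-VAE codes, each assumed to abstract a fixed-size chunk of text tokens; the chunking and codebook details are the paper's.

```python
import random

def mix_tokens(text_tokens: list[int], latent_codes: list[int],
               chunk: int = 16) -> list[int]:
    """Abstract a random prefix of the CoT: k latent codes replace k chunks."""
    max_chunks = min(len(latent_codes), len(text_tokens) // chunk)
    k = random.randint(0, max_chunks)      # resampled for every training example
    return latent_codes[:k] + text_tokens[k * chunk:]
```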

[NLP-17] Efficient extraction of medication information from clinical notes: an evaluation in two languages

【Quick Read】: This paper evaluates the accuracy, computational cost, and portability of a new natural language processing (NLP) method for extracting medication information from clinical narratives. The key to the solution is a transformer-based architecture for extracting entities and their relations that achieves performance competitive with the state of the art on both French and English clinical documents (F-measures 0.82 and 0.96 vs. 0.81 and 0.95) while reducing computational cost by a factor of 10. End-to-end medication information extraction reaches F1 scores of 0.69 on French text and 0.82 on English text, demonstrating high performance at low computational cost, a good fit for typically limited hospital IT resources.

Link: https://arxiv.org/abs/2502.03257
Authors: Thibaut Fabacher, Erik-André Sauleau, Emmanuelle Arcay, Bineta Faye, Maxime Alter, Archia Chahard, Nathan Miraillet, Adrien Coulet, Aurélie Névéol
Affiliations: University hospital of Strasbourg; ICube Laboratory, Strasbourg, France; Inria, Inserm, Université Paris Cité, U1346 HeKA, Paris, France; Université Paris-Saclay, CNRS, LISN, Orsay, France
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Submitted to JAMIA, 17 pages, 3 figures, 2 tables and 5 supplementary tables

Abstract:Objective: To evaluate the accuracy, computational cost and portability of a new Natural Language Processing (NLP) method for extracting medication information from clinical narratives. Materials and Methods: We propose an original transformer-based architecture for the extraction of entities and their relations pertaining to patients’ medication regimen. First, we used this approach to train and evaluate a model on French clinical notes, using a newly annotated corpus from Hôpitaux Universitaires de Strasbourg. Second, the portability of the approach was assessed by conducting an evaluation on clinical documents in English from the 2018 n2c2 shared task. Information extraction accuracy and computational cost were assessed by comparison with an available method using transformers. Results: The proposed architecture achieves on the task of relation extraction itself performance that are competitive with the state-of-the-art on both French and English (F-measures 0.82 and 0.96 vs 0.81 and 0.95), but reduce the computational cost by 10. End-to-end (Named Entity recognition and Relation Extraction) F1 performance is 0.69 and 0.82 for French and English corpus. Discussion: While an existing system developed for English notes was deployed in a French hospital setting with reasonable effort, we found that an alternative architecture offered end-to-end drug information extraction with comparable extraction performance and lower computational impact for both French and English clinical text processing, respectively. Conclusion: The proposed architecture can be used to extract medication information from clinical text with high performance and low computational cost and consequently suits with usually limited hospital IT resources

[NLP-18] How do Humans and Language Models Reason About Creativity? A Comparative Analysis

【Quick Read】: This paper investigates the cognitive processes and biases behind human and AI evaluations of creativity in science and engineering. The key lies in two experiments examining whether including example solutions with ratings affects creativity evaluation, and in analyzing which facets experts and state-of-the-art language models (LLMs) emphasize when rating originality, such as uncommonness and remoteness. Results show that providing examples improves the models' accuracy in predicting true originality scores while also substantially increasing the correlations between the individual facets and originality, revealing differences and diverging preferences between humans and AI when evaluating creative ideas.

Link: https://arxiv.org/abs/2502.03253
Authors: Antonio Laverghetta Jr., Tuhin Chakrabarty, Tom Hope, Jimmy Pronchick, Krupa Bhawsar, Roger E. Beaty
Affiliations: Pennsylvania State University; Stony Brook University; Hebrew University of Jerusalem
Subjects: Computation and Language (cs.CL)
Comments: CogSci 2025

Abstract:Creativity assessment in science and engineering is increasingly based on both human and AI judgment, but the cognitive processes and biases behind these evaluations remain poorly understood. We conducted two experiments examining how including example solutions with ratings impact creativity evaluation, using a fine-grained annotation protocol where raters were tasked with explaining their originality scores and rating for the facets of remoteness (whether the response is “far” from everyday ideas), uncommonness (whether the response is rare), and cleverness. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training, comparing those who received example solutions with ratings (example) to those who did not (no example). Computational text analysis revealed that, compared to experts with examples, no-example experts used more comparative language (e.g., “better/worse”) and emphasized solution uncommonness, suggesting they may have relied more on memory retrieval for comparisons. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality, suggesting an evaluative process rooted around the semantic similarity of ideas. In the example condition, while LLM accuracy in predicting the true originality scores improved, the correlations of remoteness, uncommonness, and cleverness with originality also increased substantially - to upwards of 0.99 - suggesting a homogenization in the LLMs evaluation of the individual facets. These findings highlight important implications for how humans and AI reason about creativity and suggest diverging preferences for what different populations prioritize when rating.

[NLP-19] A scale of conceptual orality and literacy: Automatic text categorization in the tradition of “Nähe und Distanz”

【Quick Read】: This paper addresses the missing statistical foundation for applying Koch and Oesterreicher's "Nähe und Distanz" model in corpus-linguistic analysis. The key contribution is a scale of conceptual orality and literacy established via principal component analysis (PCA) and combined with automatic analysis. Analyzing two corpora of New High German, the paper finds that features of conceptual orality and literacy must be distinguished in order to rank texts in a differentiated manner. Compared with Biber's Dimension 1, this theory-driven, "tailored" dimension is particularly suitable for supporting and controlling tasks.

Link: https://arxiv.org/abs/2502.03252
Authors: Volker Emmrich
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Koch and Oesterreicher’s model of “Nähe und Distanz” (Nähe = immediacy, conceptual orality; Distanz = distance, conceptual literacy) is constantly used in German linguistics. However, there is no statistical foundation for use in corpus linguistic analyzes, while it is increasingly moving into empirical corpus linguistics. Theoretically, it is stipulated, among other things, that written texts can be rated on a scale of conceptual orality and literacy by linguistic features. This article establishes such a scale based on PCA and combines it with automatic analysis. Two corpora of New High German serve as examples. When evaluating established features, a central finding is that features of conceptual orality and literacy must be distinguished in order to rank texts in a differentiated manner. The scale is also discussed with a view to its use in corpus compilation and as a guide for analyzes in larger corpora. With a theory-driven starting point and as a “tailored” dimension, the approach compared to Biber’s Dimension 1 is particularly suitable for these supporting, controlling tasks.
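
A minimal sketch of the statistical core, projecting texts onto the first principal component of their linguistic-feature frequencies; the feature inventory and corpus preprocessing in the paper are what give the resulting dimension its interpretation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rows = texts, columns = normalized frequencies of linguistic features
# (e.g., first-person pronouns, particles, nominalizations); random here.
features = np.random.rand(100, 12)
scale = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(features))
ranking = scale[:, 0].argsort()   # texts ordered along the orality-literacy scale
```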

[NLP-20] Mitigating Language Bias in Cross-Lingual Job Retrieval: A Recruitment Platform Perspective AAAI2025

【Quick Read】: This paper addresses the limited overall generalizability caused by disjointed analysis of the textual components of resumes and job postings on online recruitment platforms. The key to the solution is a unified sentence encoder trained with a multi-task dual-encoder framework to learn multiple components jointly, improving overall performance on recruitment-related text processing. The paper also introduces a novel metric, Language Bias Kullback-Leibler Divergence (LBKL), to evaluate language bias in the encoder, demonstrating significant bias reduction and superior cross-lingual performance.

Link: https://arxiv.org/abs/2502.03220
Authors: Napat Laosaengpha, Thanit Tativannarat, Attapol Rutherford, Ekapol Chuangsuwanich
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: To be published in CompJobs Workshop at AAAI 2025

Abstract:Understanding the textual components of resumes and job postings is critical for improving job-matching accuracy and optimizing job search systems in online recruitment platforms. However, existing works primarily focus on analyzing individual components within this information, requiring multiple specialized tools to analyze each aspect. Such disjointed methods could potentially hinder overall generalizability in recruitment-related text processing. Therefore, we propose a unified sentence encoder that utilized multi-task dual-encoder framework for jointly learning multiple component into the unified sentence encoder. The results show that our method outperforms other state-of-the-art models, despite its smaller model size. Moreover, we propose a novel metric, Language Bias Kullback-Leibler Divergence (LBKL), to evaluate language bias in the encoder, demonstrating significant bias reduction and superior cross-lingual performance.

[NLP-21] iVISPAR – An Interactive Visual-Spatial Reasoning Benchmark for VLMs

【Quick Read】: This paper addresses the limitations of vision-language models (VLMs) in spatial reasoning and visual alignment. It introduces iVISPAR, an interactive multi-modal benchmark for evaluating the spatial reasoning capabilities of VLMs acting as agents. iVISPAR is based on a variant of the sliding tile puzzle and supports visual 2D, 3D, and text-based input modalities, enabling comprehensive assessment of VLMs' planning and reasoning skills. The key lies in this comprehensive benchmark design, which exposes current VLMs' shortcomings on complex spatial tasks and highlights their limitations in achieving human-level cognition.

Link: https://arxiv.org/abs/2502.03214
Authors: Julius Mayer, Mohamad Ballout, Serwan Jassim, Farbod Nosrat Nezami, Elia Bruni
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. iVISPAR is based on a variant of the sliding tile puzzle, a classic problem that demands logical planning, spatial awareness, and multi-step reasoning. The benchmark supports visual 2D, 3D, and text-based input modalities, enabling comprehensive assessments of VLMs’ planning and reasoning skills. We evaluate a broad suite of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task’s complexity and feasibility for humans. Results indicate that while some VLMs perform well on simple spatial tasks, they encounter difficulties with more complex configurations and problem properties. Notably, while VLMs generally perform better in 2D vision compared to 3D or text-based representations, they consistently fall short of human performance, illustrating the persistent challenge of visual alignment. This highlights critical gaps in current VLM capabilities, highlighting their limitations in achieving human-level cognition.

[NLP-22] Improve Decoding Factuality by Token-wise Cross Layer Entropy of Large Language Models NAACL2025

【Quick Read】: This paper addresses the hallucination problem of large language models (LLMs), which often generate inaccurate or fabricated content even when they possess the correct knowledge. The key contribution is cross-layer Entropy eNhanced Decoding (END). END leverages inner probability changes across layers to quantify, token by token, the factual knowledge required for each candidate token and adjusts the final predicting distribution to prioritize more factual tokens, mitigating hallucinations without extra training.

Link: https://arxiv.org/abs/2502.03199
Authors: Jialiang Wu, Yi Shen, Sijia Liu, Yi Tang, Sen Song, Xiaoyi Wang, Longjun Cai
Affiliations: Harbin Institute of Technology; Beijing Wispirit Technology; Xuanwu Hospital; Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NAACL 2025 Findings

Abstract:Despite their impressive capacities, Large language models (LLMs) often struggle with the hallucination issue of generating inaccurate or fabricated content even when they possess correct knowledge. In this paper, we extend the exploration of the correlation between hidden-state prediction changes and output factuality into a deeper, token-wise level. Based on these insights, we propose cross-layer Entropy eNhanced Decoding (END), a decoding method that mitigates hallucinations without requiring extra training. END leverages inner probability changes across layers to individually quantify the factual knowledge required for each candidate token, and adjusts the final predicting distribution to prioritize tokens with higher factuality. Experiments on both hallucination and QA benchmarks demonstrate that END significantly enhances the truthfulness and informativeness of generated content while maintaining robust QA accuracy. Moreover, our work provides a deeper perspective on understanding the correlations between inherent knowledge and output factuality.
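
A rough sketch of the general idea, reweighting the final distribution by how a token's probability evolves across layers; the weighting below is an illustrative assumption, not END's actual formulation.

```python
import torch

def entropy_adjusted_distribution(layer_logits: list[torch.Tensor]) -> torch.Tensor:
    """layer_logits: next-token logits from each layer, each of shape [vocab]."""
    probs = torch.stack([l.softmax(-1) for l in layer_logits])   # [L, vocab]
    traj = probs / probs.sum(0, keepdim=True)                    # per-token layer trajectory
    cross_layer_entropy = -(traj * traj.clamp_min(1e-9).log()).sum(0)
    # Illustrative choice: penalize tokens whose probability mass is spread
    # erratically across layers rather than consolidated.
    adjusted = probs[-1].clamp_min(1e-9).log() - cross_layer_entropy
    return adjusted.softmax(-1)
```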

[NLP-23] EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching

【Quick Read】: This paper addresses the scarcity of code-switching (CS) data for Basque and Spanish, languages in contact in the north of the Iberian Peninsula. The key to the solution is a naturally sourced corpus (EuskañolDS) built by identifying code-switched texts in existing corpora with language identification models and then manually validating them to obtain a reliable subset of CS instances.

Link: https://arxiv.org/abs/2502.03188
Authors: Maite Heredia, Jeremy Barnes, Aitor Soroa
Affiliation: HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and support the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We introduce a first approach to develop a naturally sourced corpus for Basque-Spanish code-switching. Our methodology consists of identifying CS texts from previously available corpora using language identification models, which are then manually validated to obtain a reliable subset of CS instances. We present the properties of our corpus and make it available under the name EuskañolDS.

[NLP-24] Scalable In-Context Learning on Tabular Data via Retrieval-Augmented Large Language Models

【Quick Read】: This paper addresses the restriction of large language models (LLMs) to few-shot scenarios on tabular data due to sequence-length limits. The key to the solution is retrieval-augmented LLMs tailored to tabular data, combining a customized retrieval module with retrieval-guided instruction tuning so that LLMs can effectively leverage larger datasets, achieving significantly improved performance and promising scaling behavior.

Link: https://arxiv.org/abs/2502.03147
Authors: Xumeng Wen, Shun Zheng, Zhen Xu, Yiming Sun, Jiang Bian
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Recent studies have shown that large language models (LLMs), when customized with post-training on tabular data, can acquire general tabular in-context learning (TabICL) capabilities. These models are able to transfer effectively across diverse data schemas and different task domains. However, existing LLM-based TabICL approaches are constrained to few-shot scenarios due to the sequence length limitations of LLMs, as tabular instances represented in plain text consume substantial tokens. To address this limitation and enable scalable TabICL for any data size, we propose retrieval-augmented LLMs tailored to tabular data. Our approach incorporates a customized retrieval module, combined with retrieval-guided instruction-tuning for LLMs. This enables LLMs to effectively leverage larger datasets, achieving significantly improved performance across 69 widely recognized datasets and demonstrating promising scaling behavior. Extensive comparisons with state-of-the-art tabular models reveal that, while LLM-based TabICL still lags behind well-tuned numeric models in overall performance, it uncovers powerful algorithms under limited contexts, enhances ensemble diversity, and excels on specific datasets. These unique properties underscore the potential of language as a universal and accessible interface for scalable tabular data learning.
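
A minimal sketch of the retrieval step: fetch the k nearest training rows to the query row and serialize them into the prompt. The paper's customized retrieval module and retrieval-guided instruction tuning go beyond this nearest-neighbor stand-in.

```python
import numpy as np

def build_tabicl_prompt(X_train, y_train, x_query, columns, k=8) -> str:
    """Serialize the k most similar training rows as in-context examples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # simple retriever
    lines = []
    for i in dists.argsort()[:k]:
        feats = ", ".join(f"{c}={v}" for c, v in zip(columns, X_train[i]))
        lines.append(f"{feats} -> label: {y_train[i]}")
    query = ", ".join(f"{c}={v}" for c, v in zip(columns, x_query))
    return "\n".join(lines) + f"\n{query} -> label:"
```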

[NLP-25] Teaching Large Language Models Number-Focused Headline Generation With Key Element Rationales NAACL2025

【Quick Read】: This paper addresses the unique challenge that number-focused headline generation poses for large language models (LLMs): it requires both high textual quality and precise numerical accuracy, whereas existing studies address only one of the two. The key contribution is a novel chain-of-thought framework whose rationales comprise the key elements of Topic, Entities, and Numerical reasoning (TEN), enhancing LLMs' ability to generate topic-aligned, high-quality text with precise numbers. Specifically, a teacher LLM generates TEN rationales as supervision data, which are then used to teach and fine-tune a student LLM, enabling the student to automatically generate rationales with improved numerical reasoning and topic-aligned numerical headline generation. Experiments show superior performance in both textual quality and numerical accuracy.

Link: https://arxiv.org/abs/2502.03129
Authors: Zhen Qian, Xiuzhen Zhang, Xiaofei Xu, Feng Xia
Affiliation: School of Computing Technologies, RMIT University, Australia
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Pre-print for a paper accepted to findings of NAACL 2025

Abstract:Number-focused headline generation is a summarization task requiring both high textual quality and precise numerical accuracy, which poses a unique challenge for Large Language Models (LLMs). Existing studies in the literature focus only on either textual quality or numerical reasoning and thus are inadequate to address this challenge. In this paper, we propose a novel chain-of-thought framework for using rationales comprising key elements of the Topic, Entities, and Numerical reasoning (TEN) in news articles to enhance the capability for LLMs to generate topic-aligned high-quality texts with precise numerical accuracy. Specifically, a teacher LLM is employed to generate TEN rationales as supervision data, which are then used to teach and fine-tune a student LLM. Our approach teaches the student LLM automatic generation of rationales with enhanced capability for numerical reasoning and topic-aligned numerical headline generation. Experiments show that our approach achieves superior performance in both textual quality and numerical accuracy.

[NLP-26] Policies and Evaluation for Online Meeting Summarization

【Quick Read】: This paper tackles online meeting summarization, in contrast to prior work that treats meeting summarization as an offline task performed after the meeting concludes. The key to the solution is a set of policies for conducting online summarization together with novel metrics for latency and partial summary quality. Experiments show that online models can produce strong summaries and that adaptive policies outperform fixed-schedule ones. These findings provide a starting point for the wider research community to explore this important task.

Link: https://arxiv.org/abs/2502.03111
Authors: Felix Schneider (1), Marco Turchi (1), Alex Waibel (2) ((1) Zoom Communications, (2) Karlsruhe Institute of Technology)
Affiliations: Zoom Video Communications; Karlsruhe Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure

Abstract:With more and more meetings moving to a digital domain, meeting summarization has recently gained interest in both academic and commercial research. However, prior academic research focuses on meeting summarization as an offline task, performed after the meeting concludes. In this paper, we perform the first systematic study of online meeting summarization. For this purpose, we propose several policies for conducting online summarization. We discuss the unique challenges of this task compared to the offline setting and define novel metrics to evaluate latency and partial summary quality. The experiments on the AutoMin dataset show that 1) online models can produce strong summaries, 2) our metrics allow a detailed analysis of different systems’ quality-latency trade-off, also taking into account intermediate outputs and 3) adaptive policies perform better than fixed scheduled ones. These findings provide a starting point for the wider research community to explore this important task.
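
Schematically, the contrast between a fixed-schedule and an adaptive policy can be sketched as below; the adaptive trigger shown (amount of unsummarized content) is a toy stand-in for the paper's policies.

```python
def online_summarize(segments, summarize, fixed_n=None, min_new_words=50):
    """Emit partial summaries while the transcript arrives segment by segment."""
    buffer, partials = [], []
    for seg in segments:
        buffer.append(seg)
        fixed_due = fixed_n is not None and len(buffer) >= fixed_n
        adaptive_due = (fixed_n is None and
                        sum(len(s.split()) for s in buffer) >= min_new_words)
        if fixed_due or adaptive_due:
            partials.append(summarize(" ".join(buffer)))
            buffer.clear()
    if buffer:                                   # flush at meeting end
        partials.append(summarize(" ".join(buffer)))
    return partials
```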

[NLP-27] Structured Token Retention and Computational Memory Paths in Large Language Models

【Quick Read】: This paper addresses the inefficient memory utilization and premature information loss that conventional approaches cause when processing long sequences. The key to the solution is Structured Token Retention (STR) and Computational Memory Paths (CMP): STR dynamically adjusts token persistence based on contextual significance, and CMP extends it with hierarchical memory allocation. Together they optimize how memory resources are assigned, improve token survival rates across long input sequences, reduce cumulative error propagation, and optimize information retrieval efficiency in large-scale generative architectures.

Link: https://arxiv.org/abs/2502.03102
Authors: Jonathan Delena, Augustin Moreau, Dominic Ravensdale, Frederick Chatterton
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Memory retention mechanisms play a central role in determining the efficiency of computational architectures designed for processing extended sequences. Conventional methods for token management often impose fixed retention thresholds or rely on uniform attention weight distributions, leading to inefficient memory utilization and premature information loss in extended sequence modeling. Structured Token Retention (STR) introduces a probabilistic selection framework that dynamically adjusts token persistence based on contextual significance, ensuring that computational resources are allocated to semantically relevant elements. Computational Memory Paths (CMP) extend this framework through hierarchical memory allocation, refining retention efficiency through structured reallocation of token embeddings. Comparative assessments against baseline models demonstrate that STR and CMP improve token survival rates across long input sequences while reducing cumulative error propagation across processing layers. Experimental results further indicate reductions in computational overhead, improving inference speed without degrading contextual coherence. Token distribution analyses reveal that structured memory allocation prevents excessive redundancy in attention weight calculations, optimizing information retrieval efficiency in large-scale generative architectures. The integration of STR and CMP into an open-source model illustrates the adaptability of structured memory retention methodologies, highlighting their applicability in generative text processing, long-context comprehension, and scalable sequence modeling.

[NLP-28] IAO Prompting: Making Knowledge Flow Explicit in LLMs through Structured Reasoning Templates AAAI2025

【Quick Read】: This paper addresses the difficulty of understanding and validating how large language models (LLMs) use their knowledge during complex reasoning. The key contribution is Input-Action-Output (IAO) prompting, which decomposes problems into sequential steps that explicitly identify the input knowledge being used, the action being performed, and the resulting output, thereby explicitly modeling how LLMs access and apply their knowledge. The method improves zero-shot performance and makes the models' use of stored knowledge more transparent.

Link: https://arxiv.org/abs/2502.03080
Authors: Aissatou Diallo, Antonis Bikakis, Luke Dickens, Anthony Hunter, Rob Miller
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted as Oral at KnowFM @ AAAI 2025

Abstract:While Large Language Models (LLMs) demonstrate impressive reasoning capabilities, understanding and validating their knowledge utilization remains challenging. Chain-of-thought (CoT) prompting partially addresses this by revealing intermediate reasoning steps, but the knowledge flow and application remain implicit. We introduce IAO (Input-Action-Output) prompting, a structured template-based method that explicitly models how LLMs access and apply their knowledge during complex reasoning tasks. IAO decomposes problems into sequential steps, each clearly identifying the input knowledge being used, the action being performed, and the resulting output. This structured decomposition enables us to trace knowledge flow, verify factual consistency, and identify potential knowledge gaps or misapplications. Through experiments across diverse reasoning tasks, we demonstrate that IAO not only improves zero-shot performance but also provides transparency in how LLMs leverage their stored knowledge. Human evaluation confirms that this structured approach enhances our ability to verify knowledge utilization and detect potential hallucinations or reasoning errors. Our findings provide insights into both knowledge representation within LLMs and methods for more reliable knowledge application.
zh

[NLP-29] DOLFIN – Document-Level Financial test set for Machine Translation NAACL2025

【速读】: 该论文旨在解决文档级机器翻译(Document-level Machine Translation, MT)测试集稀缺的问题,特别是在专业领域如法律和金融方面的不足。现有测试集主要覆盖通用领域,并且仍遵循句子级逻辑,无法涵盖某些语言现象如信息重组。为了解决这些问题,论文提出了一种新的测试集DOLFIN,该数据集由专业金融文档构建,并以章节而非句子为单位提供数据,从而实现真正的文档级MT。关键在于放弃完美对齐句子的范式,转而采用章节级别的数据组织方式。

链接: https://arxiv.org/abs/2502.03053
作者: Mariam Nakhlé,Marco Dinarelli,Raheel Qader,Emmanuelle Esperança-Rodier,Hervé Blanchon
机构: Univ. Grenoble Alpes, CNRS, Grenoble INP(格勒诺布尔阿尔卑斯大学, 法国国家科学研究中心, 格勒诺布尔国立理工学院); Institute of Engineering Univ. Grenoble Alpes, LIG, 38000 Grenoble, France(格勒诺布尔阿尔卑斯大学工程学院, LIG, 法国38000格勒诺布尔); Lingua Custodia(林古阿·库斯托迪亚), 75008 Paris, France(法国巴黎75008)
类目: Computation and Language (cs.CL)
备注: To be published in NAACL 2025 Findings

点击查看摘要

Abstract:Despite the strong research interest in document-level Machine Translation (MT), the test sets dedicated to this task are still scarce. The existing test sets mainly cover topics from the general domain and fall short on specialised domains, such as legal and financial. Also, in spite of their document-level aspect, they still follow a sentence-level logic that does not allow for including certain linguistic phenomena such as information reorganisation. In this work, we aim to fill this gap by proposing a novel test set: DOLFIN. The dataset is built from specialised financial documents, and it makes a step towards true document-level MT by abandoning the paradigm of perfectly aligned sentences, presenting data in units of sections rather than sentences. The test set consists of an average of 1950 aligned sections for five language pairs. We present a detailed data collection pipeline that can serve as inspiration for aligning new document-level datasets. We demonstrate the usefulness and quality of this test set by evaluating a number of models. Our results show that the test set is able to discriminate between context-sensitive and context-agnostic models and shows the weaknesses when models fail to accurately translate financial texts. The test set is made public for the community.
zh

[NLP-30] Knowledge Distillation from Large Language Models for Household Energy Modeling

【速读】: 该论文旨在解决因数据隐私限制导致的智能电网研究中现实且多样化数据获取困难的问题。论文的关键解决方案是通过整合大型语言模型(Large Language Models, LLMs)生成具有文化敏感性和行为特异性的家庭能源使用数据。论文采用五种不同的LLMs,并通过四阶段方法合成包括文化活动、天气范围、暖通空调操作以及独特的“能源特征”在内的日常数据,从而实现这一目标。此外,论文还探讨了一种直接集成外部天气数据集的方法,以确保物理上一致的数据输入。这种方法不仅有助于理解文化、气候和行为因素如何共同影响碳排放,也为基于场景的能源优化提供了成本效益高的途径。

链接: https://arxiv.org/abs/2502.03034
作者: Mohannad Takrouri,Nicolás M. Cuadrado,Martin Takáč
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Source code is available at this https URL

点击查看摘要

Abstract:Machine learning (ML) is increasingly vital for smart-grid research, yet restricted access to realistic, diverse data - often due to privacy concerns - slows progress and fuels doubts within the energy sector about adopting ML-based strategies. We propose integrating Large Language Models (LLMs) in energy modeling to generate realistic, culturally sensitive, and behavior-specific data for household energy usage across diverse geographies. In this study, we employ and compare five different LLMs to systematically produce family structures, weather patterns, and daily consumption profiles for households in six distinct countries. A four-stage methodology synthesizes contextual daily data, including culturally nuanced activities, realistic weather ranges, HVAC operations, and distinct `energy signatures’ that capture unique consumption footprints. Additionally, we explore an alternative strategy where external weather datasets can be directly integrated, bypassing intermediate weather modeling stages while ensuring physically consistent data inputs. The resulting dataset provides insights into how cultural, climatic, and behavioral factors converge to shape carbon emissions, offering a cost-effective avenue for scenario-based energy optimization. This approach underscores how prompt engineering, combined with knowledge distillation, can advance sustainable energy research and climate mitigation efforts. Source code is available at this https URL .
zh

[NLP-31] Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

【速读】: 该论文旨在解决大型语言模型中特征在不同层之间如何演变及其对模型行为的影响。关键在于提出一种基于无数据余弦相似度的技术,用于系统地映射稀疏自动编码器在连续层中发现的特征,并追踪这些特征在每一阶段的保持、转变或首次出现情况。这种方法生成细粒度的特征演化流程图,实现对模型计算机制的细粒度可解释性与机理洞察,并提供了通过放大或抑制选定特征来直接操控模型行为的新手段,以实现文本生成中的目标主题控制。
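
以下 numpy 草图(笔者根据摘要思路写成的假设实现,矩阵规模为示意值)演示"无数据"特征匹配的核心操作:直接对相邻两层 SAE 的解码器权重计算余弦相似度,而无需任何输入样本:

```python
import numpy as np

def match_features(dec_l, dec_l1):
    """dec_l, dec_l1: 相邻两层 SAE 的解码器权重矩阵,形状 (n_features, d_model)。
    返回每个 l 层特征在 l+1 层中余弦相似度最高的特征索引及相似度值。"""
    a = dec_l / np.linalg.norm(dec_l, axis=1, keepdims=True)
    b = dec_l1 / np.linalg.norm(dec_l1, axis=1, keepdims=True)
    sims = a @ b.T                       # (n_l, n_{l+1}) 余弦相似度矩阵
    best = sims.argmax(axis=1)
    return best, sims[np.arange(len(best)), best]

rng = np.random.default_rng(0)
dec_l, dec_l1 = rng.normal(size=(512, 64)), rng.normal(size=(512, 64))
idx, score = match_features(dec_l, dec_l1)
# 相似度低于某阈值的特征可视为在下一层"消失",高于阈值的视为"延续/转变"
print(idx[:5], score[:5])
```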

链接: https://arxiv.org/abs/2502.03032
作者: Daniil Laptev,Nikita Balagansky,Yaroslav Aksenov,Daniil Gavrilov
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce a new approach to systematically map features discovered by sparse autoencoder across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
zh

[NLP-32] Scaling Laws for Upcycling Mixture-of-Experts Language Models

【速读】: 该论文旨在研究通过再利用(upcycling)较小的语言模型来训练大型混合专家模型(Mixture-of-Experts, MoE),以缓解大规模语言模型(LLMs)预训练过程中的计算资源需求。论文的关键在于探索并识别出描述性能如何依赖于数据集大小和模型配置的经验性缩放规律。研究发现,尽管增加这些因素可以提升性能,但在大规模计算预算下,密集训练数据集与再利用训练数据集之间的新型交互项限制了再利用方法的效率。基于这些发现,论文提供了如何优化再利用策略的指导,并确立了在预算约束条件下再利用优于从头开始训练的条件。

链接: https://arxiv.org/abs/2502.03009
作者: Seng Pei Liew,Takuya Kato,Sho Takase
机构: SB Intuitions (SB直觉)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 15 figures, 8 tables

点击查看摘要

Abstract:Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches of mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, of which the scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training dataset that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance to scale upcycling, and establish conditions under which upcycling outperforms from-scratch trainings within budget constraints.
zh

[NLP-33] MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医学和生物学等专业领域应用时,存在的事实准确性、可靠性和上下文深度不足的问题。解决方案的关键在于引入MedBioLM,这是一个经过领域适应的生物医学问答模型,通过结合微调(fine-tuning)和检索增强生成(retrieval-augmented generation, RAG),MedBioLM能够动态整合特定领域的知识,从而提升推理能力和事实准确性。
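
下面用一个极简 Python 草图(纯属示意,检索器与提示模板均为笔者假设,并非 MedBioLM 的官方实现)说明 RAG 的基本流程:先按向量相似度检索领域文档,再将检索结果拼入提示交给微调后的模型:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # 余弦相似度检索;query_vec/doc_vecs 假定来自同一个文本编码器
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    top = np.argsort(d @ q)[::-1][:k]
    return [docs[i] for i in top]

def build_rag_prompt(question, passages):
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer the biomedical question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

rng = np.random.default_rng(0)
docs = ["Aspirin inhibits COX enzymes.", "Insulin lowers blood glucose.",
        "Penicillin targets bacterial cell walls."]
doc_vecs = rng.normal(size=(len(docs), 32))   # 占位嵌入,实际应由编码器生成
prompt = build_rag_prompt("How does insulin work?",
                          retrieve(rng.normal(size=32), doc_vecs, docs))
print(prompt)
```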

链接: https://arxiv.org/abs/2502.03004
作者: Seonok Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across natural language processing tasks. However, their application to specialized domains such as medicine and biology requires further optimization to ensure factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a domain-adapted biomedical question-answering model designed to enhance both short-form and long-form queries. By integrating fine-tuning and retrieval-augmented generation (RAG), MedBioLM dynamically incorporates domain-specific knowledge, improving reasoning abilities and factual accuracy. To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA datasets, covering structured multiple-choice assessments and complex clinical reasoning tasks. Fine-tuning significantly improves accuracy on benchmark datasets, while RAG enhances factual consistency. These results highlight the potential of domain-optimized LLMs in advancing biomedical research, medical education, and clinical decision support.
zh

[NLP-34] Training an LLM-as-a-Judge Model: Pipeline Insights and Practical Lessons WWW'25

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)作为评估判官的应用问题。论文的关键解决方案是引入Themis,一个经过微调的LLM评估器,能够提供复杂的上下文感知评估。Themis通过场景依赖的评估提示和两种新的可控指令生成方法实现这一目标。这些设计使得Themis能够有效地从教师模型中提取评估技能,同时保持持续发展的灵活性。此外,论文观察到单纯从强LLM进行知识蒸馏虽然常见,却并不能保证性能随规模扩展而提升,为此提出了一种基于指令遵循难度的缓解策略,并提供了数据平衡、提示定制、多目标训练和指标聚合的实用指南。

链接: https://arxiv.org/abs/2502.02988
作者: Renjun Hu,Yi Cheng,Libin Meng,Jiaxin Xia,Yi Zong,Xing Shi,Wei Lin
机构: Alibaba Cloud Computing (阿里云); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: accepted at WWW’25 (Industrial Track), extended version

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has opened new possibilities for their adoption as evaluative judges. This paper introduces Themis, a fine-tuned LLM judge that delivers sophisticated context-aware evaluations. We provide a comprehensive overview of the development pipeline for Themis, highlighting its scenario-dependent evaluation prompts and two novel methods for controlled instruction generation. These designs enable Themis to effectively distill evaluative skills from teacher models, while retaining flexibility for continuous development. We introduce two human-labeled benchmarks for meta-evaluation, demonstrating that Themis can achieve high alignment with human preferences in an economical manner. Additionally, we explore insights into the LLM-as-a-judge paradigm, revealing nuances in performance and the varied effects of reference answers. Notably, we observe that pure knowledge distillation from strong LLMs, though common, does not guarantee performance improvement through scaling. We propose a mitigation strategy based on instruction-following difficulty. Furthermore, we provide practical guidelines covering data balancing, prompt customization, multi-objective training, and metric aggregation. We aim for our method and findings, along with the fine-tuning data, benchmarks, and model checkpoints, to support future research and development in this area.
zh

[NLP-35] Position: Editing Large Language Models Poses Serious Safety Risks

【速读】: 该论文旨在解决大型语言模型(LLMs)中的知识更新所带来的安全风险。论文指出,现有的知识编辑方法(KEs)虽然在技术上有效,但存在严重的安全隐患,包括被恶意利用的风险以及缺乏验证机制导致的生态系统漏洞。论文的关键解决方案在于呼吁研究抗篡改模型及针对恶意模型编辑的对策,并积极保障AI生态系统的安全性。

链接: https://arxiv.org/abs/2502.02958
作者: Paul Youssef,Zhixue Zhao,Daniel Braun,Jörg Schlötterer,Christin Seifert
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note the fact that KEs are widely available, computationally inexpensive, highly performant, and stealthy makes them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.
zh

[NLP-36] ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation

【速读】: 该论文旨在解决现有移动AI代理在处理任务时倾向于关注每个步骤中最相关的元素,从而导致局部最优解并忽略整体GUI流程的问题。解决方案的关键在于构建了一个名为MobileReach的训练数据集,并提出了一种名为ReachAgent的两阶段框架。该框架通过分解任务为页面到达和操作子任务,并结合基于奖励的偏好GUI流程来增强代理的任务完成能力。

链接: https://arxiv.org/abs/2502.02955
作者: Qinzhuo Wu,Wei Liu,Jian Luan,Bin Wang
机构: XiaoMi AI Lab (小米AI实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, mobile AI agents have gained increasing attention. Given a task, mobile AI agents can interact with mobile devices in multiple steps and finally form a GUI flow that solves the task. However, existing agents tend to focus on most task-relevant elements at each step, leading to local optimal solutions and ignoring the overall GUI flow. To address this issue, we constructed a training dataset called MobileReach, which breaks the task into page reaching and operation subtasks. Furthermore, we propose ReachAgent, a two-stage framework that focuses on improving its task-completion abilities. It utilizes the page reaching and page operation subtasks, along with reward-based preference GUI flows, to further enhance the agent. Experimental results show that ReachAgent significantly improves the IoU Acc and Text Acc by 7.12% and 7.69% on the step-level and 4.72% and 4.63% on the task-level compared to the SOTA agent. Our data and code will be released upon acceptance.
zh

[NLP-37] LLM-KT: Aligning Large Language Models with Knowledge Tracing using a Plug-and-Play Instruction

【速读】: 本文旨在解决知识追踪(Knowledge Tracing, KT)问题,即基于学生过去的答题记录,预测其能否正确回答下一个问题,这是个性化教育中的核心课题。传统方法主要基于行为ID或文本信息学习行为序列,但在缺乏丰富世界知识推理的情况下,通常难以捕捉学生完整的行为模式。为此,本文提出一种基于大型语言模型(Large Language Models, LLMs)的知识追踪框架LLM-KT,以整合LLMs与传统序列交互模型的优势。其关键在于:在任务层面,设计即插即用指令(Plug-and-Play instruction)将LLMs与KT任务对齐,利用LLMs丰富的知识和强大的推理能力;在模态层面,通过插件上下文(plug-in context)与插件序列(plug-in sequence)整合传统方法学到的多模态信息,并借助压缩上下文嵌入和序列适配器增强LLMs,从而有效捕捉历史记录的长期依赖关系。实验结果表明,LLM-KT在四个典型数据集上取得了当前最先进的表现。

链接: https://arxiv.org/abs/2502.02945
作者: Ziwei Wang,Jie Zhou,Qin Chen,Min Zhang,Bo Jiang,Aimin Zhou,Qinchun Bai,Liang He
机构: School of Computer Science and Technology, East China Normal University (华东师范大学计算机科学与技术学院), China; Shanghai Open University (上海开放大学), China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The knowledge tracing (KT) problem is an extremely important topic in personalized education, which aims to predict whether students can correctly answer the next question based on their past question-answer records. Prior work on this task mainly focused on learning the sequence of behaviors based on the IDs or textual information. However, these studies usually fail to capture students’ sufficient behavioral patterns without reasoning with rich world knowledge about questions. In this paper, we propose a large language models (LLMs)-based framework for KT, named \texttt\textbfLLM-KT, to integrate the strengths of LLMs and traditional sequence interaction models. For task-level alignment, we design Plug-and-Play instruction to align LLMs with KT, leveraging LLMs’ rich knowledge and powerful reasoning capacity. For modality-level alignment, we design the plug-in context and sequence to integrate multiple modalities learned by traditional methods. To capture the long context of history records, we present a plug-in context to flexibly insert the compressed context embedding into LLMs using question-specific and concept-specific tokens. Furthermore, we introduce a plug-in sequence to enhance LLMs with sequence interaction behavior representation learned by traditional sequence models using a sequence adapter. Extensive experiments show that \texttt\textbfLLM-KT obtains state-of-the-art performance on four typical datasets by comparing it with approximately 20 strong baselines.
zh

[NLP-38] LLaVAC: Fine-tuning LLaVA as a Multimodal Sentiment Classifier

【速读】: 该论文旨在解决多模态情感分析中的分类问题,提出了一种名为LLaVAC的方法。其关键是通过设计包含单模态和多模态标签的结构化提示,对大型语言和视觉助手(Large Language and Vision Assistant, LLaVA)进行微调,从而实现有效的跨图像和文本模态的情感分类。

链接: https://arxiv.org/abs/2502.02938
作者: T. Chay-intr,Y. Chen,K. Viriyayudhakorn,T. Theeramunkong
机构: Intelligent Informatics and Service Innovation Research Center, Thailand(智能信息学与服务创新研究中心,泰国); iApp Technology Co., Ltd., Thailand(泰国iApp科技有限公司); Panasonic Research and Development on Artificial Intelligence (AI), Japan(日本松下人工智能研发部门); Artificial Intelligence Entrepreneur Association of Thailand (AIEAT), Thailand(泰国人工智能企业家协会); Sirindhorn International Institute of Technology, Thammasat University, Thailand(泰国法政大学诗琳通国际理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present LLaVAC, a method for constructing a classifier for multimodal sentiment analysis. This method leverages fine-tuning of the Large Language and Vision Assistant (LLaVA) to predict sentiment labels across both image and text modalities. Our approach involves designing a structured prompt that incorporates both unimodal and multimodal labels to fine-tune LLaVA, enabling it to perform sentiment classification effectively. Experiments on the MVSA-Single dataset demonstrate that LLaVAC outperforms existing methods in multimodal sentiment analysis across three data processing procedures. The implementation of LLaVAC is publicly available at this https URL.
zh

[NLP-39] SPARC: Subspace-Aware Prompt Adaptation for Robust Continual Learning in LLMs

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在连续学习(Continual Learning)过程中面临的知识保留与高效适应任务之间的矛盾。关键在于提出了一种名为SPARC的轻量级框架,通过主成分分析(Principal Component Analysis, PCA)识别紧凑的训练数据子空间,并在此低维空间中进行提示调优(prompt tuning),从而提高训练效率同时保持模型的广泛知识不受损害。此外,结合LoRA进一步增强了对计算资源限制的适应能力。
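
以下 PyTorch 草图(示意性质,维度与损失均为占位假设,非 SPARC 官方实现)演示其核心机制:用 PCA 得到训练数据嵌入的低维子空间基,仅优化子空间中的提示系数,模型本体保持冻结:

```python
import torch

def pca_basis(E: torch.Tensor, k: int) -> torch.Tensor:
    """E: 训练数据嵌入矩阵 (n_samples, d)。返回前 k 个主方向 (k, d)。"""
    E = E - E.mean(dim=0, keepdim=True)
    _, _, Vh = torch.linalg.svd(E, full_matrices=False)
    return Vh[:k]

d, k, n_tokens = 768, 16, 8
E = torch.randn(1000, d)                  # 占位:实际应为任务数据的嵌入
V = pca_basis(E, k)                       # (k, d) 低维子空间基

# 只优化低维系数 z;软提示向量 = z @ V,拼接到输入嵌入之前,主模型冻结
z = torch.zeros(n_tokens, k, requires_grad=True)
prompt = z @ V                            # (n_tokens, d)
loss = prompt.pow(2).mean()               # 占位损失;实际应为下游任务损失
loss.backward()
print(z.grad.shape)                       # 可训练参数量仅为 n_tokens * k
```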

链接: https://arxiv.org/abs/2502.02909
作者: Dinithi Jayasuriya,Sina Tayebati,Davide Ettori,Ranganath Krishnan,Amit Ranjan Trivedi(Intel Labs, Oregon)
机构: Department of Electrical and Computer Engineering, University of Illinois Chicago(电气与计算机工程系,芝加哥伊利诺伊大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose SPARC, a lightweight continual learning framework for large language models (LLMs) that enables efficient task adaptation through prompt tuning in a lower-dimensional space. By leveraging principal component analysis (PCA), we identify a compact subspace of the training data. Optimizing prompts in this lower-dimensional space enhances training efficiency, as it focuses updates on the most relevant features while reducing computational overhead. Furthermore, since the model’s internal structure remains unaltered, the extensive knowledge gained from pretraining is fully preserved, ensuring that previously learned information is not compromised during adaptation. Our method achieves high knowledge retention in both task-incremental and domain-incremental continual learning setups while fine-tuning only 0.04% of the model’s parameters. Additionally, by integrating LoRA, we enhance adaptability to computational constraints, allowing for a tradeoff between accuracy and training cost. Experiments on the SuperGLUE benchmark demonstrate that our PCA-based prompt tuning combined with LoRA maintains full knowledge retention while improving accuracy, utilizing only 1% of the model’s parameters. These results establish our approach as a scalable and resource-efficient solution for continual learning in LLMs.
zh

[NLP-40] ScholaWrite: A Dataset of End-to-End Scholarly Writing Process

【速读】: 该论文旨在解决理解学者写作过程中认知思维过程的问题。解决方案的关键在于收集端到端的写作数据(从个体想法到最终手稿),而不仅仅是最终的文稿。为此,作者引入了ScholaWrite数据集,这是首个包含完整稿件撰写过程中逐按键日志及其认知写作意图详尽注释的数据集。通过这一数据集,研究者能够更好地开发支持科学家认知思维过程的AI写作助手。

链接: https://arxiv.org/abs/2502.02904
作者: Linghe Wang,Minhwa Lee,Ross Volkov,Luan Tuyen Chau,Dongyeop Kang
机构: University of Minnesota (明尼苏达大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注: Equal contribution: Linghe Wang, Minhwa Lee | project page: this https URL

点击查看摘要

Abstract:Writing is a cognitively demanding task involving continuous decision-making, heavy use of working memory, and frequent switching between multiple activities. Scholarly writing is particularly complex as it requires authors to coordinate many pieces of multiform knowledge. To fully understand writers’ cognitive thought process, one should fully decode the end-to-end writing data (from individual ideas to final manuscript) and understand their complex cognitive mechanisms in scholarly writing. We introduce ScholaWrite dataset, the first-of-its-kind keystroke logs of an end-to-end scholarly writing process for complete manuscripts, with thorough annotations of cognitive writing intentions behind each keystroke. Our dataset includes LaTeX-based keystroke data from five preprints with nearly 62K total text changes and annotations across 4 months of paper writing. ScholaWrite shows promising usability and applications (e.g., iterative self-writing) for the future development of AI writing assistants for academic research, which necessitate complex methods beyond LLM prompting. Our experiments clearly demonstrated the importance of collection of end-to-end writing data, rather than the final manuscript, for the development of future writing assistants to support the cognitive thinking process of scientists. Our de-identified dataset, demo, and code repository are available on our project page.
zh

[NLP-41] What is in a name? Mitigating Name Bias in Text Embeddings via Anonymization

【速读】: 该论文旨在解决文本嵌入模型中存在的名称偏差(name bias)问题,这种偏差源于训练数据中的实体名称,如人名、地名、组织名等。论文的关键解决方案是在推理阶段引入文本匿名化(text anonymization),通过去除文本中的名称信息,同时保留其核心主题,从而减轻名称偏差的影响。此方法在两个下游自然语言处理任务中展示了显著的性能提升,并且无需额外的训练或优化过程。
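
下面是一个基于 spaCy 的极简匿名化草图(示意实现,实体类型集合为笔者假设;需先安装 en_core_web_sm 模型),演示推理阶段"先去除名称、再做嵌入"的思路:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # 需预先下载该英文模型

def anonymize(text: str) -> str:
    """将人名/地名/机构名替换为占位符,保留文本主题后再送入嵌入模型。"""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "GPE", "LOC", "ORG"}:
            out.append(text[last:ent.start_char])
            out.append(f"[{ent.label_}]")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(anonymize("Alice met Bob at Google in Paris."))
# 预期输出类似:"[PERSON] met [PERSON] at [ORG] in [GPE]."
```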

链接: https://arxiv.org/abs/2502.02903
作者: Sahil Manchanda,Pannaga Shivaswamy
机构: Pocket FM(口袋FM), India
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text-embeddings: bias arising from the presence of \textitnames such as persons, locations, organizations etc. in the text. Our study shows how the presence of \textitname-bias in text-embedding models can potentially lead to erroneous conclusions in assessment of thematic similarity. Text-embeddings can mistakenly indicate similarity between texts based on names in the text, even when their actual semantic content has no similarity or indicate dissimilarity simply because of the names in the text even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose \textittext-anonymization during inference which involves removing references to names, while preserving the core theme of the text. The efficacy of the anonymization approach is demonstrated on two downstream NLP tasks, achieving significant performance gains. Our simple and training-optimization-free approach offers a practical and easily implementable solution to mitigate name bias.
zh

[NLP-42] A Benchmark for the Detection of Metalinguistic Disagreements between LLM s and Knowledge Graphs ISWC2024

【速读】: 该论文旨在探讨大型语言模型(LLMs)与知识图谱(KGs)之间是否存在元语言分歧(metalinguistic disagreement),这一现象在自然语言处理和生成中可能会影响事实提取任务的评估准确性。论文的关键解决方案在于提出一种基准测试,用于评估对LLMs与KGs之间事实分歧和元语言分歧的检测能力。基于对T-REx知识对齐数据集的调查,作者提出假设:这种元语言分歧确实存在于LLMs与KGs之间,并可能对知识图谱工程实践产生影响。初步的概念验证已发布在Github上。

链接: https://arxiv.org/abs/2502.02896
作者: Bradley P. Allen,Paul T. Groth
机构: University of Amsterdam(阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 tables, to appear in Reham Alharbi, Jacopo de Berardinis, Paul Groth, Albert Meroño-Peñuela, Elena Simperl, Valentina Tamma (eds.), ISWC 2024 Special Session on Harmonising Generative AI and Semantic Web Technologies. this http URL (forthcoming), for associated code and data see this https URL

点击查看摘要

Abstract:Evaluating large language models (LLMs) for tasks like fact extraction in support of knowledge graph construction frequently involves computing accuracy metrics using a ground truth benchmark based on a knowledge graph (KG). These evaluations assume that errors represent factual disagreements. However, human discourse frequently features metalinguistic disagreement, where agents differ not on facts but on the meaning of the language used to express them. Given the complexity of natural language processing and generation using LLMs, we ask: do metalinguistic disagreements occur between LLMs and KGs? Based on an investigation using the T-REx knowledge alignment dataset, we hypothesize that metalinguistic disagreement does in fact occur between LLMs and KGs, with potential relevance for the practice of knowledge graph engineering. We propose a benchmark for evaluating the detection of factual and metalinguistic disagreements between LLMs and KGs. An initial proof of concept of such a benchmark is available on Github.
zh

[NLP-43] Lowering the Barrier of Machine Learning: Achieving Zero Manual Labeling in Review Classification Using LLMs

【速读】: 该论文旨在解决中小企业和个人在利用基于机器学习的情感分类技术提升客户满意度方面面临的挑战。关键解决方案在于整合大型语言模型(Large Language Models, LLMs),特别是生成式预训练转换器(Generative Pre-trained Transformer, GPT)和双向编码器表示从转换器(Bidirectional Encoder Representations from Transformers, BERT)模型,从而降低了技术复杂性,使得情感分类技术更加易于应用且无需手动标注或专家知识调参,也减少了对大量计算资源的需求。通过这一方法,论文提出的技术手段显著降低了应用情感分类技术的门槛,增强了中小企业的竞争力,并推动了机器学习技术的普及。

链接: https://arxiv.org/abs/2502.02893
作者: Yejian Zhang,Shingo Takada
机构: Grad. School of Science and Technology, Keio University (科学与技术研究生院, 庆应义塾大学)
类目: Computation and Language (cs.CL)
备注: Accepted to 2025 11th International Conference on Computing and Artificial Intelligence (ICCAI 2025)

点击查看摘要

Abstract:With the internet’s evolution, consumers increasingly rely on online reviews for service or product choices, necessitating that businesses analyze extensive customer feedback to enhance their offerings. While machine learning-based sentiment classification shows promise in this realm, its technical complexity often bars small businesses and individuals from leveraging such advancements, which may end up making the competitive gap between small and large businesses even bigger in terms of improving customer satisfaction. This paper introduces an approach that integrates large language models (LLMs), specifically Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT)-based models, making it accessible to a wider audience. Our experiments across various datasets confirm that our approach retains high classification accuracy without the need for manual labeling, expert knowledge in tuning and data annotation, or substantial computational power. By significantly lowering the barriers to applying sentiment classification techniques, our methodology enhances competitiveness and paves the way for making machine learning technology accessible to a broader audience.
zh

[NLP-44] Achieving Operational Universality through a Turing Complete Chemputer

【速读】: 该论文旨在解决化学过程编程的复杂性和不可预测性问题,方法是将图灵完备性概念引入化学合成机器人平台。关键在于利用化学感知编程语言XDL,将复杂的化学过程分解为可执行的离散单元操作,从而确保化学合成的自动化与可编程性。论文通过颜色空间和条件逻辑的交互式演示验证了图灵完备性:将RGB颜色空间的1670万种组合离散化为5个取值,并在10个感兴趣区域(ROI)上进行测量,以此作为探索概念性化学空间的代理。这种方法为未来的化学编程语言提供了一个形式化框架,确保复杂逻辑运算能够被正确表达和执行,并具备纠错能力,从而支持日益复杂分子的自动化与自主化合成。

链接: https://arxiv.org/abs/2502.02872
作者: Daniel Gahler,Dean Thomas,Slawomir Lach,Leroy Cronin
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 7 figures, 28 references

点击查看摘要

Abstract:The most fundamental abstraction underlying all modern computers is the Turing Machine, that is if any modern computer can simulate a Turing Machine, an equivalence which is called Turing completeness, it is theoretically possible to achieve any task that can be algorithmically described by executing a series of discrete unit operations. In chemistry, the ability to program chemical processes is demanding because it is hard to ensure that the process can be understood at a high level of abstraction, and then reduced to practice. Herein we exploit the concept of Turing completeness applied to robotic platforms for chemistry that can be used to synthesise complex molecules through unit operations that execute chemical processes using a chemically-aware programming language, XDL. We leverage the concept of computability by computers to synthesizability of chemical compounds by automated synthesis machines. The results of an interactive demonstration of Turing completeness using the colour gamut and conditional logic are presented and examples of chemical use-cases are discussed. Over 16.7 million combinations of Red, Green, Blue (RGB) colour space were binned into 5 discrete values and measured over 10 regions of interest (ROIs), affording 78 million possible states per step and served as a proxy for conceptual, chemical space exploration. This formal description establishes a formal framework in future chemical programming languages to ensure complex logic operations are expressed and executed correctly, with the possibility of error correction, in the automated and autonomous pursuit of increasingly complex molecules.
zh

[NLP-45] Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

【速读】: 该论文旨在解决当前科学推理模型在跨领域泛化和多模态感知方面存在的局限性。论文的关键解决方案在于提出利用多模态大型语言模型(Multimodal Large Language Models, MLLMs)来克服这些限制,并增强科学推理能力。通过引入一个四阶段的研究路线图,论文强调了MLLM在整合和推理多样化数据类型方面的优势,从而推动数学、物理、化学和生物学等学科中的科学推理发展。

链接: https://arxiv.org/abs/2502.02871
作者: Yibo Yan,Shen Wang,Jiahao Huo,Jingheng Ye,Zhendong Chu,Xuming Hu,Philip S. Yu,Carla Gomes,Bart Selman,Qingsong Wen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short of multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. First, we propose a four-stage research roadmap of scientific reasoning capabilities, and highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. Second, we summarize the key challenges that remain obstacles to achieving MLLM’s full potential. To address these challenges, we propose actionable insights and suggestions for the future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with a valuable vision for achieving Artificial General Intelligence (AGI).
zh

[NLP-46] CAMI: A Counselor Agent Supporting Motivational Interviewing through State Inference and Topic Exploration

【速读】: 该论文旨在满足通过自动化手段提供可扩展且易获取的心理健康支持的需求,特别是在动机性访谈(Motivational Interviewing, MI)这一以客户为中心的咨询方法背景下。论文提出的关键解决方案是CAMI系统,它基于大型语言模型(LLMs),采用STAR框架,包括客户状态推断、动机话题探索及回应生成模块。消融研究进一步凸显了客户状态推断与话题探索对实现高效自动化咨询的关键作用。结果表明,CAMI不仅优于现有的先进方法,而且展现出更真实的咨询师行为。

链接: https://arxiv.org/abs/2502.02807
作者: Yizhe Yang,Palakorn Achananuparp,Heyan Huang,Jing Jiang,Kit Phey Leng,Nicholas Gabriel Lim,Cameron Tan Shi Ern,Ee-peng Lim
机构: Beijing Institute of Technology(北京理工大学); Singapore Management University(新加坡管理大学); Australian National University(澳大利亚国立大学); National Institute of Education(国家教育学院); Singapore University of Social Sciences(新加坡社会科学大学); National University of Singapore(新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Conversational counselor agents have become essential tools for addressing the rising demand for scalable and accessible mental health support. This paper introduces CAMI, a novel automated counselor agent grounded in Motivational Interviewing (MI) – a client-centered counseling approach designed to address ambivalence and facilitate behavior change. CAMI employs a novel STAR framework, consisting of client’s state inference, motivation topic exploration, and response generation modules, leveraging large language models (LLMs). These components work together to evoke change talk, aligning with MI principles and improving counseling outcomes for clients from diverse backgrounds. We evaluate CAMI’s performance through both automated and manual evaluations, utilizing simulated clients to assess MI skill competency, client’s state inference accuracy, topic exploration proficiency, and overall counseling success. Results show that CAMI not only outperforms several state-of-the-art methods but also shows more realistic counselor-like behavior. Additionally, our ablation study underscores the critical roles of state inference and topic exploration in achieving this performance.
zh

[NLP-47] Consistent Client Simulation for Motivational Interviewing-based Counseling

【速读】: 该论文旨在解决在精神健康咨询场景中模拟人类客户以训练和评估咨询师(无论是真人还是模拟)时所面临的挑战。具体而言,过去的研究并未关注精神健康咨询这类复杂对话任务,其难点在于确保客户的行动(即与咨询师的互动)与其既定的个人档案及负面行为设置保持一致。论文的关键解决方案在于提出一种新颖的框架,该框架能够追踪模拟客户的心理状态,控制其状态转换,并针对每种状态生成与其动机、信念、改变计划偏好及接受度相一致的行为。通过调整客户档案和接受度,论文展示了如何有效地创建适用于不同咨询情景的一致性模拟客户。

链接: https://arxiv.org/abs/2502.02802
作者: Yizhe Yang,Palakorn Achananuparp,Heyan Huang,Jing Jiang,John Pinto,Jenny Giam,Kit Phey Leng,Nicholas Gabriel Lim,Cameron Tan Shi Ern,Ee-peng Lim
机构: Beijing Institute of Technology(北京理工大学); Singapore Management University(新加坡管理大学); Australian National University(澳大利亚国立大学); ThoughtFull(ThoughtFull); Singapore Institute of Technology(新加坡科技设计大学); National Institute of Education(国家教育学院); Singapore University of Social Sciences(新加坡社会科学大学); National University of Singapore(新加坡国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Simulating human clients in mental health counseling is crucial for training and evaluating counselors (both human or simulated) in a scalable manner. Nevertheless, past research on client simulation did not focus on complex conversation tasks such as mental health counseling. In these tasks, the challenge is to ensure that the client's actions (i.e., interactions with the counselor) are consistent with its stipulated profiles and negative behavior settings. In this paper, we propose a novel framework that supports consistent client simulation for mental health counseling. Our framework tracks the mental state of a simulated client, controls its state transitions, and generates for each state behaviors consistent with the client's motivation, beliefs, preferred plan to change, and receptivity. By varying the client profile and receptivity, we demonstrate that consistent simulated clients for different counseling scenarios can be effectively created. Both our automatic and expert evaluations on the generated counseling sessions also show that our client simulation method achieves higher consistency than previous methods.
zh

[NLP-48] Leveraging the true depth of LLMs

【速读】: 该论文旨在通过减少预训练大型语言模型 (Large Language Models, LLMs) 的深度来降低其推理计算成本,同时保持性能不受显著影响。关键在于利用相邻层之间的解耦效应,将部分层两两分组以并行求值,从而以更高的并行度改写计算图。这种方法在无需重新训练或微调的情况下,将每秒生成的令牌数平均提升约1.20倍,同时保留原模型95%-99%的准确率。
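
以下 PyTorch 草图(示意性质,模块与规模均为笔者假设)对比串行残差块与"成对并行"近似:当相邻层近似解耦时,两者输出接近,而后者的两个子层可并行求值:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.f = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d))
    def forward(self, x):
        return self.f(x)          # 返回残差增量 F(x)

def sequential_forward(x, a, b):
    x = x + a(x)                  # 标准串行:b 依赖 a 的输出
    return x + b(x)

def paired_forward(x, a, b):
    # 两个增量都基于同一输入计算,可放到不同 CUDA 流/设备上并行求值
    return x + a(x) + b(x)

d = 64
a, b = ResidualBlock(d), ResidualBlock(d)
x = torch.randn(2, d)
# 当 a(x) 的增量较小(层间近似解耦)时,两种前向的差异很小
print((sequential_forward(x, a, b) - paired_forward(x, a, b)).abs().max())
```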

链接: https://arxiv.org/abs/2502.02790
作者: Ramón Calvo González,Daniele Paliotta,Matteo Pagliardini,Martin Jaggi,François Fleuret
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models demonstrate remarkable capabilities at the cost of high compute requirements. While recent research has shown that intermediate layers can be removed or have their order shuffled without impacting performance significantly, these findings have not been employed to reduce the computational cost of inference. We investigate several potential ways to reduce the depth of pre-trained LLMs without significantly affecting performance. Leveraging our insights, we present a novel approach that exploits this decoupling between layers by grouping some of them into pairs that can be evaluated in parallel. This modification of the computational graph – through better parallelism – results in an average improvement of around 1.20x on the number of tokens generated per second, without re-training nor fine-tuning, while retaining 95%-99% of the original accuracy. Empirical evaluation demonstrates that this approach significantly improves serving efficiency while maintaining model performance, offering a practical improvement for large-scale LLM deployment.
zh

[NLP-49] Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

【速读】: 该论文旨在解决现代大型语言模型(Large Language Model, LLM)推理引擎中的时间至首个标记(Time-to-First-Token, TTFT)优化问题。由于优化TTFT能够直接提升最大查询每秒处理量(Queries Per Second, QPS),并满足许多关键应用的需求。然而,提高TTFT极具挑战性,因为它完全受计算限制约束,并且性能瓶颈从自注意力部分转移到了MLP部分。论文的关键解决方案是提出了一种无需训练的框架SpecPrefill,它通过推测基于上下文的重要标记子集来加速长和中等上下文查询的推理TTFT。该方法的核心在于利用轻量级模型根据上下文推测局部重要标记,这些标记及其必要的位置信息随后被发送到主模型进行处理。
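
下面给出一个示意性草图(函数与保留比例均为笔者假设,非 SpecPrefill 官方实现),说明其核心步骤:由轻量模型为 prompt 中每个 token 打重要性分,仅将高分 token 及其原始位置交给主模型做 prefill:

```python
import torch

def select_important_tokens(importance: torch.Tensor, keep_ratio: float = 0.3):
    """importance: 轻量草稿模型给出的每个 prompt token 的重要性分数 (seq_len,)。
    返回被保留 token 的原始位置索引(升序),以便主模型保留位置信息。"""
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, k).indices
    return torch.sort(keep).values

scores = torch.randn(1024)                 # 占位:假设由草稿模型的注意力得到
keep_pos = select_important_tokens(scores, keep_ratio=0.3)
print(keep_pos.numel(), "of 1024 tokens kept")
# 主模型仅对 input_ids[keep_pos] 做 prefill,并以 keep_pos 作为 position_ids
```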

链接: https://arxiv.org/abs/2502.02789
作者: Jingyu Liu,Beidi Chen,Ce Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Improving time-to-first-token (TTFT) is an essentially important objective in modern large language model (LLM) inference engines. Because optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, boosting TTFT is notoriously challenging since it is purely compute-bounded and the performance bottleneck shifts from the self-attention to the MLP part. We present SpecPrefill, a training free framework that accelerates the inference TTFT for both long and medium context queries based on the following insight: LLMs are generalized enough to still preserve the quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate locally important tokens based on the context. These tokens, along with the necessary positional information, are then sent to the main model for processing. We evaluate SpecPrefill with a diverse set of tasks, followed by a comprehensive benchmarking of performance improvement both in a real end-to-end setting and ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with up to 7\times maximal end-to-end QPS on real downstream tasks and 7.66\times TTFT improvement during benchmarking.
zh

[NLP-50] SimMark: A Robust Sentence-Level Similarity-Based Watermarking Algorithm for Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)生成文本的可追溯性问题,提出了一种称为SimMark的后处理水印算法。SimMark的关键在于利用语义句子嵌入的相似性和拒绝采样技术,在不被人类察觉的情况下植入可检测的统计模式,并采用软计数机制以增强对抗释义攻击的鲁棒性。实验结果表明,SimMark在保持文本质量的同时,为LLM生成内容的鲁棒水印设定了新的基准。
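
以下是一个笔者假设的检测端草图(相似度区间与带宽参数均为示意值,非论文设定),说明"相邻句子嵌入相似度 + 软计数"的检测逻辑:带水印文本的相邻句相似度应集中落在某个区间内,软计数用平滑函数代替硬阈值以抵抗释义扰动:

```python
import numpy as np

def soft_count_detect(sims, lo=0.55, hi=0.75, tau=0.02):
    """sims: 相邻句子语义嵌入的余弦相似度序列。
    返回落在 [lo, hi] 区间内的"软比例";参数均为假设的示意值。"""
    def soft_in(x):
        left = 1.0 / (1.0 + np.exp(-(x - lo) / tau))   # 平滑的下界指示
        right = 1.0 / (1.0 + np.exp((x - hi) / tau))   # 平滑的上界指示
        return left * right
    return float(np.mean([soft_in(s) for s in sims]))

# 软比例显著高于无水印文本的基线比例时,可用 z 检验判定存在水印
print(soft_count_detect([0.60, 0.70, 0.30, 0.65]))
```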

链接: https://arxiv.org/abs/2502.02787
作者: Amirhossein Dabiriaghdam,Lele Wang
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 15 pages, 5 tables, 6 figures

点击查看摘要

Abstract:The rapid proliferation of large language models (LLMs) has created an urgent need for reliable methods to detect whether a text is generated by such models. In this paper, we propose SimMark, a posthoc watermarking algorithm that makes LLMs’ outputs traceable without requiring access to the model’s internal logits, enabling compatibility with a wide range of LLMs, including API-only models. By leveraging the similarity of semantic sentence embeddings and rejection sampling to impose detectable statistical patterns imperceptible to humans, and employing a soft counting mechanism, SimMark achieves robustness against paraphrasing attacks. Experimental results demonstrate that SimMark sets a new benchmark for robust watermarking of LLM-generated content, surpassing prior sentence-level watermarking techniques in robustness, sampling efficiency, and applicability across diverse domains, all while preserving the text quality.
zh

[NLP-51] Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning

【速读】: 该论文旨在解决使用固定预算的稀疏注意力 (Sparse Attention) 或键值 (KV) 缓存压缩算法在实际部署中无法适应动态场景,从而难以在准确性和效率之间取得最优平衡的问题。关键在于将 top-p 采样(核采样)引入稀疏注意力机制,实现自适应的预算分配,进而提出 Twilight 框架,使任何现有的稀疏注意力算法都能在不牺牲准确性的前提下获得自适应稀疏性。实证结果表明,Twilight 能够自适应地剪枝最多98%的冗余标记,在长上下文大语言模型 (LLM) 解码中将自注意力操作加速15.4倍,并将端到端每标记延迟加速3.9倍。
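
下面用一个 PyTorch 草图(示意性质,非 Twilight 官方实现)说明"借用 top-p 采样做自适应注意力剪枝"的核心想法:按注意力权重从大到小累加,保留累计概率达到 p 的最小 key 集合,预算随分布形状自适应变化:

```python
import torch

def top_p_prune(attn_logits: torch.Tensor, p: float = 0.95):
    """attn_logits: (seq_len,) 单个 query 对所有 key 的注意力 logit。
    返回被保留 key 的索引;注意力越集中,保留的 key 越少。"""
    probs = torch.softmax(attn_logits, dim=-1)
    sorted_p, idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_p, dim=-1)
    keep_n = int(torch.searchsorted(cum, torch.tensor(p)).item()) + 1
    return idx[:keep_n]

logits = torch.randn(4096)
print(top_p_prune(logits, 0.95).numel(), "of 4096 keys kept")
```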

链接: https://arxiv.org/abs/2502.02770
作者: Chaofan Lin,Jiaming Tang,Shuo Yang,Hanshuo Wang,Tian Tang,Boyu Tian,Ion Stoica,Song Han,Mingyu Gao
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top- p sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to 15.4\times acceleration in self-attention operations and 3.9\times acceleration in end-to-end per token latency in long context LLM decoding.
zh

[NLP-52] SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model

【速读】: 该论文旨在解决大型语言模型在资源受限环境中的部署难题。为实现这一目标,论文提出的关键解决方案是开发了一种名为SmolLM2的"小型"(17亿参数)语言模型。通过在约11万亿个token的数据集上进行多阶段训练,并引入新的专业化数据集(如FineMath、Stack-Edu和SmolTalk),同时结合小规模消融实验和手动优化过程,以动态调整各阶段的数据混合比率,最终证明SmolLM2在性能上超越了其他近期的小型语言模型,包括Qwen2.5-1.5B和Llama3.2-1B。

链接: https://arxiv.org/abs/2502.02737
作者: Loubna Ben Allal,Anton Lozhkov,Elie Bakouch,Gabriel Martín Blázquez,Guilherme Penedo,Lewis Tunstall,Andrés Marafioti,Hynek Kydlíček,Agustín Piqueres Lajarín,Vaibhav Srivastav,Joshua Lochner,Caleb Fahlgren,Xuan-Son Nguyen,Clémentine Fourrier,Ben Burtenshaw,Hugo Larcher,Haojun Zhao,Cyril Zakka,Mathieu Morlon,Colin Raffel,Leandro von Werra,Thomas Wolf
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art “small” (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations as well as a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 as well as all of the datasets we prepared in the course of this project.
zh

[NLP-53] Peri-LN: Revisiting Layer Normalization in the Transformer Architecture

【速读】: 该论文旨在解决在大规模训练中选择最优层归一化(Layer Normalization, LN)策略以确保训练稳定性和加速收敛的问题。尽管Pre-LN和Post-LN长期占据主导地位,但它们在大规模训练中的局限性逐渐显现。论文的关键在于提出并深入分析了一种新的LN策略——周边归一化(Peri-LN),它将层归一化置于子层周围。研究发现,Peri-LN能够理想地平衡方差增长,避免了Pre-LN和Post-LN所导致的梯度消失和激活值过大的问题。通过大规模实验验证,Peri-LN展示了更均衡的方差增长、更稳定的梯度流动和更高的收敛稳定性。
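
以下是一个极简的 PyTorch 草图(笔者根据摘要描述的放置方式写成,具体细节以论文为准),示意 Peri-LN "子层前后都做归一化"的结构:Pre-LN 只在子层之前放 LN,Post-LN 只在残差相加之后放 LN,而 Peri-LN 将 LN 围绕在子层两侧:

```python
import torch
import torch.nn as nn

class PeriLNBlock(nn.Module):
    """示意:输出 = x + LN_out(F(LN_in(x))),LN 围绕子层放置。"""
    def __init__(self, d: int, sublayer: nn.Module):
        super().__init__()
        self.ln_in = nn.LayerNorm(d)    # 子层之前的归一化(同 Pre-LN)
        self.ln_out = nn.LayerNorm(d)   # 子层输出、残差相加之前的归一化
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.ln_out(self.sublayer(self.ln_in(x)))

block = PeriLNBlock(64, nn.Linear(64, 64))
print(block(torch.randn(2, 64)).shape)
```

这一放置方式直观上同时约束了进入子层的输入尺度与写回残差流的增量尺度,与论文所述"平衡方差增长"的动机一致。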

链接: https://arxiv.org/abs/2502.02732
作者: Jeonghoon Kim,Byeongchan Lee,Cheonbok Park,Yeontaek Oh,Beomjun Kim,Taehwan Yoo,Seongjin Shin,Dongyoon Han,Jinwoo Shin,Kang Min Yoo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Designing Transformer architectures with the optimal layer normalization (LN) strategy that ensures large-scale training stability and expedite convergence has remained elusive, even in this era of large language models (LLMs). To this end, we present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformer training. Until recently, Pre-LN and Post-LN have long dominated standard practices despite their limitations in large-scale training. However, several open-source large-scale models have recently begun silently adopting a third strategy without much explanation. This strategy places layer normalization (LN) peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising empirical performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis shows that Peri-LN strikes an ideal balance in variance growth – unlike Pre-LN and Post-LN, which are prone to vanishing gradients and ``massive activations.‘’ To validate our theoretical insight, we conduct large-scale experiments on Transformers up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insights into the optimal placement and application of LN.
zh

[NLP-54] Cross-Lingual Transfer for Low-Resource Natural Language Processing

【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)在低资源语言上的应用难题,主要由于这些语言缺乏训练数据和计算资源。为了解决这一问题,论文聚焦于跨语言迁移学习,特别是通过改进的数据和模型基础方法来提升低资源语言的序列标注任务表现,如命名实体识别、意见目标抽取和论点挖掘。论文的关键解决方案包括:提出了一种新的基于数据的跨语言迁移方法T-Projection,它利用文本到文本的多语言模型和机器翻译系统显著提升了注释投影的效果;开发了一种约束解码算法,增强了零样本设置下的跨语言序列标注性能;以及构建了首个多语言文本到文本的医学模型Medical mT5,展示了研究在实际应用中的影响。

链接: https://arxiv.org/abs/2502.02722
作者: Iker García-Ferrero
机构: 未知
类目: Computation and Language (cs.CL)
备注: Doctoral Thesis: University of the Basque Country UPV/EHU

点击查看摘要

Abstract:Natural Language Processing (NLP) has seen remarkable advances in recent years, particularly with the emergence of Large Language Models that have achieved unprecedented performance across many tasks. However, these developments have mainly benefited a small number of high-resource languages such as English. The majority of languages still face significant challenges due to the scarcity of training data and computational resources. To address this issue, this thesis focuses on cross-lingual transfer learning, a research area aimed at leveraging data and models from high-resource languages to improve NLP performance for low-resource languages. Specifically, we focus on Sequence Labeling tasks such as Named Entity Recognition, Opinion Target Extraction, and Argument Mining. The research is structured around three main objectives: (1) advancing data-based cross-lingual transfer learning methods through improved translation and annotation projection techniques, (2) developing enhanced model-based transfer learning approaches utilizing state-of-the-art multilingual models, and (3) applying these methods to real-world problems while creating open-source resources that facilitate future research in low-resource NLP. More specifically, this thesis presents a new method to improve data-based transfer with T-Projection, a state-of-the-art annotation projection method that leverages text-to-text multilingual models and machine translation systems. T-Projection significantly outperforms previous annotation projection methods by a wide margin. For model-based transfer, we introduce a constrained decoding algorithm that enhances cross-lingual Sequence Labeling in zero-shot settings using text-to-text models. Finally, we develop Medical mT5, the first multilingual text-to-text medical model, demonstrating the practical impact of our research on real-world applications.
zh

[NLP-55] A Unified Understanding and Evaluation of Steering Methods

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)控制方法缺乏统一理解和一致评估的问题,这阻碍了相关领域的进展。论文的关键解决方案在于引入了一个统一框架,用于分析和评估这些控制方法(即转向方法,steering methods),并正式化其核心原则,提供理论见解以阐明其有效性。通过在多项选择和开放式文本生成任务上的全面实证评估,验证了这些见解,并确定了影响性能的关键因素,从而展示了某些方法的优越性。

链接: https://arxiv.org/abs/2502.02716
作者: Shawn Im,Yixuan Li
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Steering methods provide a practical approach to controlling large language models by applying steering vectors to intermediate activations, guiding outputs toward desired behaviors while avoiding retraining. Despite their growing importance, the field lacks a unified understanding and consistent evaluation across tasks and datasets, hindering progress. This paper introduces a unified framework for analyzing and evaluating steering methods, formalizing their core principles and offering theoretical insights into their effectiveness. Through comprehensive empirical evaluations on multiple-choice and open-ended text generation tasks, we validate these insights, identifying key factors that influence performance and demonstrating the superiority of certain methods. Our work bridges theoretical and practical perspectives, offering actionable guidance for advancing the design, optimization, and deployment of steering methods in LLMs.
zh

[NLP-56] Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq and Maliseet

【速读】: 该论文旨在解决低资源语言的文本到语音(Text-to-Speech, TTS)系统开发问题,特别是在数据稀缺的情况下。关键解决方案在于开发了一种轻量级的多语言TTS系统,并采用无注意力机制的架构,在与自注意力架构保持相当性能的同时具有更高的内存效率。研究表明,使用三种类型学上相似的语言联合训练多语言模型,在数据稀缺时性能优于单语模型。

链接: https://arxiv.org/abs/2502.02703
作者: Shenran Wang,Changbing Yang,Mike Parkhill,Chad Quinn,Christopher Hammerly,Jian Zhu
机构: University of British Columbia(英属哥伦比亚大学); SayITFirst(未知); CultureFoundry(未知)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:We present lightweight flow matching multilingual text-to-speech (TTS) systems for Ojibwe, Mi’kmaq, and Maliseet, three Indigenous languages in North America. Our results show that training a multilingual TTS model on three typologically similar languages can improve the performance over monolingual models, especially when data are scarce. Attention-free architectures are highly competitive with self-attention architecture with higher memory efficiency. Our research not only advances technical development for the revitalization of low-resource languages but also highlights the cultural gap in human evaluation protocols, calling for a more community-centered approach to human evaluation.
zh

[NLP-57] How Inclusively do LMs Perceive Social and Moral Norms? NAACL2025

【速读】: 该论文旨在探讨语言模型(Language Models, LMs)在感知社会和道德规范方面是否具有包容性,特别是其判断与不同人口群体(如性别、年龄和收入)价值观的对齐情况。论文的关键解决方案在于引入了一种名为绝对距离对齐度量(Absolute Distance Alignment Metric, ADA-Met)的方法,用于量化语言模型在序数问题上的对齐情况,并通过比较11个语言模型与100个人类注释者对准则提示(rules-of-thumb, RoTs)的反应来评估其包容性。研究发现,LM的响应在不同人群间存在显著差异:模型输出与年轻、高收入群体的观点更为接近,这引发了对边缘化群体视角代表性不足的担忧。研究表明,进一步努力使语言模型更加包容多元的人类价值观至关重要。

链接: https://arxiv.org/abs/2502.02696
作者: Michael Galarnyk,Agam Shah,Dipanwita Guhathakurta,Poojitha Nandigam,Sudheer Chava
机构: Georgia Institute of Technology
类目: Computation and Language (cs.CL)
备注: Accepted at NAACL 2025 Findings

点击查看摘要

Abstract:This paper discusses and contains offensive content. Language models (LMs) are used in decision-making systems and as interactive assistants. However, how well do these models making judgements align with the diversity of human values, particularly regarding social and moral norms? In this work, we investigate how inclusively LMs perceive norms across demographic groups (e.g., gender, age, and income). We prompt 11 LMs on rules-of-thumb (RoTs) and compare their outputs with the existing responses of 100 human annotators. We introduce the Absolute Distance Alignment Metric (ADA-Met) to quantify alignment on ordinal questions. We find notable disparities in LM responses, with younger, higher-income groups showing closer alignment, raising concerns about the representation of marginalized perspectives. Our findings highlight the importance of further efforts to make LMs more inclusive of diverse human values. The code and prompts are available on GitHub under the CC BY-NC 4.0 license.
zh

[NLP-58] Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation

【速读】: 该论文旨在解决流式多说话人语音翻译任务中的两个关键问题:实时检测说话人变化及识别说话人性别。为实现这一目标,论文提出将说话人嵌入(speaker embeddings)融入基于转换器的端到端流式语音翻译模型中。这种方法的关键在于利用说话人嵌入信息,以实现高精度的说话人变化检测和性别分类。

链接: https://arxiv.org/abs/2502.02683
作者: Peidong Wang,Naoyuki Kanda,Jian Xue,Jinyu Li,Xiaofei Wang,Aswin Shanmugam Subramanian,Junkun Chen,Sunit Sivasankaran,Xiong Xiao,Yong Zhao
机构: Microsoft (微软)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker’s gender is. Speaker change information can be used to create audio prompts for a zero-shot text-to-speech system, and gender can help to select speaker profiles in a conventional text-to-speech model. We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model. Our experiments demonstrate that the proposed methods can achieve high accuracy for both speaker change detection and gender classification.
zh

[NLP-59] Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)和TabPFN虽然在小数据集上表现优异,却因上下文长度受限而在中大规模数据集上不敌梯度提升决策树(Gradient-Boosted Decision Trees, GBDTs)的问题。解决方案的关键在于提出两种轻量级融合方法LLM-Boost和PFN-Boost,分别将LLMs和TabPFN与GBDTs结合,使可扩展的GBDTs能够受益于Transformer模型的自然语言理解能力与预训练先验,从而在广泛的数据集规模范围内超越任一单独组件的性能。
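
论文摘要未给出融合的具体公式,一种常见的融合思路是让 GBDT 拟合 Transformer 预测的残差。下面是基于该假设的极简 sklearn 草图(llm_pred 用加噪的真值模拟,并非真实模型输出,仅作示意):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

# 假设:llm_pred 是 LLM/TabPFN 在零/少样本下给出的预测(此处用噪声版真值模拟)
llm_pred = y + rng.normal(scale=0.5, size=500)

# 融合:GBDT 拟合 transformer 预测的残差,最终预测 = llm_pred + gbdt(X)
gbdt = GradientBoostingRegressor().fit(X, y - llm_pred)
fused = llm_pred + gbdt.predict(X)

print("fused MSE:", np.mean((fused - y) ** 2))
print("llm-only MSE:", np.mean((llm_pred - y) ** 2))
```

这种残差式融合让树模型只需修正 Transformer 先验的系统性误差,小数据时主要依赖先验、大数据时树模型逐渐主导,与论文描述的跨规模优势在直觉上一致。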

链接: https://arxiv.org/abs/2502.02672
作者: Mayuka Jayawardhana(1),Renbo Tu(2),Samuel Dooley(3),Valeriia Cherepanova(4),Andrew Gordon Wilson(5),Frank Hutter(6),Colin White(7),Tom Goldstein(1),Micah Goldblum(8) ((1) University of Maryland, (2) University of Toronto, (3) Meta, (4) Amazon, (5) New York University, (6) University of Freiburg, (7) Abacus.AI, (8) Columbia University)
机构: University of Maryland(马里兰大学); University of Toronto(多伦多大学); Meta(Meta); Amazon(亚马逊); New York University(纽约大学); University of Freiburg(弗莱堡大学); Abacus.AI; Columbia University(哥伦比亚大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and lightweight approach for fusing large language models and TabPFN with gradient-boosted decision trees, which allows scalable GBDTs to benefit from the natural language capabilities and pretraining of transformers. We name our fusion methods LLM-Boost and PFN-Boost, respectively. While matching or surpassing the performance of the transformer at sufficiently small dataset sizes and GBDTs at sufficiently large sizes, LLM-Boost and PFN-Boost outperform both standalone components on a wide range of dataset sizes in between. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms. We find that PFN-Boost achieves the best average performance among all methods we test for all but very small dataset sizes. We release our code at this http URL .

[NLP-60] On Teacher Hacking in Language Model Distillation

Quick read: This paper investigates whether a phenomenon analogous to reward hacking, dubbed "teacher hacking," can occur during knowledge distillation. The key to the solution is a controlled experimental setup with three models: an oracle LM representing the ground-truth distribution, a teacher LM distilled from the oracle, and a student LM distilled from the teacher. The experiments show that teacher hacking does occur when distilling from a fixed offline dataset, and that it can be detected by observing when the optimization process deviates from polynomial convergence laws; in contrast, online data generation effectively mitigates it, with data diversity identified as the key factor in preventing hacking. A toy detector in this spirit follows the abstract below.

Link: https://arxiv.org/abs/2502.02671
Authors: Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. Such phenomenon is in line with Goodhart’s law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.
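A minimal, illustrative detector in the spirit of the paper's observation (the fit fraction and tolerance below are assumptions, not values from the paper): fit a power law to the early distillation-loss curve in log-log space and flag steps that later deviate from the extrapolated law.

```python
# Toy deviation detector: fit loss(t) ~ a * t^b (b < 0) on the early steps,
# then report the steps whose measured loss leaves the extrapolated power law
# by more than a relative tolerance.
import numpy as np

def detect_power_law_deviation(steps, losses, fit_frac=0.3, tol=0.2):
    steps = np.asarray(steps, dtype=float)   # assumed positive (e.g., 1, 2, ...)
    losses = np.asarray(losses, dtype=float)
    n_fit = max(2, int(len(steps) * fit_frac))
    b, log_a = np.polyfit(np.log(steps[:n_fit]), np.log(losses[:n_fit]), 1)
    predicted = np.exp(log_a) * steps ** b
    rel_err = np.abs(losses - predicted) / predicted
    return steps[rel_err > tol]              # candidate "teacher hacking" region
```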

[NLP-61] A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)

Quick read: This paper tackles the performance degradation of Transformer-based large language models (LLMs) on inputs exceeding their training context window, which stems from positional out-of-distribution (O.O.D.) effects that disrupt attention computation. The key is Greedy Attention Logit Interpolation (GALI), a training-free length-extrapolation method that maximizes the use of pretrained positional intervals while avoiding attention-logit outliers through attention logit interpolation. The study shows that GALI consistently outperforms state-of-the-art training-free methods, and reveals that LLMs interpret positional intervals unevenly within their training window, so extrapolating within a smaller positional interval range yields better results, even for short-context tasks. GALI thus represents a significant step toward resolving the positional O.O.D. challenge and enabling more reliable long-text understanding in LLMs. A sketch of the underlying position-interpolation idea follows the abstract below.

Link: https://arxiv.org/abs/2502.02659
Authors: Yan Li, Tianyi Zhang, Zechuan Li, Soyeon Caren Han
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, under review

Click to view abstract

Abstract:Transformer-based Large Language Models (LLMs) struggle to process inputs exceeding their training context window, with performance degrading due to positional out-of-distribution (O.O.D.) effects that disrupt attention computations. Existing solutions, whether fine-tuning or training-free methods, are limited by computational inefficiency, attention logit outliers, or loss of local positional information. To address this, we propose Greedy Attention Logit Interpolation (GALI), a training-free length extrapolation method that maximizes the utilization of pretrained positional intervals while avoiding attention logit outliers through attention logit interpolation. The result demonstrates that GALI consistently outperforms state-of-the-art training-free methods. Our findings reveal that LLMs interpret positional intervals unevenly within their training context window, suggesting that extrapolating within a smaller positional interval range yields superior results, even for short-context tasks. GALI represents a significant step toward resolving the positional O.O.D. challenge, enabling more reliable long-text understanding in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at this https URL.
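GALI itself interpolates attention logits; as background, the generic position-interpolation idea it builds on can be sketched as follows (a simplification for illustration, not the paper's algorithm):

```python
# Minimal sketch of training-free positional interpolation: squeeze the
# positions of an over-long sequence back into the pretrained window so that
# every position index stays in-distribution. GALI additionally interpolates
# attention logits greedily, which is not reproduced here.
import torch

def interpolated_positions(seq_len: int, train_window: int) -> torch.Tensor:
    if seq_len <= train_window:
        return torch.arange(seq_len, dtype=torch.float32)
    scale = (train_window - 1) / (seq_len - 1)
    return torch.arange(seq_len, dtype=torch.float32) * scale  # in [0, window-1]
```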

[NLP-62] ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization

Quick read: This paper addresses the question of the optimal trade-off between model size and accuracy across quantization bit-widths. The key is ParetoQ, a unified framework that enables rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. The study finds a notable learning transition between 2 and 3 bits: models fine-tuned at 3 bits and above stay close to their original pretrained distributions, whereas representations change drastically at 2 bits and below. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths, and shows that ternary, 2-bit, and 3-bit quantization offer comparable size-accuracy trade-offs that generally exceed 4-bit and binary quantization. A generic ternary-quantizer sketch follows the abstract below.

Link: https://arxiv.org/abs/2502.02631
Authors: Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework for different bits has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: For 3-bits and above, the fine-tuned models stay close to their original pre-trained distributions, whereas for learning 2-bit networks or below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off and generally exceeds 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
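For concreteness, a generic ternary (1.58-bit) weight quantizer with a straight-through estimator is sketched below; the threshold and scale heuristics are common choices in the low-bit literature and are not claimed to be ParetoQ's refined quantization functions.

```python
# Illustrative ternary quantizer with a straight-through estimator (STE):
# weights are mapped to {-alpha, 0, +alpha}; gradients pass through unchanged.
import torch

class TernaryQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        delta = 0.75 * w.abs().mean()        # common ternary threshold heuristic
        q = (w > delta).float() - (w < -delta).float()   # values in {-1, 0, +1}
        nz = q.abs() > 0
        alpha = w[nz].abs().mean() if nz.any() else w.new_tensor(1.0)
        return alpha * q

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                   # straight-through estimator

# Usage inside a quantization-aware training step:
# w_q = TernaryQuant.apply(layer.weight)
```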

[NLP-63] SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation

Quick read: This paper addresses the high latency and error propagation of retrieval-augmented generation (RAG) for speech large language models (SLLMs), whose conventional pipelines chain automatic speech recognition (ASR) with text-based retrieval in two stages. The key is a unified embedding framework that removes the need for intermediate text representations: separate speech and text encoders are followed by a shared scaling layer that maps both modalities into a common embedding space, substantially reducing pipeline latency while improving retrieval accuracy. A minimal dual-encoder sketch follows the abstract below.

Link: https://arxiv.org/abs/2502.02603
Authors: Chunyu Sun, Bingyu Liu, Zhichao Cui, Anbin Qi, Tian-hao Zhang, Dinghao Zhou, Lewei Lu
Affiliations: SenseTime Research
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Click to view abstract

Abstract:Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language models (LLMs) applications. However, when it comes to speech large language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates the need for intermediate text representations. Specifically, the framework includes separate speech and text encoders, followed by a shared scaling layer that maps both modalities into a common embedding space. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods. We also provide a theoretical analysis of the challenges inherent in end-to-end speech retrieval and introduce architectural principles for effective speech-to-document matching. Extensive experiments demonstrate the robustness of our approach across diverse acoustic conditions and speaker variations, paving the way for a new paradigm in multimodal SLLMs retrieval systems.
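A minimal sketch of the unified embedding idea described above, with placeholder encoder modules (the encoders are assumed to return pooled, fixed-size features; all names are illustrative):

```python
# Dual-encoder with one shared projection ("scaling") layer mapping speech
# and text into a common embedding space for retrieval.
import torch.nn as nn
import torch.nn.functional as F

class UnifiedEmbedder(nn.Module):
    def __init__(self, speech_encoder: nn.Module, text_encoder: nn.Module,
                 enc_dim: int = 768, embed_dim: int = 256):
        super().__init__()
        self.speech_encoder = speech_encoder   # assumed to return (batch, enc_dim)
        self.text_encoder = text_encoder       # assumed to return (batch, enc_dim)
        self.shared_proj = nn.Linear(enc_dim, embed_dim)  # shared scaling layer

    def embed_speech(self, audio):
        return F.normalize(self.shared_proj(self.speech_encoder(audio)), dim=-1)

    def embed_text(self, tokens):
        return F.normalize(self.shared_proj(self.text_encoder(tokens)), dim=-1)
```

Retrieval then reduces to cosine similarity between speech queries and document embeddings, with no intermediate ASR transcript.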

Computer Vision

[CV-0] Seeing World Dynamics in a Nutshell

Quick read: This paper aims to represent casually captured monocular videos efficiently while preserving spatial and temporal coherence. Existing approaches rely mostly on 2D/2.5D techniques that treat videos as collections of spatiotemporal pixels, and they struggle with complex motion, occlusion, and geometric consistency because they lack temporal coherence and explicit 3D structure. The key is the NutWorld framework, which represents a video in its intrinsic 3D form as spatiotemporally aligned flows of Gaussian primitives, enabling optimization-free scene modeling strengthened by effective depth and flow regularization.

Link: https://arxiv.org/abs/2502.03465
Authors: Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, Xinchao Wang
Affiliations: National University of Singapore; Nanyang Technological University; Skywork AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:We consider the problem of efficiently representing casually captured monocular videos in a spatially- and temporally-coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques treating videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free scene modeling with effective depth and flow regularization. Through comprehensive experiments, we demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling various downstream applications in real-time. Demos and code will be available at this https URL.

[CV-1] SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living

Quick read: This paper addresses the limitations of existing models for understanding Activities of Daily Living (ADL) videos, which struggle with similar appearances, subtle motion patterns, and multiple viewpoints, and which generalize poorly to unseen action classes. The key is the SKI models, which inject 3D skeleton information into the vision-language embedding space via SkeletonCLIP, enabling collaborative training with Vision-Language Models (VLMs) and Large Vision-Language Models (LVLMs). This integration improves generalization while requiring no skeleton data at inference, enhancing robustness for real-world applications.

Link: https://arxiv.org/abs/2502.03459
Authors: Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, Srijan Das
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.

[CV-2] Dress-1-to-3: Single Image to Simulation-Ready 3D Outfit with Diffusion Prior and Differentiable Physics

Quick read: This paper addresses the generation of separable, simulation-ready 3D garment models from a single image; existing methods usually produce a single fused model, which limits downstream use. The key is the Dress-1-to-3 pipeline: a pre-trained image-to-sewing-pattern model and a pre-trained multi-view diffusion model first produce coarse sewing patterns and multi-view images, after which a differentiable garment simulator refines the sewing patterns; combined with a texture generation module and a human motion generation module, the pipeline finally yields customized, physics-plausible, and realistic dynamic garment demonstrations.

Link: https://arxiv.org/abs/2502.03449
Authors: Xuan Li, Chang Yu, Wenxin Du, Ying Jiang, Tianyi Xie, Yunuo Chen, Yin Yang, Chenfanfu Jiang
Affiliations: UCLA; University of Utah
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Recent advances in large models have significantly advanced image-to-3D reconstruction. However, the generated models are often fused into a single piece, limiting their applicability in downstream tasks. This paper focuses on 3D garment generation, a key area for applications like virtual try-on with dynamic garment animations, which require garments to be separable and simulation-ready. We introduce Dress-1-to-3, a novel pipeline that reconstructs physics-plausible, simulation-ready separated garments with sewing patterns and humans from an in-the-wild image. Starting with the image, our approach combines a pre-trained image-to-sewing pattern generation model for creating coarse sewing patterns with a pre-trained multi-view diffusion model to produce multi-view images. The sewing pattern is further refined using a differentiable garment simulator based on the generated multi-view images. Versatile experiments demonstrate that our optimization approach substantially enhances the geometric alignment of the reconstructed 3D garments and humans with the input image. Furthermore, by integrating a texture generation module and a human motion generation module, we produce customized physics-plausible and realistic dynamic garment demonstrations. Project page: this https URL

[CV-3] Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Quick read: This paper studies the structure of the latent space used by latent diffusion models for high-resolution image synthesis. The key is MAETok, an autoencoder that uses mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. The study shows that the structure of the latent space, rather than variational constraints, is what matters for effective diffusion models: as a plain (non-variational) autoencoder, MAETok achieves state-of-the-art ImageNet generation while offering markedly faster training and higher inference throughput.

Link: https://arxiv.org/abs/2502.03444
Authors: Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the latent space from tokenizer for better learning and generation of diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to the latent distributions with better structure, such as the ones with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models are released.

[CV-4] A Temporal Convolutional Network-Based Approach and a Benchmark Dataset for Colonoscopy Video Temporal Segmentation

Quick read: This paper addresses the autonomous segmentation of full-procedure colonoscopy videos into anatomical sections and procedural phases. The key is ColonTCN, a learning-based architecture whose custom temporal convolutional blocks efficiently capture long temporal dependencies for the temporal segmentation of colonoscopy videos. The authors also release an open-access annotated dataset and propose a dual k-fold cross-validation protocol that includes evaluation on unseen, multi-center data. A generic dilated temporal block is sketched after the abstract below.

Link: https://arxiv.org/abs/2502.03430
Authors: Carlo Biffi, Giorgio Roffo, Pietro Salvagnini, Andrea Cherubini
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Following recent advancements in computer-aided detection and diagnosis systems for colonoscopy, the automated reporting of colonoscopy procedures is set to further revolutionize clinical practice. A crucial yet underexplored aspect in the development of these systems is the creation of computer vision models capable of autonomously segmenting full-procedure colonoscopy videos into anatomical sections and procedural phases. In this work, we aim to create the first open-access dataset for this task and propose a state-of-the-art approach, benchmarked against competitive models. We annotated the publicly available REAL-Colon dataset, consisting of 2.7 million frames from 60 complete colonoscopy videos, with frame-level labels for anatomical locations and colonoscopy phases across nine categories. We then present ColonTCN, a learning-based architecture that employs custom temporal convolutional blocks designed to efficiently capture long temporal dependencies for the temporal segmentation of colonoscopy videos. We also propose a dual k-fold cross-validation evaluation protocol for this benchmark, which includes model assessment on unseen, multi-center data. ColonTCN achieves state-of-the-art performance in classification accuracy while maintaining a low parameter count when evaluated using the two proposed k-fold cross-validation settings, outperforming competitive models. We report ablation studies to provide insights into the challenges of this task and highlight the benefits of the custom temporal convolutional blocks, which enhance learning and improve model efficiency. We believe that the proposed open-access benchmark and the ColonTCN approach represent a significant advancement in the temporal segmentation of colonoscopy procedures, fostering further open-access research to address this clinical need.
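A generic dilated temporal convolutional block of the kind such architectures stack to capture long temporal dependencies (the paper's custom blocks may differ; this is a representative sketch):

```python
# Residual 1D temporal block with dilation; stacking blocks with dilations
# 1, 2, 4, ... grows the receptive field exponentially over the frame axis.
import torch.nn as nn

class TemporalBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2     # preserve sequence length
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):                           # x: (batch, channels, frames)
        return self.act(self.norm(self.conv(x))) + x
```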

[CV-5] TruePose: Human-Parsing-guided Attention Diffusion for Full-ID Preserving Pose Transfer

Quick read: This paper addresses the difficulty Pose-Guided Person Image Synthesis (PGPIS) methods have in preserving clothing details from the source image while maintaining facial features, especially when the source and target poses differ substantially. The key is human-parsing-guided attention diffusion, realized as a human-parsing-aware Siamese network with three main components: dual identical UNets (a TargetNet for diffusion denoising and a SourceNet for source-image embedding extraction), a human-parsing-guided fusion attention (HPFA) module, and a CLIP-guided attention alignment (CAA) module. These modules embed face and clothing patterns into target-image generation adaptively and effectively, markedly improving the preservation of both facial and clothing appearance.

Link: https://arxiv.org/abs/2502.03426
Authors: Zhihong Xu, Dongxia Wang, Peng Du, Yang Cao, Qing Guo
Affiliations: Zhejiang University; Alibaba Group; Institute of High Performance Computing (IHPC), A*STAR, Singapore; Centre for Frontier AI Research (CFAR), A*STAR, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Pose-Guided Person Image Synthesis (PGPIS) generates images that maintain a subject’s identity from a source image while adopting a specified target pose (e.g., skeleton). While diffusion-based PGPIS methods effectively preserve facial features during pose transformation, they often struggle to accurately maintain clothing details from the source image throughout the diffusion process. This limitation becomes particularly problematic when there is a substantial difference between the source and target poses, significantly impacting PGPIS applications in the fashion industry where clothing style preservation is crucial for copyright protection. Our analysis reveals that this limitation primarily stems from the conditional diffusion model’s attention modules failing to adequately capture and preserve clothing patterns. To address this limitation, we propose human-parsing-guided attention diffusion, a novel approach that effectively preserves both facial and clothing appearance while generating high-quality results. We propose a human-parsing-aware Siamese network that consists of three key components: dual identical UNets (TargetNet for diffusion denoising and SourceNet for source image embedding extraction), a human-parsing-guided fusion attention (HPFA), and a CLIP-guided attention alignment (CAA). The HPFA and CAA modules can embed the face and clothes patterns into the target image generation adaptively and effectively. Extensive experiments on both the in-shop clothes retrieval benchmark and the latest in-the-wild human editing dataset demonstrate our method’s significant advantages over 13 baseline approaches for preserving both facial and clothes appearance in the source image.

[CV-6] Concept Based Explanations and Class Contrasting

Quick read: This paper addresses the challenge of explaining deep neural network predictions, covering both explanations for individual class predictions and contrasts between any two classes, i.e., why the model predicts one class over the other. The key is a concept-based explanation method, validated with both qualitative and quantitative tests on several openly available ImageNet1K classification models and on a segmentation model trained to detect tumor in stained tissue samples.

Link: https://arxiv.org/abs/2502.03422
Authors: Rudolf Herdt, Daniel Otero Baguer
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Explaining deep neural networks is challenging, due to their large size and non-linearity. In this paper, we introduce a concept-based explanation method, in order to explain the prediction for an individual class, as well as contrasting any two classes, i.e. explain why the model predicts one class over the other. We test it on several openly available classification models trained on ImageNet1K, as well as on a segmentation model trained to detect tumor in stained tissue samples. We perform both qualitative and quantitative tests. For example, for a ResNet50 model from pytorch model zoo, we can use the explanation for why the model predicts a class ‘A’ to automatically select six dataset crops where the model does not predict class ‘A’. The model then predicts class ‘A’ again for the newly combined image in 71% of the cases (works for 710 out of the 1000 classes). The code including an .ipynb example is available on git: this https URL.

[CV-7] Can Text-to-Image Generative Models Accurately Depict Age? A Comparative Study on Synthetic Portrait Generation and Age Estimation

Quick read: This paper evaluates how accurately text-to-image generative models represent demographic attributes, including age, nationality, and gender, when creating synthetic portraits. The study uses detailed descriptive prompts (e.g., "a photorealistic selfie photo of a 32-year-old Canadian male") spanning 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. The key is to assess the fidelity of depicted age by comparing generated images against two established age-estimation models. The findings show that while these models consistently generate faces reflecting different identities, their ability to capture specific ages accurately across diverse demographic backgrounds remains highly variable; current synthetic data may therefore be insufficiently reliable for high-stakes age-related tasks unless substantial filtering and curation are invested.

Link: https://arxiv.org/abs/2502.03420
Authors: Alexey A. Novikov, Miroslav Vranka, François David, Artem Voronin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Text-to-image generative models have shown remarkable progress in producing diverse and photorealistic outputs. In this paper, we present a comprehensive analysis of their effectiveness in creating synthetic portraits that accurately represent various demographic attributes, with a special focus on age, nationality, and gender. Our evaluation employs prompts specifying detailed profiles (e.g., Photorealistic selfie photo of a 32-year-old Canadian male), covering a broad spectrum of 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. We compare the generated images against ground truth age estimates from two established age estimation models to assess how faithfully age is depicted. Our findings reveal that although text-to-image models can consistently generate faces reflecting different identities, the accuracy with which they capture specific ages and do so across diverse demographic backgrounds remains highly variable. These results suggest that current synthetic data may be insufficiently reliable for high-stakes age-related tasks requiring robust precision, unless practitioners are prepared to invest in significant filtering and curation. Nevertheless, they may still be useful in less sensitive or exploratory applications, where absolute age precision is not critical.

[CV-8] Deep Clustering via Probabilistic Ratio-Cut Optimization AISTATS

Quick read: This paper proposes a novel approach to optimizing the graph ratio-cut by modeling the binary assignments as random variables. The key is an upper bound on the expected ratio-cut together with an unbiased estimate of its gradient, which allows the parameters of the assignment variables to be learned in an online setting. The resulting clustering (PRCut) outperforms the Rayleigh-quotient relaxation of the combinatorial problem, its online-learning extensions, and several widely used methods; it aligns closely with the similarity measure and can perform as well as a supervised classifier when label-based similarities are provided. A differentiable relaxation in this spirit follows the abstract below.

Link: https://arxiv.org/abs/2502.03405
Authors: Ayoub Ghriss, Claire Monteleoni
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025, Mai Khao, Thailand. PMLR: Volume 258

Click to view abstract

Abstract:We propose a novel approach for optimizing the graph ratio-cut by modeling the binary assignments as random variables. We provide an upper bound on the expected ratio-cut, as well as an unbiased estimate of its gradient, to learn the parameters of the assignment variables in an online setting. The clustering resulting from our probabilistic approach (PRCut) outperforms the Rayleigh quotient relaxation of the combinatorial problem, its online learning extensions, and several widely used methods. We demonstrate that the PRCut clustering closely aligns with the similarity measure and can perform as well as a supervised classifier when label-based similarities are provided. This novel approach can leverage out-of-the-box self-supervised representations to achieve competitive performance and serve as an evaluation method for the quality of these representations.
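For intuition, a differentiable relaxation of the ratio-cut with probabilistic assignments can be written as follows; this generic relaxation illustrates the quantity being optimized, not PRCut's specific upper bound or unbiased gradient estimator.

```python
# Soft ratio-cut: with soft memberships P (rows sum to 1), estimate the
# expected cut of each cluster divided by its expected size, summed over
# clusters.
import torch

def soft_ratio_cut(W: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """W: (n, n) symmetric affinity matrix; logits: (n, k) assignment scores."""
    P = torch.softmax(logits, dim=1)             # probabilistic assignments
    cut = ((W @ (1.0 - P)) * P).sum(dim=0)       # expected cut per cluster
    size = P.sum(dim=0).clamp_min(1e-8)          # expected cluster sizes
    return (cut / size).sum()
```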

[CV-9] Ethical Considerations for the Military Use of Artificial Intelligence in Visual Reconnaissance

Quick read: This white paper addresses the responsible deployment of Artificial Intelligence (AI) in military contexts. The key is a framework grounded in ethical principles, with particular emphasis on the Fairness, Accountability, Transparency, and Ethics (FATE) guidelines. The paper additionally introduces ethical considerations tailored to military AI applications, including traceability, proportionality, governability, responsibility, and reliability, and discusses their application in three use cases across the sea, air, and land domains, using automated sensor-data analysis, eXplainable AI (XAI), and intuitive user experience to keep the use cases close to real-world scenarios.

Link: https://arxiv.org/abs/2502.03376
Authors: Mathias Anneken, Nadia Burkart, Fabian Jeschke, Achim Kuwertz-Wolf, Almuth Mueller, Arne Schumann, Michael Teutsch
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments: White Paper, 30 pages, 7 figures

Click to view abstract

Abstract:This white paper underscores the critical importance of responsibly deploying Artificial Intelligence (AI) in military contexts, emphasizing a commitment to ethical and legal standards. The evolving role of AI in the military goes beyond mere technical applications, necessitating a framework grounded in ethical principles. The discussion within the paper delves into ethical AI principles, particularly focusing on the Fairness, Accountability, Transparency, and Ethics (FATE) guidelines. Noteworthy considerations encompass transparency, justice, non-maleficence, and responsibility. Importantly, the paper extends its examination to military-specific ethical considerations, drawing insights from the Just War theory and principles established by prominent entities. In addition to the identified principles, the paper introduces further ethical considerations specifically tailored for military AI applications. These include traceability, proportionality, governability, responsibility, and reliability. The application of these ethical principles is discussed on the basis of three use cases in the domains of sea, air, and land. Methods of automated sensor data analysis, eXplainable AI (XAI), and intuitive user experience are utilized to specify the use cases close to real-world scenarios. This comprehensive approach to ethical considerations in military AI reflects a commitment to aligning technological advancements with established ethical frameworks. It recognizes the need for a balance between leveraging AI’s potential benefits in military operations while upholding moral and legal standards. The inclusion of these ethical principles serves as a foundation for responsible and accountable use of AI in the complex and dynamic landscape of military scenarios.

[CV-10] Deep Learning-Based Approach for Identification of Potato Leaf Diseases Using Wrapper Feature Selection and Feature Concatenation

Quick read: This paper addresses the early detection of late blight disease on potato leaves. The key is an image-processing and machine-learning pipeline with four stages: (1) histogram equalization to improve input image quality; (2) feature extraction with a deep CNN model, followed by concatenation of the extracted features; (3) wrapper-based feature selection; and (4) classification with an SVM classifier and its variants, reaching up to 99% accuracy with 550 selected features.

Link: https://arxiv.org/abs/2502.03370
Authors: Muhammad Ahtsam Naeem, Muhammad Asim Saleem, Muhammad Imran Sharif, Shahzad Akber, Sajjad Saleem, Zahid Akhtar, Kamran Siddique
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The potato is a widely grown crop in many regions of the world. In recent decades, potato farming has gained incredible traction in the world. Potatoes are susceptible to several illnesses that stunt their development. This plant seems to have significant leaf disease. Early Blight and Late Blight are two prevalent leaf diseases that affect potato plants. The early detection of these diseases would be beneficial for enhancing the yield of this crop. The ideal solution is to use image processing to identify and analyze these disorders. Here, we present an autonomous method based on image processing and machine learning to detect late blight disease affecting potato leaves. The proposed method comprises four different phases: (1) Histogram Equalization is used to improve the quality of the input image; (2) feature extraction is performed using a Deep CNN model, then these extracted features are concatenated; (3) feature selection is performed using wrapper-based feature selection; (4) classification is performed using an SVM classifier and its variants. This proposed method achieves the highest accuracy of 99% using SVM by selecting 550 features.

[CV-11] GHOST: Gaussian Hypothesis Open-Set Technique AAAI

Quick read: This paper addresses fairness in large-scale Open-Set Recognition (OSR), where overall metrics can mask dramatic differences in per-class performance. The key is the Gaussian Hypothesis Open Set Technique (GHOST), a hyperparameter-free algorithm that models deep features with class-wise multivariate Gaussians with diagonal covariance matrices and applies Z-score normalization to logits, mitigating the impact of feature magnitudes that deviate from the model's expectations and reducing the likelihood that the network assigns a high score to an unknown sample. Evaluated on multiple ImageNet-1K pre-trained deep networks with four different unknown datasets, GHOST achieves statistically significant improvements and advances the state of the art in large-scale OSR. A hedged scoring sketch follows the abstract below.

Link: https://arxiv.org/abs/2502.03359
Authors: Ryan Rabinowitz, Steve Cruz, Manuel Günther, Terrance E. Boult
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at AAAI Conference on Artificial Intelligence 2025

Click to view abstract

Abstract:Evaluations of large-scale recognition methods typically focus on overall performance. While this approach is common, it often fails to provide insights into performance across individual classes, which can lead to fairness issues and misrepresentation. Addressing these gaps is crucial for accurately assessing how well methods handle novel or unseen classes and ensuring a fair evaluation. To address fairness in Open-Set Recognition (OSR), we demonstrate that per-class performance can vary dramatically. We introduce Gaussian Hypothesis Open Set Technique (GHOST), a novel hyperparameter-free algorithm that models deep features using class-wise multivariate Gaussian distributions with diagonal covariance matrices. We apply Z-score normalization to logits to mitigate the impact of feature magnitudes that deviate from the model’s expectations, thereby reducing the likelihood of the network assigning a high score to an unknown sample. We evaluate GHOST across multiple ImageNet-1K pre-trained deep networks and test it with four different unknown datasets. Using standard metrics such as AUOSCR, AUROC and FPR95, we achieve statistically significant improvements, advancing the state-of-the-art in large-scale OSR. Source code is provided online.
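A hedged sketch of the two ingredients named in the abstract, per-class diagonal Gaussians over deep features and z-scored logits; how GHOST combines them into its final score is not specified here, so the additive fusion below is purely illustrative.

```python
# Fit a diagonal-covariance Gaussian per class on deep features, then score a
# test sample by its z-scored logits plus the per-class log-likelihood.
import numpy as np

def fit_class_gaussians(features, labels, num_classes, eps=1e-6):
    return [(features[labels == c].mean(axis=0),
             features[labels == c].var(axis=0) + eps)   # diagonal covariance
            for c in range(num_classes)]

def ghost_like_scores(feature, logits, stats):
    z = (logits - logits.mean()) / (logits.std() + 1e-6)   # z-scored logits
    loglik = np.array([-0.5 * (((feature - mu) ** 2) / var + np.log(var)).sum()
                       for mu, var in stats])
    return z + loglik   # illustrative fusion; a low max-score suggests "unknown"
```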

[CV-12] RadVLM: A Multitask Conversational Vision-Language Model for Radiology

Quick read: This paper addresses the lack of interactive diagnostic capability in automated chest X-ray (CXR) analysis and report generation. The key is RadVLM, a compact multitask conversational foundation model designed for CXR interpretation. The authors curate a large-scale instruction dataset of over 1 million image-instruction pairs and fine-tune RadVLM on it; the resulting model achieves state-of-the-art conversational capability and visual grounding while remaining competitive on other radiology tasks.

Link: https://arxiv.org/abs/2502.03333
Authors: Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
Affiliations: Department of Radiology, Kobe University, Kobe, Japan; Department of Computer Science, ETH Zurich, Zurich, Switzerland; Department of Advanced Imaging in Medical Magnetic Resonance, Kyoto University, Kyoto, Japan; Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland; Diagnostic and Interventional Radiology, University Hospital Zurich, Zurich, Switzerland
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 21 pages, 15 figures

Click to view abstract

Abstract:The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks – such as report generation, abnormality classification, and visual grounding – and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.

[CV-13] Controllable GUI Exploration

Quick read: This paper addresses the need, in early interface design, to produce multiple sketches to explore a design space, a stage that existing design tools support poorly because they insist on more detail than necessary; advances in generative AI have raised hopes, but expressing loose ideas in a prompt is impractical. The key is a diffusion-based approach for low-effort generation of interface sketches that offers flexible control through three types of inputs: prompts, wireframes, and visual flows. Designers can provide any combination of these at any level of detail and receive a diverse gallery of low-fidelity solutions in response, so large design spaces can be explored rapidly with very little input-specification effort.

Link: https://arxiv.org/abs/2502.03330
Authors: Aryan Garg, Yue Jiang, Antti Oulasvirta
Affiliations: Aalto University
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:During the early stages of interface design, designers need to produce multiple sketches to explore a design space. Design tools often fail to support this critical stage, because they insist on specifying more details than necessary. Although recent advances in generative AI have raised hopes of solving this issue, in practice they fail because expressing loose ideas in a prompt is impractical. In this paper, we propose a diffusion-based approach to the low-effort generation of interface sketches. It breaks new ground by allowing flexible control of the generation process via three types of inputs: A) prompts, B) wireframes, and C) visual flows. The designer can provide any combination of these as input at any level of detail, and will get a diverse gallery of low-fidelity solutions in response. The unique benefit is that large design spaces can be explored rapidly with very little effort in input-specification. We present qualitative results for various combinations of input specifications. Additionally, we demonstrate that our model aligns more accurately with these specifications than other models.

[CV-14] MAP Image Recovery with Guarantees using Locally Convex Multi-Scale Energy (LC-MUSE) Model

Quick read: This paper addresses uniqueness of solutions, convergence guarantees, and robustness to input perturbations in image inverse problems. The key is the Locally Convex Multi-Scale Energy (LC-MuSE) model: the gradient of the CNN that parameterizes the energy is constrained to be locally monotone, which makes the model strongly convex in a neighborhood of the data manifold and thereby yields the above properties. The relevant convexity identity is recalled after the abstract below.

Link: https://arxiv.org/abs/2502.03302
Authors: Jyothi Rikhab Chand, Mathews Jacob
Affiliations: Department of Electrical and Computer Engineering, University of Iowa, USA; Department of Electrical and Computer Engineering, University of Virginia, USA
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:We propose a multi-scale deep energy model that is strongly convex in the local neighbourhood around the data manifold to represent its probability density, with application in inverse problems. In particular, we represent the negative log-prior as a multi-scale energy model parameterized by a Convolutional Neural Network (CNN). We restrict the gradient of the CNN to be locally monotone, which constrains the model as a Locally Convex Multi-Scale Energy (LC-MuSE). We use the learned energy model in image-based inverse problems, where the formulation offers several desirable properties: i) uniqueness of the solution, ii) convergence guarantees to a minimum of the inverse problem, and iii) robustness to input perturbations. In the context of parallel Magnetic Resonance (MR) image reconstruction, we show that the proposed method performs better than the state-of-the-art convex regularizers, while the performance is comparable to plug-and-play regularizers and end-to-end trained methods.
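For reference, the textbook link between monotone gradients and strong convexity that this construction relies on: constraining the energy's gradient to be m-strongly monotone near the data manifold,

```latex
\langle \nabla E_\theta(x) - \nabla E_\theta(y),\; x - y \rangle \;\ge\; m \,\lVert x - y \rVert_2^2 , \qquad m > 0,
```

is equivalent, on a convex neighborhood, to E_\theta being m-strongly convex there, which is what yields a unique minimizer and convergent, perturbation-stable reconstructions.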

[CV-15] Conditional Prediction by Simulation for Automated Driving WWW

Quick read: This paper addresses the fact that modular automated driving systems handle prediction and planning as sequential, separate tasks, which rules out cooperative maneuvers. The key is a prediction model that captures the conditional dependencies between trajectories: predictions are generated by a microscopic traffic simulation in which the individual traffic participants are controlled by a realistic behavior model trained via Adversarial Inverse Reinforcement Learning. By assuming various candidate trajectories for the automated vehicle, predictions are generated conditioned on each of them, and the candidate trajectories can adapt dynamically during the prediction rollout, enabling cooperative planning.

Link: https://arxiv.org/abs/2502.03286
Authors: Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller
Affiliations: CARIAD SE; Karlsruhe Institute of Technology (KIT)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at "16. Uni-DAS e.V. Workshop Fahrerassistenz und automatisiertes Fahren". Link: this https URL

Click to view abstract

Abstract:Modular automated driving systems commonly handle prediction and planning as sequential, separate tasks, thereby prohibiting cooperative maneuvers. To enable cooperative planning, this work introduces a prediction model that models the conditional dependencies between trajectories. For this, predictions are generated by a microscopic traffic simulation, with the individual traffic participants being controlled by a realistic behavior model trained via Adversarial Inverse Reinforcement Learning. By assuming various candidate trajectories for the automated vehicle, we generate predictions conditioned on each of them. Furthermore, our approach allows the candidate trajectories to adapt dynamically during the prediction rollout. Several example scenarios are available at this https URL.

[CV-16] Deep Learning-based Event Data Coding: A Joint Spatiotemporal and Polarity Solution

Quick read: This paper addresses the efficient coding of the massive pixel-level event data produced by event cameras in high-dynamic-range, low-latency applications. Existing solutions focus on lossless coding; this paper instead proposes DL-JEC, a novel lossy Deep Learning-based Joint Event data Coding solution that adopts a single-point-cloud representation, making it possible to exploit the correlation between spatiotemporal and polarity information. DL-JEC achieves significant compression gains over relevant conventional and deep-learning-based state-of-the-art event data coding solutions, and with novel adaptive voxel binarization strategies adapted to the target task, lossy coding at reduced rates does not compromise downstream computer vision performance, notably for event classification. A minimal sketch of the single-cloud representation follows the abstract below.

Link: https://arxiv.org/abs/2502.03285
Authors: Abdelrahman Seleem (1, 2, 3), André F. R. Guarda (2), Nuno M. M. Rodrigues (2, 4), Fernando Pereira (1, 2) ((1) Instituto Superior Técnico - Universidade de Lisboa, Lisbon, Portugal, (2) Instituto de Telecomunicações, Portugal, (3) Faculty of Computers and Information, South Valley University, Qena, Egypt, (4) ESTG, Politécnico de Leiria, Leiria, Portugal)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Neuromorphic vision sensors, commonly referred to as event cameras, have recently gained relevance for applications requiring high-speed, high dynamic range and low-latency data acquisition. Unlike traditional frame-based cameras that capture 2D images, event cameras generate a massive number of pixel-level events, composed by spatiotemporal and polarity information, with very high temporal resolution, thus demanding highly efficient coding solutions. Existing solutions focus on lossless coding of event data, assuming that no distortion is acceptable for the target use cases, mostly including computer vision tasks. One promising coding approach exploits the similarity between event data and point clouds, thus allowing to use current point cloud coding solutions to code event data, typically adopting a two-point clouds representation, one for each event polarity. This paper proposes a novel lossy Deep Learning-based Joint Event data Coding (DL-JEC) solution adopting a single-point cloud representation, thus enabling to exploit the correlation between the spatiotemporal and polarity event information. DL-JEC can achieve significant compression performance gains when compared with relevant conventional and DL-based state-of-the-art event data coding solutions. Moreover, it is shown that it is possible to use lossy event data coding with its reduced rate regarding lossless coding without compromising the target computer vision task performance, notably for event classification. The use of novel adaptive voxel binarization strategies, adapted to the target task, further enables DL-JEC to reach a superior performance.
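A minimal sketch of the single-point-cloud representation described above: each event (x, y, t, polarity) becomes one 3D point with its polarity kept as a per-point attribute, rather than splitting events into two polarity-wise clouds.

```python
# Convert raw event streams to one geometry array plus a polarity attribute,
# so a point-cloud codec can exploit spatiotemporal-polarity correlation.
import numpy as np

def events_to_point_cloud(x, y, t, polarity, time_scale=1.0):
    x, y = np.asarray(x, float), np.asarray(y, float)
    t = np.asarray(t, dtype=np.float64)
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * time_scale
    points = np.stack([x, y, t_norm], axis=1)      # (N, 3) geometry
    attrs = np.asarray(polarity).reshape(-1, 1)    # (N, 1) polarity attribute
    return points, attrs
```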

[CV-17] When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning

Quick read: This paper addresses key shortcomings of pre-trained visual representations (PVRs) in policy learning, namely temporal entanglement and an inability to generalize even under minor scene perturbations, which hurt performance on tasks requiring temporal awareness and robustness to scene changes. The key is twofold: augmenting PVR features with temporal perception and a sense of task completion, effectively disentangling them in time, and introducing a module that learns to selectively attend to task-relevant local features, improving robustness on out-of-distribution scenes. These enhancements yield significant performance gains, particularly for PVRs trained with masking objectives.

Link: https://arxiv.org/abs/2502.03270
Authors: Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Chris Xiaoxuan Lu, Oisin Mac Aodha
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The integration of pre-trained visual representations (PVRs) into visuo-motor robot learning has emerged as a promising alternative to training visual encoders from scratch. However, PVRs face critical challenges in the context of policy learning, including temporal entanglement and an inability to generalise even in the presence of minor scene perturbations. These limitations hinder performance in tasks requiring temporal awareness and robustness to scene changes. This work identifies these shortcomings and proposes solutions to address them. First, we augment PVR features with temporal perception and a sense of task completion, effectively disentangling them in time. Second, we introduce a module that learns to selectively attend to task-relevant local features, enhancing robustness when evaluated on out-of-distribution scenes. Our experiments demonstrate significant performance improvements, particularly in PVRs trained with masking objectives, and validate the effectiveness of our enhancements in addressing PVR-specific limitations.

[CV-18] ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models

Quick read: This paper addresses Unseen Object Instance Segmentation (UOIS) for service robots in complex environments. Traditional supervised segmentation requires extensive annotated data, which is impractical for the diversity of real-world objects, while UOIS methods trained on synthetic data often suffer from the simulation-to-reality gap. The key of the proposed approach (ZISVFM) is to combine the zero-shot capability of the Segment Anything Model (SAM) with the explicit visual representations of a self-supervised vision transformer (ViT) in three stages: (1) generating object-agnostic mask proposals from colorized depth images with SAM, (2) filtering non-object masks using attention-based features from the self-supervised ViT, and (3) applying K-Medoids clustering to generate point prompts that guide SAM toward precise object segmentation. A toy k-medoids sketch for step (3) follows the abstract below.

Link: https://arxiv.org/abs/2502.03266
Authors: Ying Zhang, Maoliang Yin, Wenfu Bi, Haibao Yan, Shaohan Bian, Cui-Hua Zhang, Changchun Hua
Affiliations: School of Electrical Engineering, Yanshan University, Qinhuangdao, 066004, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Service robots operating in unstructured environments must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning-based segmentation techniques require extensive annotated datasets, which are impractical for the diversity of objects encountered in real-world scenarios. Unseen Object Instance Segmentation (UOIS) methods aim to address this by training models on synthetic data to generalize to novel objects, but they often suffer from the simulation-to-reality gap. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the segment anything model (SAM) and explicit visual representations from a self-supervised vision transformer (ViT). The proposed framework operates in three stages: (1) generating object-agnostic mask proposals from colorized depth images using SAM, (2) refining these proposals using attention-based features from the self-supervised ViT to filter non-object masks, and (3) applying K-Medoids clustering to generate point prompts that guide SAM towards precise object segmentation. Experimental validation on two benchmark datasets and a self-collected dataset demonstrates the superior performance of ZISVFM in complex environments, including hierarchical settings such as cabinets, drawers, and handheld objects. Our source code is available at this https URL.
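As a stand-in for the clustering step in stage (3), a toy PAM-style k-medoids over the pixel coordinates of a mask proposal (not the authors' implementation); the resulting medoids can then serve as positive point prompts for SAM.

```python
# Toy k-medoids: alternate between assigning points to the nearest medoid and
# re-electing, within each cluster, the member with minimal total distance.
import numpy as np

def k_medoids(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    medoids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(points[:, None] - medoids[None], axis=2)  # (N, k)
        assign = d.argmin(axis=1)
        for c in range(k):
            members = points[assign == c]
            if len(members) == 0:
                continue
            intra = np.linalg.norm(members[:, None] - members[None], axis=2)
            medoids[c] = members[intra.sum(axis=1).argmin()]
    return medoids   # e.g., use as point prompts for SAM on this mask
```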

[CV-19] Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration

Quick read: This paper addresses the sample imbalance across diseases that biases computer-aided diagnosis toward majority categories and degrades performance on rare ones. The key is the proposed Long-tailed Medical Diagnosis (LMD) framework: a Relation-aware Representation Learning (RRL) scheme first strengthens representations by encouraging the encoder to capture intrinsic semantic features, and an Iterative Classifier Calibration (ICC) scheme then calibrates the classifier iteratively by generating large numbers of balanced virtual features, compensating minority categories to enable unbiased classifier optimization while retaining the diagnostic knowledge of majority classes.

Link: https://arxiv.org/abs/2502.03238
Authors: Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by Computers in Biology and Medicine

Click to view abstract

Abstract:Recently computer-aided diagnosis has demonstrated promising performance, effectively alleviating the workload of clinicians. However, the inherent sample imbalance among different diseases leads algorithms biased to the majority categories, leading to poor performance for rare categories. Existing works formulated this challenge as a long-tailed problem and attempted to tackle it by decoupling the feature representation and classification. Yet, due to the imbalanced distribution and limited samples from tail classes, these works are prone to biased representation learning and insufficient classifier calibration. To tackle these problems, we propose a new Long-tailed Medical Diagnosis (LMD) framework for balanced medical image classification on long-tailed datasets. In the initial stage, we develop a Relation-aware Representation Learning (RRL) scheme to boost the representation ability by encouraging the encoder to capture intrinsic semantic features through different data augmentations. In the subsequent stage, we propose an Iterative Classifier Calibration (ICC) scheme to calibrate the classifier iteratively. This is achieved by generating a large number of balanced virtual features and fine-tuning the encoder using an Expectation-Maximization manner. The proposed ICC compensates for minority categories to facilitate unbiased classifier optimization while maintaining the diagnostic knowledge in majority classes. Comprehensive experiments on three public long-tailed medical datasets demonstrate that our LMD framework significantly surpasses state-of-the-art approaches. The source code can be accessed at this https URL.

[CV-20] Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search WWW

Quick read: This paper addresses pedestrian behavior recognition in the Text-based Person Anomaly Search (TPAS) task, i.e., accurately identifying pedestrians exhibiting normal or abnormal behavior within a large gallery of pedestrian images. The key is the Similarity Coverage Analysis (SCA) strategy, which tackles the recognition difficulty caused by similar text descriptions and effectively improves the model's handling of subtle differences, enhancing both the accuracy and the reliability of the search.

Link: https://arxiv.org/abs/2502.03230
Authors: Jiayi He, Shengeng Tang, Ao Liu, Lechao Cheng, Jingjing Wu, Yanyan Wei
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by the 2025 WWW Workshop on MORE

Click to view abstract

Abstract:This paper presents the HFUT-LMC team’s solution to the WWW 2025 challenge on Text-based Person Anomaly Search (TPAS). The primary objective of this challenge is to accurately identify pedestrians exhibiting either normal or abnormal behavior within a large library of pedestrian images. Unlike traditional video analysis tasks, TPAS significantly emphasizes understanding and interpreting the subtle relationships between text descriptions and visual data. The complexity of this task lies in the model’s need to not only match individuals to text descriptions in massive image datasets but also accurately differentiate between search results when faced with similar descriptions. To overcome these challenges, we introduce the Similarity Coverage Analysis (SCA) strategy to address the recognition difficulty caused by similar text descriptions. This strategy effectively enhances the model’s capacity to manage subtle differences, thus improving both the accuracy and reliability of the search. Our proposed solution demonstrated excellent performance in this challenge.

[CV-21] A Unified Framework for Semi-Supervised Image Segmentation and Registration

Quick read: This paper addresses the time and cost of obtaining annotations for medical image segmentation. The key is to incorporate an image registration model that generates pseudo-labels for the unannotated data, producing more geometrically correct pseudo-labels that improve model training. Experiments on a 2D brain dataset show excellent performance even with only 1% of the data annotated, outperforming conventional semi-supervised segmentation methods (e.g., the teacher-student model), particularly in low-annotation regimes.

Link: https://arxiv.org/abs/2502.03229
Authors: Ruizhe Li, Grazziela Figueredo, Dorothee Auer, Rob Dineen, Paul Morgan, Xin Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at IEEE International Symposium on Biomedical Imaging (ISBI) 2025

Click to view abstract

Abstract:Semi-supervised learning, which leverages both annotated and unannotated data, is an efficient approach for medical image segmentation, where obtaining annotations for the whole dataset is time-consuming and costly. Traditional semi-supervised methods primarily focus on extracting features and learning data distributions from unannotated data to enhance model training. In this paper, we introduce a novel approach incorporating an image registration model to generate pseudo-labels for the unannotated data, producing more geometrically correct pseudo-labels to improve the model training. Our method was evaluated on a 2D brain data set, showing excellent performance even using only 1% of the annotated data. The results show that our approach outperforms conventional semi-supervised segmentation methods (e.g. teacher-student model), particularly in a low percentage of annotation scenario. GitHub: this https URL.

[CV-22] GARAD-SLAM: 3D Gaussian Splatting for Real-time Anti-Dynamic SLAM

Quick read: This paper addresses the mapping errors and tracking drift that 3D Gaussian Splatting (3DGS)-based SLAM systems suffer in scenes with dynamic objects. The key is GARAD-SLAM, a real-time 3DGS-based SLAM system tailored to dynamic scenes: for tracking, it performs dynamic segmentation directly on the Gaussians and maps them back to the front end through a Gaussian pyramid network to obtain dynamic point labels, achieving precise dynamic removal and robust tracking; for mapping, it imposes rendering penalties on dynamically labeled Gaussians, which are updated through the network, avoiding the irreversible erroneous removal caused by simple pruning.

Link: https://arxiv.org/abs/2502.03228
Authors: Mingrui Li, Weijian Chen, Na Cheng, Jingyuan Xu, Dong Li, Hongyu Wang
Affiliations: School of Information and Communication Engineering, Dalian University of Technology; School of Aeronautics and Astronautics, Sun Yat-sen University; Faculty of Science and Technology, University of Macau
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The 3D Gaussian Splatting (3DGS)-based SLAM system has garnered widespread attention due to its excellent performance in real-time high-fidelity rendering. However, in real-world environments with dynamic objects, existing 3DGS-based SLAM systems often face mapping errors and tracking drift issues. To address these problems, we propose GARAD-SLAM, a real-time 3DGS-based SLAM system tailored for dynamic scenes. In terms of tracking, unlike traditional methods, we directly perform dynamic segmentation on Gaussians and map them back to the front-end to obtain dynamic point labels through a Gaussian pyramid network, achieving precise dynamic removal and robust tracking. For mapping, we impose rendering penalties on dynamically labeled Gaussians, which are updated through the network, to avoid irreversible erroneous removal caused by simple pruning. Our results on real-world datasets demonstrate that our method is competitive in tracking compared to baseline methods, generating fewer artifacts and higher-quality reconstructions in rendering.

[CV-23] MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent

Quick read: This paper addresses fine-grained motion control in text-guided image-to-video generation. The key is the motion field agent, which converts the motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively; an analytical optical-flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow, which an optical-flow adapter then uses to steer the base image-to-video diffusion model toward fine-grained controlled video generation.

Link: https://arxiv.org/abs/2502.03207
Authors: Xinyao Liao, Xianfang Zeng, Liao Wang, Gang Yu, Guosheng Lin, Chi Zhang
Affiliations: Nanyang Technological University; StepFun; ShanghaiTech University; Westlake University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Click to view abstract

Abstract:We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.

[CV-24] MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding

Quick read: This paper addresses the loss of critical information caused by uniform frame sampling in modern Video Large Language Models (VLLMs). The key is MaxInfo, a training-free method based on the maximum volume principle that selects and retains the most representative frames from the input video, reducing redundancy while preserving diversity. By maximizing the geometric volume formed by the selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, improving input-representation quality and long-video understanding performance. A brute-force greedy sketch of this selection follows the abstract below.

Link: https://arxiv.org/abs/2502.03183
Authors: Pengyi Li, Irina Abdullaeva, Alexander Gambashidze, Andrey Kuznetsov, Ivan Oseledets
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Modern Video Large Language Models (VLLMs) often rely on uniform frame sampling for video understanding, but this approach frequently fails to capture critical information due to frame redundancy and variations in video content. We propose MaxInfo, a training-free method based on the maximum volume principle, which selects and retains the most representative frames from the input video. By maximizing the geometric volume formed by selected embeddings, MaxInfo ensures that the chosen frames cover the most informative regions of the embedding space, effectively reducing redundancy while preserving diversity. This method enhances the quality of input representations and improves long video comprehension performance across benchmarks. For instance, MaxInfo achieves a 3.28% improvement on LongVideoBench and a 6.4% improvement on EgoSchema for LLaVA-Video-7B. It also achieves a 3.47% improvement for LLaVA-Video-72B. The approach is simple to implement and works with existing VLLMs without the need for additional training, making it a practical and effective alternative to traditional uniform sampling methods.
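A brute-force greedy reading of the maximum-volume principle (illustrative; the paper's exact algorithm may differ): repeatedly add the frame embedding that most increases the Gram determinant, i.e., the squared volume, of the selected set.

```python
# Greedy max-volume key-frame selection over (num_frames, dim) embeddings,
# assumed L2-normalized. Brute force with O(n * k) determinant evaluations;
# fine for small k.
import numpy as np

def select_key_frames(embeddings: np.ndarray, k: int) -> list:
    selected = [0]                                   # seed with the first frame
    while len(selected) < k:
        best_i, best_vol = None, -1.0
        for i in range(len(embeddings)):
            if i in selected:
                continue
            S = embeddings[selected + [i]]
            vol = np.linalg.det(S @ S.T)             # squared parallelotope volume
            if vol > best_vol:
                best_i, best_vol = i, vol
        selected.append(best_i)
    return sorted(selected)
```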

[CV-25] Tell2Reg: Establishing spatial correspondence between images by the same language prompts

Quick read: This paper addresses the automation and training requirements of image registration. The key is that a corresponding pair of segmented regions can be predicted on two different images from the same language prompt, using pre-trained large multimodal models based on GroundingDINO and SAM; this yields a fully automated, training-free registration algorithm (Tell2Reg). The approach removes the need for costly, time-consuming data curation and labelling, outperforms the unsupervised learning-based registration methods tested on challenging inter-subject prostate MR registration, and performs comparably to weakly-supervised methods.

Link: https://arxiv.org/abs/2502.03118
Authors: Wen Yan, Qianye Yang, Shiqi Huang, Yipei Wang, Shonit Punwani, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 5 pages, 3 figures, conference paper

Click to view abstract

Abstract:Spatial correspondence can be represented by pairs of segmented regions, such that the image registration networks aim to segment corresponding regions rather than predicting displacement fields or transformation parameters. In this work, we show that such a corresponding region pair can be predicted by the same language prompt on two different images using the pre-trained large multimodal models based on GroundingDINO and SAM. This enables a fully automated and training-free registration algorithm, potentially generalisable to a wide range of image registration tasks. In this paper, we present experimental results using one of the challenging tasks, registering inter-subject prostate MR images, which involves both highly variable intensity and morphology between patients. Tell2Reg is training-free, eliminating the need for costly and time-consuming data curation and labelling that was previously required for this registration task. This approach outperforms unsupervised learning-based registration methods tested, and has a performance comparable to weakly-supervised methods. Additional qualitative results are also presented to suggest that, for the first time, there is a potential correlation between language semantics and spatial correspondence, including the spatial invariance in language-prompted regions and the difference in language prompts between the obtained local and global correspondences. Code is available at this https URL.
zh

[CV-26] Edge Attention Module for Object Classification

【速读】:该论文旨在解决对象分类任务中因类别不平衡和类间相似性导致的传统卷积神经网络(Convolutional Neural Network, CNN)性能受限的问题。为解决这一问题,论文提出了一种全新的“边缘注意力模块(Edge Attention Module, EAM)”,其中包含最大-最小池化层和随后的卷积层。该模块通过引入一种专门用于捕捉关键边缘信息的新颖池化技术,使网络能够优先关注重要的边缘特征,从而显著提升模型的准确率和F1分数。实验结果表明,所提出的框架在Caltech-101和Caltech-256数据集上分别达到了95.5%和86%的准确率,优于现有的预训练CNN模型及近期趋势模型如Pooling-based Vision Transformer (PiT)、Convolutional Block Attention Module (CBAM)和ConvNext。
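
摘要未给出 Max-Min 池化的实现细节;下面是按“最大池化与最小池化之差刻画局部边缘强度”这一直观理解写出的 PyTorch 最小示意(编辑的假设性实现,模块结构与通道数均为假设,非作者代码)。

```python
import torch
import torch.nn as nn

class MaxMinPool2d(nn.Module):
    """Max-Min 池化的一种假设性实现:max-pool 与 min-pool 之差近似局部边缘强度。"""
    def __init__(self, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.maxpool = nn.MaxPool2d(kernel_size, stride, padding)

    def forward(self, x):
        min_pool = -self.maxpool(-x)       # min-pool 等价于对取负输入做 max-pool
        return self.maxpool(x) - min_pool

class EdgeAttentionModule(nn.Module):
    """示意性边缘注意力:Max-Min 池化后接卷积与 Sigmoid,输出逐像素注意力权重。"""
    def __init__(self, channels):
        super().__init__()
        self.pool = MaxMinPool2d()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.conv(self.pool(x))   # 以边缘响应调制原特征

x = torch.randn(2, 64, 32, 32)
print(EdgeAttentionModule(64)(x).shape)      # torch.Size([2, 64, 32, 32])
```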

链接: https://arxiv.org/abs/2502.03103
作者: Santanu Roy,Ashvath Suresh,Archit Gupta
机构: NIIT University (尼特大学); Christ (Deemed to be University) (基督大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages

点击查看摘要

Abstract:A novel "edge attention-based Convolutional Neural Network (CNN)" is proposed in this research for the object classification task. With the advent of advanced computing technology, CNN models have achieved remarkable success, particularly in computer vision applications. Nevertheless, the efficacy of the conventional CNN is often hindered due to class imbalance and inter-class similarity problems, which are particularly prominent in the computer vision field. In this research, we introduce for the first time an "Edge Attention Module (EAM)" consisting of a Max-Min pooling layer, followed by convolutional layers. This Max-Min pooling is an entirely novel pooling technique, specifically designed to capture only the edge information that is crucial for any object classification task. Therefore, by integrating this novel pooling technique into the attention module, the CNN network inherently prioritizes essential edge features, thereby boosting the accuracy and F1-score of the model significantly. We have implemented our proposed EAM or 2EAMs on several standard pre-trained CNN models for the Caltech-101, Caltech-256, CIFAR-100 and Tiny ImageNet-200 datasets. The extensive experiments reveal that our proposed framework (that is, EAM with CNN and 2EAMs with CNN) outperforms all pre-trained CNN models as well as recent trend models "Pooling-based Vision Transformer (PiT)", "Convolutional Block Attention Module (CBAM)", and ConvNext, by substantial margins. We have achieved accuracies of 95.5% and 86% with the proposed framework on the Caltech-101 and Caltech-256 datasets, respectively. To the best of our knowledge, these are the best results on these datasets so far.
zh

[CV-27] Human-Aligned Image Models Improve Visual Decoding from the Brain

【速读】:该论文旨在解决从脑活动中解码视觉图像的问题,以推进脑机交互并增强对人类感知的理解。论文的关键在于引入了与人类对齐的图像编码器来映射大脑信号到图像,这种方法通过更好地捕捉快速视觉刺激呈现中常见的感知属性,显著提升了图像检索精度,相比现有方法提高了最多21%。

链接: https://arxiv.org/abs/2502.03081
作者: Nona Rajabi,Antônio H. Ribeiro,Miguel Vasco,Farzaneh Taleb,Mårten Björkman,Danica Kragic
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Decoding visual images from brain activity has significant potential for advancing brain-computer interaction and enhancing the understanding of human perception. Recent approaches align the representation spaces of images and brain activity to enable visual decoding. In this paper, we introduce the use of human-aligned image encoders to map brain signals to images. We hypothesize that these models more effectively capture perceptual attributes associated with the rapid visual stimuli presentations commonly used in visual brain data recording experiments. Our empirical results support this hypothesis, demonstrating that this simple modification improves image retrieval accuracy by up to 21% compared to state-of-the-art methods. Comprehensive experiments confirm consistent performance improvements across diverse EEG architectures, image encoders, alignment methods, participants, and brain imaging modalities.
zh

[CV-28] RoboGrasp: A Universal Grasping Policy for Robust Robotic Control

【速读】:该论文旨在解决机器人抓取任务中的精确性和泛化性不足的问题。现有方法依赖于机器人臂状态数据和RGB图像,导致其在特定物体形状或位置上的过拟合。论文提出的关键解决方案是RoboGrasp框架,它集成了预训练的抓取检测模型与机器人学习,并利用来自目标检测和分割任务的强大视觉引导,显著提升了抓取的精度、稳定性和泛化能力,在少样本学习和抓取盒提示任务中成功率提高了高达34%。

链接: https://arxiv.org/abs/2502.03072
作者: Yiqi Huang,Travis Davies,Jiahuan Yan,Xiang Chen,Yu Tian,Luhui Hu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Imitation learning and world models have shown significant promise in advancing generalizable robotic learning, with robotic grasping remaining a critical challenge for achieving precise manipulation. Existing methods often rely heavily on robot arm state data and RGB images, leading to overfitting to specific object shapes or positions. To address these limitations, we propose RoboGrasp, a universal grasping policy framework that integrates pretrained grasp detection models with robotic learning. By leveraging robust visual guidance from object detection and segmentation tasks, RoboGrasp significantly enhances grasp precision, stability, and generalizability, achieving up to 34% higher success rates in few-shot learning and grasping box prompt tasks. Built on diffusion-based methods, RoboGrasp is adaptable to various robotic learning paradigms, enabling precise and reliable manipulation across diverse and complex scenarios. This framework represents a scalable and versatile solution for tackling real-world challenges in robotic grasping.
zh

[CV-29] High-frequency near-eye ground truth for event-based eye tracking

【速读】:该论文旨在解决事件驱动型传感器在智能眼镜技术中用于高效低功耗眼动追踪时,可用数据集有限,尤其是缺乏关键的眼位标注数据的问题。论文的关键解决方案在于提出了一种专门设计用于事件驱动数据标注的半自动标注流程,并提供了以200Hz频率计算得到的瞳孔检测标注数据,从而为科学界提供了一个改进版的流行事件驱动型眼动追踪数据集。

链接: https://arxiv.org/abs/2502.03057
作者: Andrea Simpsi,Andrea Aspesi,Simone Mentasti,Luca Merigo,Tommaso Ongarello,Matteo Matteucci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event-based eye tracking is a promising solution for efficient and low-power eye tracking in smart eyewear technologies. However, the novelty of event-based sensors has resulted in a limited number of available datasets, particularly those with eye-level annotations, crucial for algorithm validation and deep-learning training. This paper addresses this gap by presenting an improved version of a popular event-based eye-tracking dataset. We introduce a semi-automatic annotation pipeline specifically designed for event-based data annotation. Additionally, we provide the scientific community with the computed annotations for pupil detection at 200Hz.
zh

[CV-30] Driver Assistance System Based on Multimodal Data Hazard Detection

【速读】:该论文旨在解决自动驾驶中罕见且不可预测驾驶事件的检测难题,现有方法主要依赖于单模态道路状况视频数据,这限制了其捕捉异常事件的能力。解决方案的关键在于提出了一种多模态驾驶员辅助检测系统,该系统整合了道路状况视频、驾驶员面部视频及音频数据,采用基于注意力机制的中间融合策略,实现了端到端学习,无需单独进行特征提取。
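
下面用 PyTorch 勾勒“基于注意力的中间融合”的一种常见实现思路(编辑示意:以道路视频特征为 query,对拼接后的三模态中间特征做交叉注意力;各编码器、特征维度与类别数均为假设,并非论文原始结构)。

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """中间融合示意:三模态特征在中间层经交叉注意力融合后直接分类(端到端)。"""
    def __init__(self, dim=256, heads=4, num_classes=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, road, face, audio):
        # road/face/audio: (B, T, dim),假设已由各自的骨干网络编码为序列特征
        context = torch.cat([road, face, audio], dim=1)   # (B, 3T, dim)
        fused, _ = self.attn(road, context, context)      # 跨模态注意力
        return self.head(fused.mean(dim=1))               # 时序平均后分类

B, T, D = 2, 16, 256
model = AttentionFusion()
logits = model(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D))
print(logits.shape)  # torch.Size([2, 5])
```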

链接: https://arxiv.org/abs/2502.03005
作者: Long Zhouxiang,Ovanes Petrosian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autonomous driving technology has advanced significantly, yet detecting driving anomalies remains a major challenge due to the long-tailed distribution of driving events. Existing methods primarily rely on single-modal road condition video data, which limits their ability to capture rare and unpredictable driving incidents. This paper proposes a multimodal driver assistance detection system that integrates road condition video, driver facial video, and audio data to enhance incident recognition accuracy. Our model employs an attention-based intermediate fusion strategy, enabling end-to-end learning without separate feature extraction. To support this approach, we develop a new three-modality dataset using a driving simulator. Experimental results demonstrate that our method effectively captures cross-modal correlations, reducing misjudgments and improving driving safety.
zh

[CV-31] Disentangling CLIP Features for Enhanced Localized Understanding

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在细粒度任务中由于特征相互信息(Mutual Feature Information, MFI)导致的特征纠缠问题。论文的关键解决方案是提出Unmix-CLIP框架,通过引入MFI损失函数显式分离文本特征,并利用多标签识别(Multi-Label Recognition, MLR)确保图像特征与分离后的文本特征对齐,从而实现图像和文本特征在跨模态上的解缠和对齐,以改善下游任务中的特征分离。
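
MFI 损失的核心是最小化投影空间中类间文本特征的相似度;下面给出按此思路构造的示意性损失(编辑假设:用线性投影加非对角余弦相似度惩罚,未必与论文公式逐字一致)。

```python
import torch
import torch.nn.functional as F

def mfi_loss(text_features: torch.Tensor) -> torch.Tensor:
    """示意性 MFI 损失:惩罚投影后各类文本特征间(非对角)的余弦相似度。
    text_features: (C, D),每行为一个类别的投影文本特征。"""
    z = F.normalize(text_features, dim=-1)
    sim = z @ z.T                                   # (C, C) 余弦相似度矩阵
    off_diag = sim - torch.diag(torch.diag(sim))    # 去掉自相似项
    return (off_diag ** 2).mean()

proj = torch.nn.Linear(512, 256)       # 将 CLIP 文本特征投影到低类间相似度的空间
text = torch.randn(80, 512)            # 假设 80 个类别的 CLIP 文本特征
loss = mfi_loss(proj(text))
loss.backward()                        # 实际训练中与多标签识别(MLR)损失联合优化
print(float(loss))
```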

链接: https://arxiv.org/abs/2502.02977
作者: Samyak Rawelekar,Yujun Cai,Yiwei Wang,Ming-Hsuan Yang,Narendra Ahuja
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) demonstrate impressive capabilities in coarse-grained tasks like image classification and retrieval. However, they struggle with fine-grained tasks that require localized understanding. To investigate this weakness, we comprehensively analyze CLIP features and identify an important issue: semantic features are highly correlated. Specifically, the features of a class encode information about other classes, which we call mutual feature information (MFI). This mutual information becomes evident when we query a specific class and unrelated objects are activated along with the target class. To address this issue, we propose Unmix-CLIP, a novel framework designed to reduce MFI and improve feature disentanglement. We introduce MFI loss, which explicitly separates text features by projecting them into a space where inter-class similarity is minimized. To ensure a corresponding separation in image features, we use multi-label recognition (MLR) to align the image features with the separated text features. This ensures that both image and text features are disentangled and aligned across modalities, improving feature separation for downstream tasks. For the COCO-14 dataset, Unmix-CLIP reduces feature similarity by 24.9%. We demonstrate its effectiveness through extensive evaluations of MLR and zero-shot semantic segmentation (ZS3). In MLR, our method performs competitively on the VOC2007 and surpasses SOTA approaches on the COCO-14 dataset, using fewer training parameters. Additionally, Unmix-CLIP consistently outperforms existing ZS3 methods on COCO and VOC.
zh

[CV-32] VQA-Levels: A Hierarchical Approach for Classifying Questions in VQA

【速读】:该论文旨在解决现有视觉问答(Visual Question Answering, VQA)基准数据集在系统性测试方法上的不足。论文的关键解决方案是提出了一套新的基准数据集——VQA-Levels,该数据集将问题分为七个级别,从基于低级图像特征的直接答案到需要高级抽象理解的问题。通过这种方式,研究者可以更全面地评估VQA系统的性能,并推动该领域的发展。

链接: https://arxiv.org/abs/2502.02951
作者: Madhuri Latha Madaka,Chakravarthy Bhagvati
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Designing datasets for Visual Question Answering (VQA) is a difficult and complex task that requires NLP for parsing and computer vision for analysing the relevant aspects of the image for answering the question asked. Several benchmark datasets have been developed by researchers but there are many issues with using them for methodical performance tests. This paper proposes a new benchmark dataset – a pilot version called VQA-Levels is ready now – for testing VQA systems systematically and assisting researchers in advancing the field. The questions are classified into seven levels ranging from direct answers based on low-level image features (without needing even a classifier) to those requiring high-level abstraction of the entire image content. The questions in the dataset exhibit one or many of ten properties. Each is categorised into a specific level from 1 to 7. Levels 1 - 3 are directly on the visual content while the remaining levels require extra knowledge about the objects in the image. Each question generally has a unique one or two-word answer. The questions are ‘natural’ in the sense that a human is likely to ask such a question when seeing the images. An example question at Level 1 is, "What is the shape of the red colored region in the image?" while at Level 7, it is, "Why is the man cutting the paper?". Initial testing of the proposed dataset on some of the existing VQA systems reveals that their success is high on Level 1 (low level features) and Level 2 (object classification) questions, least on Level 3 (scene text) followed by Level 6 (extrapolation) and Level 7 (whole scene analysis) questions. The work in this paper will go a long way to systematically analyze VQA systems.
zh

[CV-33] Every Angle Is Worth A Second Glance: Mining Kinematic Skeletal Structures from Multi-view Joint Cloud

【速读】:该论文旨在解决多人运动捕捉在稀疏角度观测条件下因自我遮挡和相互遮挡导致的问题。现有方法虽能生成精确的二维关节检测结果,但在三角化并提升至三维时,难以选择最准确的候选关节,并将其正确关联到相应的关节类型和目标身份。为充分利用所有准确的二维关节位置信息,论文提出独立地在所有视角中相同类型的二维关节之间进行三角化,形成关节云(Joint Cloud)。该关节云包含来自相同关节类型和目标ID的有效关节以及错误构造的虚假关节。针对冗余和不准确的候选关节,论文引入了一种名为关节云选择与聚合变换器(JCSAT)的方法,该方法通过三个级联编码器深入探索跨嵌入空间中所有三维点候选之间的轨迹、骨骼结构和视角相关性。此外,文中还提出了一种最优标记注意力路径(OTAP)模块,用于从这些冗余观测中选择和聚合信息特征,最终预测人体运动。为验证JCSAT的有效性,作者构建并发布了新的多人群运动捕捉数据集BUMocap-X,其中包含复杂交互和严重遮挡的情况。实验结果表明,所提出的框架在挑战性的遮挡场景下尤其优于现有的最先进方法。
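
关节云的构造步骤可示意如下:对某一关节类型,跨视角对所有 2D 检测两两三角化且不区分目标 ID,得到混有有效点与伪点的候选集(此处用 OpenCV 的 triangulatePoints;投影矩阵与检测坐标为随机假设数据,JCSAT 的后续筛选与聚合不在本示意范围内)。

```python
import itertools
import numpy as np
import cv2

def build_joint_cloud(projs, joints_2d):
    """projs: 各视角的 3x4 投影矩阵;joints_2d: 各视角同一关节类型的 (N_v, 2) 检测。
    对所有视角对、所有检测对做三角化,得到含有效点与伪点的"关节云"。"""
    cloud = []
    for vi, vj in itertools.combinations(range(len(projs)), 2):
        for pi in joints_2d[vi]:
            for pj in joints_2d[vj]:
                X = cv2.triangulatePoints(projs[vi], projs[vj],
                                          pi.reshape(2, 1).astype(np.float64),
                                          pj.reshape(2, 1).astype(np.float64))
                cloud.append((X[:3] / X[3]).ravel())   # 齐次坐标归一化为 3D 点
    return np.array(cloud)

# 假设 3 个视角,每个视角检测到 2 个"左肩"关节(可能来自不同目标)
rng = np.random.default_rng(0)
projs = [rng.normal(size=(3, 4)) for _ in range(3)]
joints = [rng.uniform(0, 640, size=(2, 2)) for _ in range(3)]
print(build_joint_cloud(projs, joints).shape)  # (12, 3):3 个视角对 × 2×2 个检测对
```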

链接: https://arxiv.org/abs/2502.02936
作者: Junkun Jiang,Jie Chen,Ho Yin Au,Mingyuan Chen,Wei Xue,Yike Guo
机构: Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系); Division of Emerging Interdisciplinary Areas, the Hong Kong University of Science and Technology(香港科技大学新兴交叉学科研究院); Department of Computer Science and Engineering, the Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Visualization and Computer Graphics

点击查看摘要

Abstract:Multi-person motion capture over sparse angular observations is a challenging problem under interference from both self- and mutual-occlusions. Existing works produce accurate 2D joint detection, however, when these are triangulated and lifted into 3D, available solutions all struggle in selecting the most accurate candidates and associating them to the correct joint type and target identity. As such, in order to fully utilize all accurate 2D joint location information, we propose to independently triangulate between all same-typed 2D joints from all camera views regardless of their target ID, forming the Joint Cloud. The Joint Cloud consists of both valid joints lifted from the same joint type and target ID, as well as falsely constructed ones that are from different 2D sources. These redundant and inaccurate candidates are processed over the proposed Joint Cloud Selection and Aggregation Transformer (JCSAT) involving three cascaded encoders which deeply explore the trajectile, skeletal structural, and view-dependent correlations among all 3D point candidates in the cross-embedding space. An Optimal Token Attention Path (OTAP) module is proposed which subsequently selects and aggregates informative features from these redundant observations for the final prediction of human motion. To demonstrate the effectiveness of JCSAT, we build and publish a new multi-person motion capture dataset BUMocap-X with complex interactions and severe occlusions. Comprehensive experiments over the newly presented as well as benchmark datasets validate the effectiveness of the proposed framework, which outperforms all existing state-of-the-art methods, especially under challenging occlusion scenarios.
zh

[CV-34] Elucidating the Preconditioning in Consistency Distillation ICLR2025

【速读】:该论文旨在解决一致性蒸馏(Consistency Distillation)过程中预处理方法(preconditioning)可能存在的次优选择问题。解决方案的关键在于首次提供了一种理论上的见解,阐明了预处理的设计标准及其与教师模型常微分方程(ODE)轨迹之间的联系。基于这些分析,作者提出了一种称为“Analytic-Precond”的原则性方法,能够根据一致性差距(即教师去噪器与最优学生去噪器之间的差距)在广义教师ODE上进行解析优化。这种方法能够促进轨迹跳跃的学习,增强学生轨迹与教师轨迹的一致性,并实现多步生成中一致性轨迹模型训练速度2到3倍的加速。
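
预处理把一致性函数写成 f(x, t) = c_skip(t)·x + c_out(t)·F_θ(x, t),并通过系数施加边界条件 f(x, ε) = x。下面用一致性模型文献中常见的手工(EDM 风格)系数演示这一机制(示意代码;论文的 Analytic-Precond 会在广义教师 ODE 上解析求解这些系数,此处未实现)。

```python
import torch

def edm_precond_coeffs(t, sigma_data=0.5, eps=0.002):
    """一致性模型常用的手工预处理系数:t = eps 时 c_skip = 1、c_out = 0。"""
    c_skip = sigma_data**2 / ((t - eps)**2 + sigma_data**2)
    c_out = sigma_data * (t - eps) / torch.sqrt(t**2 + sigma_data**2)
    return c_skip, c_out

def consistency_function(net, x, t):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t),自动满足边界条件 f(x, eps) = x。"""
    c_skip, c_out = edm_precond_coeffs(t)
    return c_skip * x + c_out * net(x, t)

net = lambda x, t: -x                       # 占位网络,仅用于演示
x = torch.randn(4, 3, 8, 8)
t = torch.full((4, 1, 1, 1), 0.002)         # t = eps 时应原样返回 x
print(torch.allclose(consistency_function(net, x, t), x))  # True
```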

链接: https://arxiv.org/abs/2502.02922
作者: Kaiwen Zheng,Guande He,Jianfei Chen,Fan Bao,Jun Zhu
机构: Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, THBI Lab (计算机科学与技术系,人工智能研究院,BNRist中心,THBI实验室); Tsinghua-Bosch Joint ML Center, Tsinghua University (清华大学博世联合机器学习中心,清华大学); Shengshu Technology (深数科技); Pazhou Lab (Huangpu), Guangzhou, China (广州黄埔琶洲实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICLR 2025

点击查看摘要

Abstract:Consistency distillation is a prevalent way for accelerating diffusion models adopted in consistency (trajectory) models, in which a student model is trained to traverse backward on the probability flow (PF) ordinary differential equation (ODE) trajectory determined by the teacher model. Preconditioning is a vital technique for stabilizing consistency distillation, by linearly combining the input data and the network output with pre-defined coefficients as the consistency function. It imposes the boundary condition of consistency functions without restricting the form and expressiveness of the neural network. However, previous preconditionings are hand-crafted and may be suboptimal choices. In this work, we offer the first theoretical insights into the preconditioning in consistency distillation, by elucidating its design criteria and the connection to the teacher ODE trajectory. Based on these analyses, we further propose a principled way dubbed Analytic-Precond to analytically optimize the preconditioning according to the consistency gap (defined as the gap between the teacher denoiser and the optimal student denoiser) on a generalized teacher ODE. We demonstrate that Analytic-Precond can facilitate the learning of trajectory jumpers, enhance the alignment of the student trajectory with the teacher's, and achieve 2× to 3× training acceleration of consistency trajectory models in multi-step generation across various datasets.
zh

[CV-35] Maximizing the Position Embedding for Vision Transformers with Global Average Pooling AAAI2025

【速读】:该论文旨在解决在视觉变换器(Vision Transformers)中,层间结构下全局平均池化(Global Average Pooling, GAP)方法与位置嵌入(Position Embedding, PE)之间的冲突问题。解决方案的关键在于提出MPVG方法,通过最大化PE在层间结构中的有效性来克服这一问题,确保PE能够有效地平衡各个层中的标记嵌入值,从而显著提升视觉变换器在多种任务中的性能。
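
layer-wise 结构的要点是:位置嵌入在每一层重新注入,且 token 嵌入与 PE 分别经过独立的 LayerNorm;下面给出其与 GAP 结合的最小 PyTorch 示意(结构细节与维度为编辑假设,非 MPVG 官方实现)。

```python
import torch
import torch.nn as nn

class LayerwisePEBlock(nn.Module):
    """每层对 token 嵌入与位置嵌入分别做独立 LayerNorm 后相加,再送入注意力。"""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm_tok = nn.LayerNorm(dim)    # token 嵌入的独立 LayerNorm
        self.norm_pe = nn.LayerNorm(dim)     # 位置嵌入的独立 LayerNorm
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, x, pe):
        h = self.norm_tok(x) + self.norm_pe(pe)   # PE 在每一层重新注入
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(x)

dim, n_tokens = 192, 196
pe = nn.Parameter(torch.zeros(1, n_tokens, dim))        # 各层共享的可学习 PE
blocks = nn.ModuleList([LayerwisePEBlock(dim) for _ in range(4)])
x = torch.randn(2, n_tokens, dim)
for blk in blocks:
    x = blk(x, pe)
logits = nn.Linear(dim, 1000)(x.mean(dim=1))            # GAP:对所有 token 取平均后分类
print(logits.shape)  # torch.Size([2, 1000])
```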

链接: https://arxiv.org/abs/2502.02919
作者: Wonjun Lee,Bumsub Ham,Suhyun Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at AAAI 2025

点击查看摘要

Abstract:In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, in vision transformer structures, there is a limitation in the expressiveness of PE due to the structure where position embedding is simply added to the token embedding. A layer-wise method that delivers PE to each layer and applies independent Layer Normalizations for token embedding and PE has been adopted to overcome this limitation. In this paper, we identify the conflicting result that occurs in a layer-wise structure when using the global average pooling (GAP) method instead of the class token. To overcome this problem, we propose MPVG, which maximizes the effectiveness of PE in a layer-wise structure with GAP. Specifically, we identify that PE counterbalances token embedding values at each layer in a layer-wise structure. Furthermore, we recognize that the counterbalancing role of PE is insufficient in the layer-wise structure, and we address this by maximizing the effectiveness of PE through MPVG. Through experiments, we demonstrate that PE performs a counterbalancing role and that maintaining this counterbalancing directionality significantly impacts vision transformers. As a result, the experimental results show that MPVG outperforms existing methods across vision transformers on various tasks.
zh

[CV-36] PoleStack: Robust Pole Estimation of Irregular Objects from Silhouette Stacking

【速读】:该论文旨在解决通过多视角轮廓图像估计主轴旋转器旋转极点的问题。解决方案的关键在于首先将一组图像堆叠成单一的轮廓堆栈图像,并通过识别轮廓堆栈中的最大对称性来估算投影极点方向。为了处理质心图像位置的未知性,应用离散傅里叶变换(Discrete Fourier Transform, DFT)生成轮廓堆栈幅度谱,从而实现平移不变性和增强抗噪能力。其次,通过结合从不同相机姿态收集的两个或多个投影极点测量值,估算三维极点方向。该方法展示了在低分辨率图像下达到度级精度的极点估计,并表现出对严重表面阴影和基于质心的图像配准误差的鲁棒性。
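
下面用 numpy/scipy 演示该方法的两个核心步骤:轮廓叠加成堆栈图,以及在 DFT 幅度谱上按角度打分镜像对称性(合成数据与打分细节为编辑的简化假设,实际论文在此基础上结合多视角测量求解 3D 极点方向)。

```python
import numpy as np
from scipy.ndimage import rotate

def projected_pole_angle(silhouettes, angles=np.arange(0, 180, 2)):
    """silhouettes: (N, H, W) 的二值轮廓图序列。
    返回幅度谱镜像对称性最高的角度(度),作为投影极点方向的估计。"""
    stack = silhouettes.sum(axis=0)                     # 轮廓堆栈
    amp = np.abs(np.fft.fftshift(np.fft.fft2(stack)))   # DFT 幅度谱:对平移不变
    scores = []
    for a in angles:
        r = rotate(amp, a, reshape=False, order=1)      # 将候选对称轴转到竖直方向
        s = r[:, 1:]                                    # 偶数尺寸 FFT 去掉无配对的首列
        scores.append(-np.abs(s - np.fliplr(s)).mean()) # 左右镜像差越小越对称
    # 注意:实数图像幅度谱中心对称,打分存在 90° 方向歧义,此处仅作原理演示
    return angles[int(np.argmax(scores))]

# 演示:绕竖直轴旋转的椭圆体,其轮廓宽度随相位变化
yy, xx = np.mgrid[:64, :64]
sils = np.stack([(((xx - 32) / (10 + 6 * np.cos(t))) ** 2
                  + ((yy - 32) / 20) ** 2 < 1).astype(float)
                 for t in np.linspace(0, 2 * np.pi, 24)])
print(projected_pole_angle(sils))  # 期望约 0 度(对称轴竖直)
```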

链接: https://arxiv.org/abs/2502.02907
作者: Jacopo Villa,Jay W. McMahon,Issa A. D. Nesnas
机构: University of Colorado Boulder (科罗拉多大学博尔德分校); Jet Propulsion Laboratory, California Institute of Technology (喷气推进实验室,加州理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present an algorithm to estimate the rotation pole of a principal-axis rotator using silhouette images collected from multiple camera poses. First, a set of images is stacked to form a single silhouette-stack image, where the object’s rotation introduces reflective symmetry about the imaged pole direction. We estimate this projected-pole direction by identifying maximum symmetry in the silhouette stack. To handle unknown center-of-mass image location, we apply the Discrete Fourier Transform to produce the silhouette-stack amplitude spectrum, achieving translation invariance and increased robustness to noise. Second, the 3D pole orientation is estimated by combining two or more projected-pole measurements collected from different camera orientations. We demonstrate degree-level pole estimation accuracy using low-resolution imagery, showing robustness to severe surface shadowing and centroid-based image-registration errors. The proposed approach could be suitable for pole estimation during both the approach phase toward a target object and while hovering.
zh

[CV-37] Enhancing Quantum-ready QUBO-based Suppression for Object Detection with Appearance and Confidence Features

【速读】:该论文旨在解决现有基于二次无约束二进制优化(QUBO)的方法在处理拥挤场景中的遮挡目标检测时仍可能遗漏的问题。关键在于提出了一种新的QUBO公式,该公式通过整合图像相似性度量计算得到的外观特征(appearance feature)以及置信度分数的乘积来改进现有的QUBO公式中的成对分数(pairwise score),从而更有效地区分预测重叠是由遮挡引起的还是冗余预测导致的。这一改进显著提升了平均精度均值(mAP)和平均召回率均值(mAR),分别提高了4.54和9.89个百分点,且没有显著增加运行时间。
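
此类方法把抑制问题写成 QUBO 能量 xᵀQx:对角项以负置信度奖励保留检测,非对角项惩罚冗余重叠;论文的改进即在成对项中乘入外观相似度与置信度乘积。下面给出一个小规模构造与暴力枚举求解的示意(系数、外观相似度矩阵均为编辑假设,实际可交由量子或模拟退火求解器)。

```python
import itertools
import numpy as np

def iou(a, b):
    """a, b: [x1, y1, x2, y2] 形式的边界框。"""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def build_qubo(boxes, scores, appearance, lam=2.0):
    """对角项 -score 奖励高置信度;非对角项 IoU×外观相似度×置信度乘积惩罚冗余。"""
    n = len(boxes)
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] = -scores[i]
        for j in range(i + 1, n):
            pair = lam * iou(boxes[i], boxes[j]) * appearance[i, j] * scores[i] * scores[j]
            Q[i, j] = Q[j, i] = pair / 2    # 对称分摊,使 x^T Q x 中该对恰贡献 pair
    return Q

def brute_force_solve(Q):
    return np.array(min(itertools.product([0, 1], repeat=Q.shape[0]),
                        key=lambda x: np.array(x) @ Q @ np.array(x)))

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.6, 0.8]
app = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]])  # 假设的外观相似度
print(brute_force_solve(build_qubo(boxes, scores, app)))  # [1 0 1]:抑制冗余的框 1
```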

链接: https://arxiv.org/abs/2502.02895
作者: Keiichiro Yamamura,Toru Mitsutake,Hiroki Ishikura,Daiki Kusuhara,Akihiro Yoshida,Katsuki Fujisawa
机构: Institute of Integrated Research, Institute of Science Tokyo(集成研究综合研究所); Kyushu University(九州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages for main contents, 3 pages for appendix, 3 pages for reference

点击查看摘要

Abstract:Quadratic Unconstrained Binary Optimization (QUBO)-based suppression in object detection is known to have superiority to conventional Non-Maximum Suppression (NMS), especially for crowded scenes where NMS possibly suppresses the (partially-) occluded true positives with low confidence scores. Whereas existing QUBO formulations are less likely to miss occluded objects than NMS, there is room for improvement because existing QUBO formulations naively consider confidence scores and pairwise scores based on spatial overlap between predictions. This study proposes new QUBO formulations that aim to distinguish whether the overlap between predictions is due to the occlusion of objects or due to redundancy in prediction, i.e., multiple predictions for a single object. The proposed QUBO formulation integrates two features into the pairwise score of the existing QUBO formulation: i) the appearance feature calculated by the image similarity metric and ii) the product of confidence scores. These features are derived from the hypothesis that redundant predictions share a similar appearance feature and (partially-) occluded objects have low confidence scores, respectively. The proposed methods demonstrate significant advancement over state-of-the-art QUBO-based suppression without a notable increase in runtime, achieving up to 4.54 points improvement in mAP and 9.89 points gain in mAR.
zh

[CV-38] INST-Sculpt: Interactive Stroke-based Neural SDF Sculpting

【速读】:该论文旨在解决在隐式神经表示上直接进行交互式表面雕刻编辑的问题。传统基于网格的工具如ZBrush虽然能够快速直观地进行编辑,但缺乏适用于隐式SDF雕塑的相应工具。论文的关键解决方案在于引入了一种框架,允许用户直接在神经隐式表示上进行基于笔画的修改,通过采用管状邻域采样笔画和自定义笔刷轮廓,实现沿用户定义曲线的平滑变形,从而提供精确的雕塑控制,确保在保持隐式表示平滑特性的同时进行复杂多样的编辑操作。

链接: https://arxiv.org/abs/2502.02891
作者: Fizza Rubab,Yiying Tong
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in implicit neural representations have made them a popular choice for modeling 3D geometry, achieving impressive results in tasks such as shape representation, reconstruction, and learning priors. However, directly editing these representations poses challenges due to the complex relationship between model weights and surface regions they influence. Among such editing tools, sculpting, which allows users to interactively carve or extrude the surface, is a valuable editing operation to the graphics and modeling community. While traditional mesh-based tools like ZBrush facilitate fast and intuitive edits, a comparable toolkit for sculpting neural SDFs is currently lacking. We introduce a framework that enables interactive surface sculpting edits directly on neural implicit representations. Unlike previous works limited to spot edits, our approach allows users to perform stroke-based modifications on the fly, ensuring intuitive shape manipulation without switching representations. By employing tubular neighborhoods to sample strokes and custom brush profiles, we achieve smooth deformations along user-defined curves, providing precise control over the sculpting process. Our method demonstrates that intricate and versatile edits can be made while preserving the smooth nature of implicit representations.
zh

[CV-39] Expertized Caption Auto-Enhancement for Video-Text Retrieval

【速读】:该论文旨在解决视频文本检索领域中由于视频文字描述不足导致的文本与视频匹配难题。关键解决方案在于提出了一种自动标题增强方法,通过自学习提升表达质量和减少增强标题中的经验主义,并设计了一个专家化标题选择机制,以个性化方式定制每个视频的增强标题,从而促进文本与视频的匹配。这种方法完全基于数据驱动,不仅避免了大量数据收集和计算负担,还通过规避词典依赖性和引入个性化匹配来提高自适应性。

链接: https://arxiv.org/abs/2502.02885
作者: Junxiang Chen,Baoyao yang,Wenbin Yao
机构: WeChat, Tencent; Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The burgeoning field of video-text retrieval has witnessed significant advancements with the advent of deep learning. However, the challenge of matching text and video persists due to inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders a comprehensive understanding of videos, resulting in ambiguous retrieval results. While rewriting methods based on large language models have been proposed to broaden text expressions, carefully crafted prompts are essential to ensure the reasonableness and completeness of the rewritten texts. This paper proposes an automatic caption enhancement method that enhances expression quality and mitigates empiricism in augmented captions through self-learning. Additionally, an expertized caption selection mechanism is designed and introduced to customize augmented captions for each video, facilitating video-text matching. Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability by circumventing lexicon dependence and introducing personalized matching. The superiority of our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo.
zh

[CV-40] Domain-Invariant Per-Frame Feature Extraction for Cross-Domain Imitation Learning with Visual Observations ICML2025

【速读】:该论文旨在解决在跨领域场景下,模仿学习(Imitation Learning, IL)面临的高维、噪声和不完整视觉观测数据的挑战。论文的关键解决方案是提出了一种名为Domain-Invariant Per-Frame Feature Extraction for Imitation Learning (DIFF-IL) 的方法,该方法从单帧图像中提取域不变特征,并将其适配到序列中以隔离和复制专家行为。此外,引入了基于帧的时间标记技术,通过时间步分割专家行为并赋予与时间上下文对齐的奖励,从而增强任务性能。

链接: https://arxiv.org/abs/2502.02867
作者: Minung Kim,Kawon Lee,Jungmo Kim,Sungho Choi,Seungyul Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages main, 19 pages appendix with reference. Submitted to ICML 2025

点击查看摘要

Abstract:Imitation learning (IL) enables agents to mimic expert behavior without reward signals but faces challenges in cross-domain scenarios with high-dimensional, noisy, and incomplete visual observations. To address this, we propose Domain-Invariant Per-Frame Feature Extraction for Imitation Learning (DIFF-IL), a novel IL method that extracts domain-invariant features from individual frames and adapts them into sequences to isolate and replicate expert behaviors. We also introduce a frame-wise time labeling technique to segment expert behaviors by timesteps and assign rewards aligned with temporal contexts, enhancing task performance. Experiments across diverse visual environments demonstrate the effectiveness of DIFF-IL in addressing complex visual tasks.
zh

[CV-41] RS-YOLOX: A High Precision Detector for Object Detection in Satellite Remote Sensing Images

【速读】:该论文旨在解决卫星遥感图像自动检测中存在的问题。关键解决方案在于提出了一种改进的YOLOX模型(RS-YOLOX),通过在YOLOX的主干网络中引入Efficient Channel Attention (ECA),结合Adaptively Spatial Feature Fusion (ASFF)到YOLOX的颈部网络,并采用Varifocal Loss函数平衡正负样本数量,最终结合Slicing Aided Hyper Inference (SAHI)框架以获得高性能的遥感目标检测器。
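
ECA(Efficient Channel Attention)是该模型骨干中引入的通道注意力组件,其通用实现非常简短;以下为社区常见的 PyTorch 写法(此处核大小固定取 3,仅作示意)。

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention:GAP 后用 1D 卷积做跨通道局部交互,再以 Sigmoid 加权。"""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                        # x: (B, C, H, W)
        y = self.pool(x)                         # (B, C, 1, 1) 全局平均池化
        y = y.squeeze(-1).transpose(1, 2)        # (B, 1, C)
        y = self.conv(y)                         # 跨通道局部交互
        y = self.sigmoid(y).transpose(1, 2).unsqueeze(-1)   # (B, C, 1, 1)
        return x * y                             # 通道加权

x = torch.randn(2, 256, 20, 20)
print(ECA()(x).shape)  # torch.Size([2, 256, 20, 20])
```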

链接: https://arxiv.org/abs/2502.02850
作者: Lei Yang,Guowu Yuan,Hao Zhou,Hongyu Liu,Jian Chen,Hao Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Automatic object detection by satellite remote sensing images is of great significance for resource exploration and natural disaster assessment. To solve existing problems in remote sensing image detection, this article proposes an improved YOLOX model for satellite remote sensing image automatic detection. This model is named RS-YOLOX. To strengthen the feature learning ability of the network, we used Efficient Channel Attention (ECA) in the backbone network of YOLOX and combined the Adaptively Spatial Feature Fusion (ASFF) with the neck network of YOLOX. To balance the numbers of positive and negative samples in training, we used the Varifocal Loss function. Finally, to obtain a high-performance remote sensing object detector, we combined the trained model with an open-source framework called Slicing Aided Hyper Inference (SAHI). This work evaluated models on three aerial remote sensing datasets (DOTA-v1.5, TGRS-HRRSD, and RSOD). Our comparative experiments demonstrate that our model has the highest accuracy in detecting objects in remote sensing image datasets.
zh

[CV-42] A Survey of Sample-Efficient Deep Learning for Change Detection in Remote Sensing: Tasks Strategies and Challenges

【速读】:该论文旨在解决在实际应用中变化检测(Change Detection, CD)方法受限的问题,主要由于输入数据多样性和应用场景的复杂性。论文特别关注时间序列遥感图像(Remote Sensing Images, RSI)的变化检测需求,并且强调了样本量不足对深度神经网络(Deep Neural Network, DNN)训练的挑战。为了解决这些挑战,论文提出利用图像生成、自监督学习(self-supervision)以及视觉基础模型(Visual Foundation Models, VFM)等最新进展来缓解深度学习(Deep Learning, DL)方法对大量数据的需求。关键解决方案在于开发适用于不同应用情景和样本限制条件下的具体CD方法,从而促进这些方法在更广泛的应用场景中的发展和部署。

链接: https://arxiv.org/abs/2502.02835
作者: Lei Ding,Danfeng Hong,Maofan Zhao,Hongruixuan Chen,Chenyu Li,Jie Deng,Naoto Yokoya,Lorenzo Bruzzone,Jocelyn Chanussot
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院), Beijing; Information Engineering University (信息工程大学), Zhengzhou, China; Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院), Beijing, 100094, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子电气与通信工程学院), Beijing, 100049, China; Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院), 100094 Beijing, China; School of Mathematics and Statistics, Southeast University (东南大学数学与统计学院), 211189 Nanjing, China; Graduate School of Frontier Sciences, The University of Tokyo (东京大学前沿科学研究生院), Chiba 277-8561, Japan; RIKEN Center for Advanced Intelligence Project (理化学研究所先进智能项目中心), Tokyo 103-0027, Japan; Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, The University of Tokyo (东京大学前沿科学研究生院复杂科学与工程系), Chiba 277-8561, Japan; Department of Information Engineering and Computer Science, University of Trento (特伦托大学信息工程与计算机科学系), 38123 Trento, Italy; Univ. Grenoble Alpes (格勒诺布尔阿尔卑斯大学), Inria (法国国家信息与自动化研究所), CNRS (法国国家科学研究中心), Grenoble INP (格勒诺布尔国立理工学院), LJK (劳埃德·卡德维尔实验室), Grenoble, 38000, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE GRSM

点击查看摘要

Abstract:In the last decade, the rapid development of deep learning (DL) has made it possible to perform automatic, accurate, and robust Change Detection (CD) on large volumes of Remote Sensing Images (RSIs). However, despite advances in CD methods, their practical application in real-world contexts remains limited due to the diverse input data and the applicational context. For example, the collected RSIs can be time-series observations, and more informative results are required to indicate the time of change or the specific change category. Moreover, training a Deep Neural Network (DNN) requires a massive amount of training samples, whereas in many cases these samples are difficult to collect. To address these challenges, various specific CD methods have been developed considering different application scenarios and training resources. Additionally, recent advancements in image generation, self-supervision, and visual foundation models (VFMs) have opened up new approaches to address the ‘data-hungry’ issue of DL-based CD. The development of these methods in broader application scenarios requires further investigation and discussion. Therefore, this article summarizes the literature methods for different CD tasks and the available strategies and techniques to train and deploy DL-based CD methods in sample-limited scenarios. We expect that this survey can provide new insights and inspiration for researchers in this field to develop more effective CD methods that can be applied in a wider range of contexts.
zh

[CV-43] AIoT-based smart traffic management system

【速读】:该论文旨在解决城市环境中交通流量优化和拥堵减少的问题。解决方案的关键在于利用基于AI的智能交通管理系统,通过分析现有CCTV摄像头的实时视频流,实现车辆计数和交通密度评估,进而进行自适应信号控制,优先处理高流量方向。这种实时适应性确保了更顺畅的交通流,减少了拥堵,并缩短了司机的等待时间。
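
自适应信号控制的最简形式是按各方向车流计数比例分配绿灯时长;下面给出一个纯 Python 示意函数(周期长度、最小绿灯等参数均为编辑假设),其中车辆计数假定由视频检测模型实时提供。

```python
def allocate_green_times(vehicle_counts, cycle=120, min_green=10):
    """按各方向车辆计数比例分配一个信号周期内的绿灯秒数(保证最小绿灯时长)。
    vehicle_counts: {方向: 计数},计数可由视频检测模型实时给出。"""
    spare = cycle - len(vehicle_counts) * min_green   # 扣除最小绿灯后的可分配时长
    total = sum(vehicle_counts.values()) or 1         # 防止除零
    return {d: min_green + round(spare * c / total)
            for d, c in vehicle_counts.items()}

counts = {"north": 24, "south": 8, "east": 30, "west": 4}   # 假设的实时计数
print(allocate_green_times(counts))
# {'north': 39, 'south': 20, 'east': 46, 'west': 15},总和约为一个 120 秒周期
```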

链接: https://arxiv.org/abs/2502.02821
作者: Ahmed Mahmoud Elbasha,Mohammad M. Abdellatif
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a novel AI-based smart traffic management system designed to optimize traffic flow and reduce congestion in urban environments. By analysing live footage from existing CCTV cameras, this approach eliminates the need for additional hardware, thereby minimizing both deployment costs and ongoing maintenance expenses. The AI model processes live video feeds to accurately count vehicles and assess traffic density, allowing for adaptive signal control that prioritizes directions with higher traffic volumes. This real-time adaptability ensures smoother traffic flow, reduces congestion, and minimizes waiting times for drivers. Additionally, the proposed system is simulated using PyGame to evaluate its performance under various traffic conditions. The simulation results demonstrate that the AI-based system outperforms traditional static traffic light systems by 34%, leading to significant improvements in traffic flow efficiency. The use of AI to optimize traffic signals can play a crucial role in addressing urban traffic challenges, offering a cost-effective, scalable, and efficient solution for modern cities. This innovative system represents a key advancement in the field of smart city infrastructure and intelligent transportation systems.
zh

[CV-44] A Decade of Action Quality Assessment: Largest Systematic Survey of Trends Challenges and Future Directions

【速读】:该论文旨在解决动作质量评估(Action Quality Assessment, AQA)领域缺乏全面综述整合的问题。尽管在AQA方法、数据集和应用方面已取得显著进展,这一快速发展的领域仍亟需系统性的梳理。论文的关键在于采用系统综述与荟萃分析的首选报告条目(Preferred Reporting Items for Systematic Reviews and Meta-Analyses, PRISMA)框架,系统性地回顾超过200篇研究论文,涵盖基础概念与定义、通用框架、性能指标以及方法和数据集的最新进展。通过这项工作,论文提供了详细的研究趋势分析、性能比较、挑战及未来方向,旨在为新入行和有经验的研究人员提供有价值的资源,促进AQA领域的进一步探索和发展。

链接: https://arxiv.org/abs/2502.02817
作者: Hao Yin,Paritosh Parmar,Daoliang Xu,Yang Zhang,Tianyou Zheng,Weiwei Fu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 36 Pages, 20 Figures, 12 Tables

点击查看摘要

Abstract:Action Quality Assessment (AQA) – the ability to quantify the quality of human motion, actions, or skill levels and provide feedback – has far-reaching implications in areas such as low-cost physiotherapy, sports training, and workforce development. As such, it has become a critical field in computer vision and video understanding over the past decade. Significant progress has been made in AQA methodologies, datasets, and applications, yet a pressing need remains for a comprehensive synthesis of this rapidly evolving field. In this paper, we present a thorough survey of the AQA landscape, systematically reviewing over 200 research papers using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) framework. We begin by covering foundational concepts and definitions, then move to general frameworks and performance metrics, and finally discuss the latest advances in methodologies and datasets. This survey provides a detailed analysis of research trends, performance comparisons, challenges, and future directions. Through this work, we aim to offer a valuable resource for both newcomers and experienced researchers, promoting further exploration and progress in AQA. Data are available at this https URL
zh

[CV-45] 3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography

【速读】:该论文旨在解决在头颅计算机断层扫描(Head Computed Tomography, CT)影像中,由于高质量标签和注释的稀缺,尤其是在少见疾病的情况下,导致开发强大深度学习模型的困难。为了解决这一挑战,论文提出了一种名为FM-CT的自监督基础模型,用于通用疾病的检测。关键在于通过大规模、多样化的无对比剂3D头颅CT扫描数据集进行预训练,而无需人工标注,从而让模型学习到鲁棒且可泛化的特征。该方法结合了自我蒸馏和掩模图像建模,并采用三维而非二维切片层面的方法,以更全面和高效地利用头颅CT扫描的结构。

链接: https://arxiv.org/abs/2502.02779
作者: Weicheng Zhu,Haoxu Huang,Huanze Tang,Rushabh Musthyala,Boyang Yu,Long Chen,Emilio Vega,Thomas O’Donnell,Seena Dehkharghani,Jennifer A. Frontera,Arjun V. Masurkar,Kara Melmed,Narges Razavian
机构: New York University, Center for Data Science (纽约大学,数据科学中心); New York University, Courant Institute of Mathematical Sciences (纽约大学,库朗数学科学研究所); NYU Grossman School of Medicine, Department of Radiology (纽约大学格罗斯曼医学院,放射学系); NYU Grossman School of Medicine, Department of Neurology (纽约大学格罗斯曼医学院,神经病学系); NYU Grossman School of Medicine, Department of Neuroscience and Physiology (纽约大学格罗斯曼医学院,神经科学与生理学系); NYU Grossman School of Medicine, Neuroscience Institute (纽约大学格罗斯曼医学院,神经科学研究所); Siemens Healthineers (西门子医疗)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Review Preprint

点击查看摘要

Abstract:Head computed tomography (CT) imaging is a widely-used imaging modality with multitudes of medical indications, particularly in assessing pathology of the brain, skull, and cerebrovascular system. It is commonly the first-line imaging in neurologic emergencies given its rapidity of image acquisition, safety, cost, and ubiquity. Deep learning models may facilitate detection of a wide range of diseases. However, the scarcity of high-quality labels and annotations, particularly among less common conditions, significantly hinders the development of powerful models. To address this challenge, we introduce FM-CT: a Foundation Model for Head CT for generalizable disease detection, trained using self-supervised learning. Our approach pre-trains a deep learning model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans without the need for manual annotations, enabling the model to learn robust, generalizable features. To investigate the potential of self-supervised learning in head CT, we employed both discrimination with self-distillation and masked image modeling, and we construct our model in 3D rather than at the slice level (2D) to exploit the structure of head CT scans more comprehensively and efficiently. The model’s downstream classification performance is evaluated using internal and three external datasets, encompassing both in-distribution (ID) and out-of-distribution (OOD) data. Our results demonstrate that the self-supervised foundation model significantly improves performance on downstream diagnostic tasks compared to models trained from scratch and previous 3D CT foundation models on scarce annotated datasets. This work highlights the effectiveness of self-supervised learning in medical imaging and sets a new benchmark for head CT image analysis in 3D, enabling broader use of artificial intelligence for head CT-based diagnosis.
zh

[CV-46] SD: Enhancing Standard Definition Maps by Incorporating Road Knowledge using LLM s

【速读】:该论文旨在解决标准定义(Standard Definition, SD)地图信息精度较低的问题,通过结合道路手册中的位置相关道路信息来增强SD地图。解决方案的关键在于开发了一个端到端的管道SD++,利用大型语言模型(LLMs)从道路手册中提取信息,并将其整合到SD地图中,以提升其精度和信息丰富度。

链接: https://arxiv.org/abs/2502.02773
作者: Hitvarth Diwanji,Jing-Yan Liao,Akshar Tumu,Henrik I. Christensen,Marcell Vazquez-Chanlatte,Chikao Tsuchiya
机构: Contextual Robotics Institute, University of California San Diego (加州大学圣地亚哥分校上下文机器人研究所); Dept. of Comp. Sci. and Eng., UC San Diego (加州大学圣地亚哥分校计算机科学与工程系); Nissan North America (日产北美)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-definition maps (HD maps) are detailed and informative maps capturing lane centerlines and road elements. Although very useful for autonomous driving, HD maps are costly to build and maintain. Furthermore, access to these high-quality maps is usually limited to the firms that build them. On the other hand, standard definition (SD) maps provide road centerlines with an accuracy of a few meters. In this paper, we explore the possibility of enhancing SD maps by incorporating information from road manuals using LLMs. We develop SD++, an end-to-end pipeline to enhance SD maps with location-dependent road information obtained from a road manual. We suggest and compare several ways of using LLMs for such a task. Furthermore, we show the generalization ability of SD++ by showing results from both California and Japan.
zh

[CV-47] Rethinking Vision Transformer for Object Centric Foundation Models

【速读】:该论文旨在解决高分辨率视觉场景中小对象分割的数据效率和计算复杂性问题。论文的关键解决方案是引入了一种名为离网视网膜状输入补丁(FLIP, Off-grid Fovea-Like Input Patching)的方法,该方法从一开始就以对象为中心的方式选择和编码图像输入,并将位置编码与对象中心感知代码分离。这种方法在标准基准测试中展示了其优越性,特别是在数据效率和计算资源消耗方面优于Segment Anything Model (SAM)和FastSAM。
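
“从一开始就以对象为中心选择输入”的一个直观做法是:围绕注视点(如目标中心)按密度随距离衰减的方式离网采样补丁中心,近处密、远处疏;下面的 numpy 代码体现这一思想(采样分布与参数均为编辑假设,并非 FLIP 的实际策略)。

```python
import numpy as np

def foveal_patch_centers(fixation, n_patches=64, sigma=40.0,
                         img_size=(480, 640), rng=None):
    """围绕注视点按半高斯半径 + 均匀角度采样离网补丁中心:近处密、远处疏。
    fixation: (y, x) 注视点;返回 (n_patches, 2) 的补丁中心坐标。"""
    if rng is None:
        rng = np.random.default_rng()
    r = np.abs(rng.normal(0.0, sigma, n_patches))       # 半径:半高斯分布
    theta = rng.uniform(0.0, 2.0 * np.pi, n_patches)    # 角度:均匀分布
    ys = np.clip(fixation[0] + r * np.sin(theta), 0, img_size[0] - 1)
    xs = np.clip(fixation[1] + r * np.cos(theta), 0, img_size[1] - 1)
    return np.stack([ys, xs], axis=1)

centers = foveal_patch_centers(fixation=(240, 320), rng=np.random.default_rng(0))
print(centers.shape, centers[:3].round(1))   # (64, 2) 以及前三个补丁中心
```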

链接: https://arxiv.org/abs/2502.02763
作者: Manuel Traub,Martin V. Butz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent state-of-the-art object segmentation mechanisms, such as the Segment Anything Model (SAM) and FastSAM, first encode the full image over several layers and then focus on generating the mask for one particular object or area. We present an off-grid Fovea-Like Input Patching (FLIP) approach, which selects image input and encodes it from the beginning in an object-focused manner. While doing so, it separates locational encoding from an object-centric perceptual code. FLIP is more data-efficient and yields improved segmentation performance when masking relatively small objects in high-resolution visual scenes. On standard benchmarks such as Hypersim, KITTI-360, and OpenImages, FLIP achieves Intersection over Union (IoU) scores that approach the performance of SAM with much less compute effort. It surpasses FastSAM in all IoU measurements. We also introduce an additional semi-natural but highly intuitive dataset where FLIP outperforms SAM and FastSAM overall and particularly on relatively small objects. Seeing that FLIP is an end-to-end object-centric segmentation approach, it has high potential particularly for applications that benefit from computationally efficient, spatially highly selective object tracking.
zh

[CV-48] Federated Low-Rank Tensor Estimation for Multimodal Image Reconstruction

【速读】:该论文旨在解决高维数据挑战下的图像重建问题,特别是在噪声或欠采样条件下。解决方案的关键在于提出了一种基于联邦学习(Federated Learning, FL)的图像重建方法,该方法利用Tucker分解,并结合联合因子化和随机化简图技术来处理大规模多模态数据。这种方法避免了重建全尺寸张量,并支持异构秩,允许客户端根据先验知识或通信能力选择个性化的分解秩。实验结果表明,该方法在重建质量和通信压缩方面优于现有方法,展示了其在FL环境下多模态逆问题中的潜力。
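
Tucker 分解把张量表示为核心张量与各模态因子矩阵之积;下面用 numpy 实现经典的截断 HOSVD 作为单机示意(联邦环境下的联合因子分解、随机化 sketching 与异构秩选择不在此示意范围内)。

```python
import numpy as np

def unfold(T, mode):
    """张量沿指定模态展开为矩阵。"""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """截断 HOSVD:返回核心张量与各模态因子矩阵(Tucker 分解的经典解法之一)。"""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])                    # 每个模态保留前 r 个左奇异向量
    core = T
    for mode, U in enumerate(factors):              # core = T ×_1 U1ᵀ ×_2 U2ᵀ ...
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    T = core
    for mode, U in enumerate(factors):
        T = np.moveaxis(np.tensordot(U, T, axes=(1, mode)), 0, mode)
    return T

X = np.random.default_rng(0).normal(size=(20, 30, 40))
core, factors = hosvd(X, ranks=(5, 5, 5))
print(core.shape, [U.shape for U in factors])
err = np.linalg.norm(X - reconstruct(core, factors)) / np.linalg.norm(X)
print(f"相对重建误差: {err:.3f}")   # 随机张量的低秩近似误差较大,真实低秩数据会小得多
```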

链接: https://arxiv.org/abs/2502.02761
作者: Anh Van Nguyen,Diego Klabjan,Minseok Ryu,Kibaek Kim,Zichao Di
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Low-rank tensor estimation offers a powerful approach to addressing high-dimensional data challenges and can substantially improve solutions to ill-posed inverse problems, such as image reconstruction under noisy or undersampled conditions. Meanwhile, tensor decomposition has gained prominence in federated learning (FL) due to its effectiveness in exploiting latent space structure and its capacity to enhance communication efficiency. In this paper, we present a federated image reconstruction method that applies Tucker decomposition, incorporating joint factorization and randomized sketching to manage large-scale, multimodal data. Our approach avoids reconstructing full-size tensors and supports heterogeneous ranks, allowing clients to select personalized decomposition ranks based on prior knowledge or communication capacity. Numerical results demonstrate that our method achieves superior reconstruction quality and communication compression compared to existing approaches, thereby highlighting its potential for multimodal inverse problems in the FL setting.
zh

[CV-49] RFMedSAM 2: Automatic Prompt Refinement for Enhanced Volumetric Medical Image Segmentation with SAM 2

【速读】:该论文旨在解决Segment Anything Model 2 (SAM 2) 在医学图像分割任务中的性能限制及其对精确提示的依赖问题。论文的关键解决方案包括使用定制的微调适配器(custom fine-tuning adapters)提升其性能至Dice Similarity Coefficient (DSC) 92.30%,以及引入一个UNet模型来自动生成预测掩膜和边界框作为输入,以减少对精确提示的依赖,并通过双阶段的后处理进一步增强分割效果。这些方法使得该模型在AMOS2022数据集上达到当前最优性能,Dice评分提升了2.9%,并在BTCV数据集上超越了nnUNet 6.4%。

链接: https://arxiv.org/abs/2502.02741
作者: Bin Xie,Hao Tang,Yan Yan,Gady Agam
机构: Department of Computer Science, Illinois Institute of Technology(伊利诺伊理工学院), USA; School of Computer Science, Peking University(北京大学), China; Department of Computer Science, University of Illinois Chicago(芝加哥大学伊利诺伊分校), USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segment Anything Model 2 (SAM 2), a prompt-driven foundation model extending SAM to both image and video domains, has shown superior zero-shot performance compared to its predecessor. Building on SAM’s success in medical image segmentation, SAM 2 presents significant potential for further advancement. However, similar to SAM, SAM 2 is limited by its output of binary masks, inability to infer semantic labels, and dependence on precise prompts for the target object area. Additionally, direct application of SAM and SAM 2 to medical image segmentation tasks yields suboptimal results. In this paper, we explore the upper performance limit of SAM 2 using custom fine-tuning adapters, achieving a Dice Similarity Coefficient (DSC) of 92.30% on the BTCV dataset, surpassing the state-of-the-art nnUNet by 12%. Following this, we address the prompt dependency by investigating various prompt generators. We introduce a UNet to autonomously generate predicted masks and bounding boxes, which serve as input to SAM 2. Subsequent dual-stage refinements by SAM 2 further enhance performance. Extensive experiments show that our method achieves state-of-the-art results on the AMOS2022 dataset, with a Dice improvement of 2.9% compared to nnUNet, and outperforms nnUNet by 6.4% on the BTCV dataset.
zh

[CV-50] Multiple Instance Learning with Coarse-to-Fine Self-Distillation

【速读】:该论文旨在解决全切片图像(Whole Slide Image, WSI)分析中多实例学习(Multiple Instance Learning, MIL)通常忽视实例级监督的问题,即监督信号仅在袋级别(bag level)提供。论文的关键解决方案在于提出PathMIL框架,通过两个视角改进MIL:(1) 引入实例级监督,(2) 在袋级别学习实例间的上下文信息。具体而言,PathMIL采用粗到细自蒸馏(Coarse-to-Fine Self-Distillation, CFSD)范式来探查并蒸馏以袋级别信息训练的分类器,从而获得实例级标签,提供更精细的监督;同时,引入二维位置编码(Two-Dimensional Positional Encoding, 2DPE)来捕捉WSI中的实例间空间上下文信息。这些方法使得PathMIL在各类基准任务上实现了当前最优性能。
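
2DPE 的作用是把实例(patch)在 WSI 中的 (行, 列) 位置注入其特征;下面给出一种常见的二维正弦位置编码示意(前半维编码行、后半维编码列;具体形式为编辑假设,未必与论文一致)。

```python
import numpy as np

def sinusoidal_2d_pe(coords, dim=128, max_pos=10000.0):
    """coords: (N, 2) 的实例 (row, col) 坐标;返回 (N, dim) 的二维正弦位置编码。
    前一半维度编码行坐标,后一半编码列坐标。"""
    assert dim % 4 == 0
    half = dim // 2
    freq = 1.0 / (max_pos ** (np.arange(0, half, 2) / half))   # (half/2,) 频率
    pe = np.zeros((coords.shape[0], dim))
    for axis, offset in ((0, 0), (1, half)):                   # 行用前半,列用后半
        angles = coords[:, axis:axis + 1] * freq               # (N, half/2)
        pe[:, offset:offset + half:2] = np.sin(angles)
        pe[:, offset + 1:offset + half:2] = np.cos(angles)
    return pe

# 假设一个包(bag)中 6 个 patch 的网格坐标
coords = np.array([[0, 0], [0, 1], [1, 0], [2, 3], [5, 5], [9, 2]], dtype=float)
pe = sinusoidal_2d_pe(coords, dim=128)
print(pe.shape)            # (6, 128),可直接加到对应实例特征上
```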

链接: https://arxiv.org/abs/2502.02707
作者: Shuyang Wu,Yifu Qiu,Ines P. Nearchou,Sandrine Prost,Jonathan A. Fallowfield,Hakan Bilen,Timothy J. Kendall
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multiple Instance Learning (MIL) for whole slide image (WSI) analysis in computational pathology often neglects instance-level learning as supervision is typically provided only at the bag level. In this work, we present PathMIL, a framework designed to improve MIL through two perspectives: (1) employing instance-level supervision and (2) learning inter-instance contextual information on bag level. Firstly, we propose a novel Coarse-to-Fine Self-Distillation (CFSD) paradigm, to probe and distil a classifier trained with bag-level information to obtain instance-level labels which could effectively provide the supervision for the same classifier in a finer way. Secondly, to capture inter-instance contextual information in WSI, we propose Two-Dimensional Positional Encoding (2DPE), which encodes the spatial appearance of instances within a bag. We also theoretically and empirically prove the instance-level learnability of CFSD. PathMIL is evaluated on multiple benchmarking tasks, including subtype classification (TCGA-NSCLC), tumour classification (CAMELYON16), and an internal benchmark for breast cancer receptor status classification. Our method achieves state-of-the-art performance, with AUC scores of 0.9152 and 0.8524 for estrogen and progesterone receptor status classification, respectively, an AUC of 0.9618 for subtype classification, and 0.8634 for tumour classification, surpassing existing methods.
zh

[CV-51] Intelligent Sensing-to-Action for Robust Autonomy at the Edge: Opportunities and Challenges

【速读】:该论文旨在解决在动态环境中实时决策所面临的挑战,特别是在机器人、智慧城市及自动驾驶车辆中的自主边缘计算问题。论文的关键在于提出通过主动、上下文感知的传感至行动(Sensing-to-Action)和行动至传感(Action-to-Sensing)的自适应调整机制,以动态方式根据任务需求调节传感与计算资源。这种方法能够提高效率,同时保持系统的可靠性和鲁棒性。此外,多智能体传感-行动循环通过分布式智能体间的协同工作进一步扩展这些能力,优化资源利用。论文还强调了端到端协同设计策略的重要性,确保算法模型与硬件及环境动力学相匹配,从而提升跨层互依赖性,增强吞吐量、精度和适应性,实现复杂环境下的能源高效边缘自治。

链接: https://arxiv.org/abs/2502.02692
作者: Amit Ranjan Trivedi,Sina Tayebati,Hemant Kumawat,Nastaran Darabi,Divake Kumar,Adarsh Kumar Kosta,Yeshwanth Venkatesha,Dinithi Jayasuriya,Nethmi Jayasinghe,Priyadarshini Panda,Saibal Mukhopadhyay,Kaushik Roy
机构: University of Illinois at Chicago(芝加哥伊利诺伊大学); Georgia Institute of Technology(乔治亚理工学院); Yale University(耶鲁大学); Purdue University(普渡大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autonomous edge computing in robotics, smart cities, and autonomous vehicles relies on the seamless integration of sensing, processing, and actuation for real-time decision-making in dynamic environments. At its core is the sensing-to-action loop, which iteratively aligns sensor inputs with computational models to drive adaptive control strategies. These loops can adapt to hyper-local conditions, enhancing resource efficiency and responsiveness, but also face challenges such as resource constraints, synchronization delays in multi-modal data fusion, and the risk of cascading errors in feedback loops. This article explores how proactive, context-aware sensing-to-action and action-to-sensing adaptations can enhance efficiency by dynamically adjusting sensing and computation based on task demands, such as sensing a very limited part of the environment and predicting the rest. By guiding sensing through control actions, action-to-sensing pathways can improve task relevance and resource use, but they also require robust monitoring to prevent cascading errors and maintain reliability. Multi-agent sensing-action loops further extend these capabilities through coordinated sensing and actions across distributed agents, optimizing resource use via collaboration. Additionally, neuromorphic computing, inspired by biological systems, provides an efficient framework for spike-based, event-driven processing that conserves energy, reduces latency, and supports hierarchical control–making it ideal for multi-agent optimization. This article highlights the importance of end-to-end co-design strategies that align algorithmic models with hardware and environmental dynamics and improve cross-layer interdependencies to improve throughput, precision, and adaptability for energy-efficient edge autonomy in complex environments.
zh

[CV-52] Controllable Video Generation with Provable Disentanglement

【速读】:该论文旨在解决可控视频生成中的挑战,特别是现有方法在处理视频整体时忽视了复杂的细粒度时空关系,从而限制了控制精度和效率。论文的关键解决方案是提出Controllable Video Generative Adversarial Networks (CoVoGAN),通过解耦视频概念,实现对个体概念的高效独立控制。具体而言,论文首先遵循最小变化原则解耦静态和动态潜在变量,并利用充分变化特性实现动态潜在变量的组件可识别性,从而实现对运动和身份的独立控制。

链接: https://arxiv.org/abs/2502.02690
作者: Yifan Shen,Peiyuan Zhu,Zijian Li,Shaoan Xie,Zeyu Tang,Namrata Deka,Zongfang Liu,Guangyi Chen,Kun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
zh

[CV-53] Blind Visible Watermark Removal with Morphological Dilation

【速读】:该论文旨在解决可见水印对图像恢复技术带来的挑战,特别是在目标背景未知的情况下。论文提出了一种名为MorphoMod的新方法,用于在盲设(无需目标图像)条件下自动移除可见水印。解决方案的关键在于MorphoMod能够有效移除不透明和透明水印,同时保持语义内容的完整性,使其适用于实际应用,并在多个基准数据集上实现了比现有最先进方法高达50.8%的水印移除效果提升。
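
从题名看,“形态学膨胀”对应的通用流程大致是:先取得水印掩膜,再膨胀掩膜以覆盖半透明残边,最后在掩膜区域内做修复;下面用 OpenCV 的经典算子给出该流程的示意(掩膜来源与修复算法均为编辑假设,论文实际使用基于提示的修复模型)。

```python
import cv2
import numpy as np

def remove_watermark(image, mask, dilate_iter=2, kernel_size=5):
    """image: BGR 图像;mask: 0/255 的水印掩膜。
    先形态学膨胀掩膜以覆盖水印残边,再用 Telea 修复算法填充掩膜区域。"""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    dilated = cv2.dilate(mask, kernel, iterations=dilate_iter)  # 膨胀:覆盖半透明边缘
    return cv2.inpaint(image, dilated, inpaintRadius=5, flags=cv2.INPAINT_TELEA)

# 构造一个带白色"水印"方块的演示图像及其掩膜
img = np.full((120, 160, 3), 80, np.uint8)
cv2.rectangle(img, (60, 40), (100, 70), (255, 255, 255), -1)
mask = np.zeros((120, 160), np.uint8)
cv2.rectangle(mask, (62, 42), (98, 68), 255, -1)       # 掩膜故意略小,靠膨胀补齐
out = remove_watermark(img, mask)
print(out.shape, int(out[55, 80, 0]))                   # 修复后应接近背景灰度 80
```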

链接: https://arxiv.org/abs/2502.02676
作者: Preston K. Robinette,Taylor T. Johnson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Visible watermarks pose significant challenges for image restoration techniques, especially when the target background is unknown. Toward this end, we present MorphoMod, a novel method for automated visible watermark removal that operates in a blind setting – without requiring target images. Unlike existing methods, MorphoMod effectively removes opaque and transparent watermarks while preserving semantic content, making it well-suited for real-world applications. Evaluations on benchmark datasets, including the Colored Large-scale Watermark Dataset (CLWD), LOGO-series, and the newly introduced Alpha1 datasets, demonstrate that MorphoMod achieves up to a 50.8% improvement in watermark removal effectiveness compared to state-of-the-art methods. Ablation studies highlight the impact of prompts used for inpainting, pre-removal filling strategies, and inpainting model performance on watermark removal. Additionally, a case study on steganographic disorientation reveals broader applications for watermark removal in disrupting high-level hidden messages. MorphoMod offers a robust, adaptable solution for watermark removal and opens avenues for further advancements in image restoration and adversarial manipulation.
zh

[CV-54] SiLVR: Scalable Lidar-Visual Radiance Field Reconstruction with Uncertainty Quantification

【速读】:本文旨在解决大规模场景高精度几何重建与真实感纹理捕捉的问题。关键在于采用神经辐射场(NeRF)模型融合激光雷达(LiDAR)和视觉数据,通过引入LiDAR数据增强深度和表面法线的几何约束,并估计辐射场中每个点位置的空间方差以量化重建的不确定性。此外,利用实时定位与建图(SLAM)系统生成的轨迹来引导后处理的从运动恢复结构(SfM)过程,从而显著减少SfM训练时间并确保整体度量尺度的一致性。最终通过光束谱聚类方法将全局一致的轨迹分割成子地图,实现更有效的视觉重建。

链接: https://arxiv.org/abs/2502.02657
作者: Yifu Tao,Maurice Fallon
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: webpage: this https URL

点击查看摘要

Abstract:We present a neural radiance field (NeRF) based large-scale reconstruction system that fuses lidar and vision data to generate high-quality reconstructions that are geometrically accurate and capture photorealistic texture. Our system adopts the state-of-the-art NeRF representation to additionally incorporate lidar. Adding lidar data adds strong geometric constraints on the depth and surface normals, which is particularly useful when modelling uniform texture surfaces which contain ambiguous visual reconstruction cues. Furthermore, we estimate the epistemic uncertainty of the reconstruction as the spatial variance of each point location in the radiance field given the sensor observations from camera and lidar. This enables the identification of areas that are reliably reconstructed by each sensor modality, allowing the map to be filtered according to the estimated uncertainty. Our system can also exploit the trajectory produced by a real-time pose-graph lidar SLAM system during online mapping to bootstrap a (post-processed) Structure-from-Motion (SfM) reconstruction procedure reducing SfM training time by up to 70%. It also helps to properly constrain the overall metric scale which is essential for the lidar depth loss. The globally-consistent trajectory can then be divided into submaps using Spectral Clustering to group sets of co-visible images together. This submapping approach is more suitable for visual reconstruction than distance-based partitioning. Each submap is filtered according to point-wise uncertainty estimates and merged to obtain the final large-scale 3D reconstruction. We demonstrate the reconstruction system using a multi-camera, lidar sensor suite in experiments involving both robot-mounted and handheld scanning. Our test datasets cover a total area of more than 20,000 square metres, including multiple university buildings and an aerial survey of a multi-storey.
zh
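
以下给出一个示意性草图,演示摘要中“用谱聚类把共视图像分组为子地图”这一步骤;其中共视(亲和度)矩阵用随机数构造、子地图数量取 8,均为假设值,并非论文的原始实现。

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# 假设的共视矩阵:visibility[i, j] 表示图像 i 与图像 j 共同观测到的特征数量
rng = np.random.default_rng(0)
n_images = 200
visibility = rng.integers(0, 50, size=(n_images, n_images)).astype(float)
visibility = (visibility + visibility.T) / 2  # 对称化,作为相似度(亲和度)矩阵
np.fill_diagonal(visibility, 0.0)

# 谱聚类:以共视关系为亲和度,把全局一致轨迹上的图像划分为若干子地图
clustering = SpectralClustering(
    n_clusters=8,            # 子地图数量(示意值)
    affinity="precomputed",  # 直接使用预先计算好的亲和度矩阵
    random_state=0,
)
submap_labels = clustering.fit_predict(visibility)
for k in range(8):
    print(f"submap {k}: {np.sum(submap_labels == k)} images")
```

以共视关系而非空间距离作为亲和度,能让同一子地图内的图像相互提供重建约束,这也是摘要称其比基于距离的划分更适合视觉重建的原因。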

[CV-55] Deep Learning-Based Facial Expression Recognition for the Elderly: A Systematic Review

【速读】:该论文旨在解决在老年人群体中应用面部表情识别(Facial Expression Recognition, FER)系统所面临的挑战,包括缺乏专门针对老年人的数据集、类别不平衡以及与年龄相关的面部表情差异。论文的关键解决方案在于强调开发包容各年龄段的数据集、整合多模态解决方案,并采用可解释的人工智能(Explainable Artificial Intelligence, XAI)技术以增强系统的可用性、可靠性和可信度。

链接: https://arxiv.org/abs/2502.02618
作者: F. Xavier Gaya-Morey,Jose M. Buades-Rubio,Philippe Palanque,Raquel Lacuesta,Cristina Manresa-Yee
机构: uib.es (UIB); irit.fr (IRIT); unizar.es (Unizar)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid aging of the global population has highlighted the need for technologies to support the elderly, particularly in healthcare and emotional well-being. Facial expression recognition (FER) systems offer a non-invasive means of monitoring emotional states, with applications in assisted living, mental health support, and personalized care. This study presents a systematic review of deep learning-based FER systems, focusing on their applications for the elderly population. Following a rigorous methodology, we analyzed 31 studies published over the last decade, addressing challenges such as the scarcity of elderly-specific datasets, class imbalances, and the impact of age-related facial expression differences. Our findings show that convolutional neural networks remain dominant in FER, especially lightweight versions for resource-constrained environments. However, existing datasets often lack diversity in age representation, and real-world deployment remains limited. Additionally, privacy concerns and the need for explainable artificial intelligence emerged as key barriers to adoption. This review underscores the importance of developing age-inclusive datasets, integrating multimodal solutions, and adopting XAI techniques to enhance system usability, reliability, and trustworthiness. We conclude by offering recommendations for future research to bridge the gap between academic progress and real-world implementation in elderly care.
zh

[CV-56] Secure Personalized Music-to-Video Generation via CHARCHA NEURIPS2024

【速读】:该论文旨在解决如何通过全自动管道生成个性化音乐视频的问题,使听众不仅能成为消费者,还能成为音乐视频创作过程中的共同创作者。解决方案的关键在于结合多模态翻译和生成技术,并利用低秩适应处理听众图像,以创建与音乐和个人特征相契合且沉浸式的音乐视频。此外,为了确保用户身份的伦理使用,引入了名为CHARCHA(专利申请中)的面部身份验证协议,该协议在收集授权图像用于个性化视频的同时,保护人们免受未经授权使用其面部形象。

链接: https://arxiv.org/abs/2502.02610
作者: Mehul Agarwal,Gauri Agarwal,Santiago Benoit,Andrew Lippman,Jean Oh
机构: Carnegie Mellon University(卡内基梅隆大学); MIT Media Lab(麻省理工学院媒体实验室); LTI, Carnegie Mellon University(卡内基梅隆大学语言技术研究所); RI, Carnegie Mellon University(卡内基梅隆大学机器人研究所)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: NeurIPS 2024 Creative AI Track

点击查看摘要

Abstract:Music is a deeply personal experience and our aim is to enhance this with a fully-automated pipeline for personalized music video generation. Our work allows listeners to not just be consumers but co-creators in the music video generation process by creating personalized, consistent and context-driven visuals based on lyrics, rhythm and emotion in the music. The pipeline combines multimodal translation and generation techniques and utilizes low-rank adaptation on listeners’ images to create immersive music videos that reflect both the music and the individual. To ensure the ethical use of users’ identity, we also introduce CHARCHA (patent pending), a facial identity verification protocol that protects people against unauthorized use of their face while at the same time collecting authorized images from users for personalizing their videos. This paper thus provides a secure and innovative framework for creating deeply personalized music videos.
zh

[CV-57] MIND: Microstructure INverse Design with Generative Hybrid Neural Representation

【速读】:该论文旨在解决微结构逆向设计中精确控制几何形状与材料属性的问题。传统正向设计方法受限于其无法探索广阔的组合设计空间,而逆向设计虽提供一种替代方案,但依然面临几何形状与材料属性之间复杂相互依赖性的挑战。论文的关键解决方案在于提出一种新颖的生成模型,该模型结合潜在扩散(latent diffusion)与Holoplane高级混合神经表示,能够同时编码几何和物理属性,从而确保几何与属性之间的高度一致性。此外,该方法能够生成多样且可平铺的微结构,显著提高性能精度,并增强对几何有效性的控制,超越现有方法的表现。

链接: https://arxiv.org/abs/2502.02607
作者: Tianyang Xue,Haochen Li,Longdu Liu,Paul Henderson,Pengbin Tang,Lin Lu,Jikai Liu,Haisen Zhao,Hao Peng,Bernd Bickel
机构: Shandong University(山东大学); University of Glasgow(格拉斯哥大学); ETH Zurich(瑞士苏黎世联邦理工学院); CrownCAD(冠状CAD)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The inverse design of microstructures plays a pivotal role in optimizing metamaterials with specific, targeted physical properties. While traditional forward design methods are constrained by their inability to explore the vast combinatorial design space, inverse design offers a compelling alternative by directly generating structures that fulfill predefined performance criteria. However, achieving precise control over both geometry and material properties remains a significant challenge due to their intricate interdependence. Existing approaches, which typically rely on voxel or parametric representations, often limit design flexibility and structural diversity. In this work, we present a novel generative model that integrates latent diffusion with Holoplane, an advanced hybrid neural representation that simultaneously encodes both geometric and physical properties. This combination ensures superior alignment between geometry and properties. Our approach generalizes across multiple microstructure classes, enabling the generation of diverse, tileable microstructures with significantly improved property accuracy and enhanced control over geometric validity, surpassing the performance of existing methods. We introduce a multi-class dataset encompassing a variety of geometric morphologies, including truss, shell, tube, and plate structures, to train and validate our model. Experimental results demonstrate the model’s ability to generate microstructures that meet target properties, maintain geometric validity, and integrate seamlessly into complex assemblies. Additionally, we explore the potential of our framework through the generation of new microstructures, cross-class interpolation, and the infilling of heterogeneous microstructures. The dataset and source code will be open-sourced upon publication.
zh

[CV-58] Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans

【速读】:该论文旨在开发和评估一种基于深度学习的方法,实现心肌梗死(Myocardial Infarct)在心脏磁共振(Cardiac Magnetic Resonance, CMR)晚期钆增强(Late Gadolinium Enhancement, LGE)图像中的全自动分割。解决方案的关键在于采用一个级联框架,包括二维和三维卷积神经网络(Convolutional Neural Networks, CNNs),专门用于识别缺血性心肌疤痕。该方法在无需输入图像预处理的情况下,实现了与训练有素的人类观察者相匹配的分割质量,并且在盲测实验中,专家更倾向于自动分割结果。

链接: https://arxiv.org/abs/2502.03272
作者: Matthias Schwab,Mathias Pamminger,Christian Kremser,Agnes Mayr
机构: Medical University of Innsbruck (因斯布鲁克医科大学); University of Innsbruck (因斯布鲁克大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: To develop and evaluate a deep learning-based method that allows myocardial infarct segmentation to be performed in a fully-automated way. Materials and Methods: For this retrospective study, a cascaded framework of two and three-dimensional convolutional neural networks (CNNs), specialized on identifying ischemic myocardial scars on late gadolinium enhancement (LGE) cardiac magnetic resonance (CMR) images, was trained on an in-house training dataset consisting of 144 examinations. On a separate test dataset from the same institution, including images from 152 examinations obtained between 2021 and 2023, a quantitative comparison between artificial intelligence (AI)-based segmentations and manual segmentations was performed. Further, qualitative assessment of segmentation accuracy was evaluated for both human and AI-generated contours by two CMR experts in a blinded experiment. Results: Excellent agreement could be found between manually and automatically calculated infarct volumes ( \rho_c = 0.9). The qualitative evaluation showed that compared to human-based measurements, the experts rated the AI-based segmentations to better represent the actual extent of infarction significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On the contrary, for segmentation of microvascular obstruction (MVO), manual measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal). Conclusion: This fully-automated segmentation pipeline enables CMR infarct size to be calculated in a very short time and without requiring any pre-processing of the input images while matching the segmentation quality of trained human observers. In a blinded experiment, experts preferred automated infarct segmentations more often than manual segmentations, paving the way for a potential clinical application.
zh
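
摘要中的 \rho_c 即 Lin 一致性相关系数(concordance correlation coefficient),用于衡量人工与自动计算的梗死体积之间的一致程度。下面按其标准定义给出一个最小实现,示例数据为虚构:

```python
import numpy as np

def concordance_ccc(x, y):
    """Lin 一致性相关系数 rho_c = 2*cov(x,y) / (var(x) + var(y) + (mean(x)-mean(y))^2)"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # 总体方差
    cov = ((x - mx) * (y - my)).mean()   # 总体协方差
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# 示例:人工分割与 AI 分割得到的梗死体积(单位 mL,数值为虚构)
manual_vol = np.array([12.1, 30.5, 8.4, 22.0, 15.3])
ai_vol = np.array([11.8, 31.2, 8.9, 21.5, 15.0])
print(f"rho_c = {concordance_ccc(manual_vol, ai_vol):.3f}")
```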

[CV-59] Learning Generalizable Features for Tibial Plateau Fracture Segmentation Using Masked Autoencoder and Limited Annotations

【速读】:该论文旨在解决胫骨平台骨折(Tibial Plateau Fracture, TPF)在计算机断层扫描(CT)图像中自动分割所需的大规模标注数据获取困难的问题。解决方案的关键在于提出了一种基于掩码自编码器(Masked Autoencoder, MAE)预训练的高效训练策略。该方法利用MAE从无标注数据中捕捉全局骨骼结构和细微骨折细节,并通过少量标注数据进行微调,从而减少了对大量标注数据的依赖,同时提升了模型学习可泛化和可迁移特征的能力。

链接: https://arxiv.org/abs/2502.02862
作者: Peiyan Yue,Die Cai,Chu Guo,Mengxing Liu,Jun Xia,Yi Wang
机构: Smart Medical Imaging, Learning and Engineering (SMILE) Lab, Medical UltraSound Image Computing (MUSIC) Lab, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University (深圳大学); Department of Radiology, The First Affiliated Hospital of Shenzhen University, Shenzhen University, Shenzhen Second People’s Hospital (深圳第二人民医院); Shenzhen Mindray Bio-Medical Electronics Co., Ltd (深圳迈瑞生物医疗电子股份有限公司); Wuhan Mindray Scientific Co., Ltd (武汉迈瑞科学仪器有限公司)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 6 figures

点击查看摘要

Abstract:Accurate automated segmentation of tibial plateau fractures (TPF) from computed tomography (CT) requires large amounts of annotated data to train deep learning models, but obtaining such annotations presents unique challenges. The process demands expert knowledge to identify diverse fracture patterns, assess severity, and account for individual anatomical variations, making the annotation process highly time-consuming and expensive. Although semi-supervised learning methods can utilize unlabeled data, existing approaches often struggle with the complexity and variability of fracture morphologies, as well as limited generalizability across datasets. To tackle these issues, we propose an effective training strategy based on masked autoencoder (MAE) for the accurate TPF segmentation in CT. Our method leverages MAE pretraining to capture global skeletal structures and fine-grained fracture details from unlabeled data, followed by fine-tuning with a small set of labeled data. This strategy reduces the dependence on extensive annotations while enhancing the model’s ability to learn generalizable and transferable features. The proposed method is evaluated on an in-house dataset containing 180 CT scans with TPF. Experimental results demonstrate that our method consistently outperforms semi-supervised methods, achieving an average Dice similarity coefficient (DSC) of 95.81%, average symmetric surface distance (ASSD) of 1.91mm, and Hausdorff distance (95HD) of 9.42mm with only 20 annotated cases. Moreover, our method exhibits strong transferability when applying to another public pelvic CT dataset with hip fractures, highlighting its potential for broader applications in fracture segmentation tasks.
zh
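
摘要中报告的 DSC(Dice 相似系数)定义为 2|A∩B| / (|A| + |B|)。以下为该指标的最小实现草图,掩码用随机体素代替真实的 CT 分割结果:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice 相似系数:度量两个二值分割掩码的重叠程度。"""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

rng = np.random.default_rng(0)
pred_mask = rng.random((64, 64, 64)) > 0.5    # 预测掩码(随机示例)
gt_mask = rng.random((64, 64, 64)) > 0.5      # 金标准掩码(随机示例)
print(f"DSC = {dice_coefficient(pred_mask, gt_mask):.4f}")
```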

[CV-60] When are Diffusion Priors Helpful in Sparse Reconstruction? A Study with Sparse-view CT

【速读】:该论文旨在探讨扩散模型(Diffusion Models)作为先验知识在图像重建中的效用,特别是在稀疏医学图像重建任务中的表现。论文通过改变观测数量,并与经典的先验方法(如稀疏先验和Tikhonov正则化)进行比较,使用基于像素、结构以及下游任务的指标评估其性能。研究聚焦于低剂量胸部壁计算机断层扫描(CT)中的脂肪质量定量。

论文的关键发现是:当投影数量“充足”时,经典先验优于扩散先验;然而,扩散先验在极少数观测条件下能够捕获大量细节,显著优于经典先验;尽管如此,它们即使在大量观测下也无法捕捉所有细节。最终,扩散先验的性能在极少数(约10-15个)投影后趋于平稳。论文强调了基于扩散的稀疏重建潜在问题,并突显了进一步研究的重要性,尤其是在高风险临床环境中。

链接: https://arxiv.org/abs/2502.02771
作者: Matt Y. Cheung,Sophia Zorek,Tucker J. Netherton,Laurence E. Court,Sadeer Al-Kindi,Ashok Veeraraghavan,Guha Balakrishnan
机构: Rice University (莱斯大学); The University of Texas MD Anderson Cancer Center (德克萨斯大学MD安德森癌症中心); DeBakey Heart and Vascular Center (德贝基心脏和血管中心)
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Applications (stat.AP)
备注: Accepted at IEEE ISBI 2025, 5 pages, 2 figures, 1 table

点击查看摘要

Abstract:Diffusion models demonstrate state-of-the-art performance on image generation, and are gaining traction for sparse medical image reconstruction tasks. However, compared to classical reconstruction algorithms relying on simple analytical priors, diffusion models have the dangerous property of producing realistic looking results even when incorrect, particularly with few observations. We investigate the utility of diffusion models as priors for image reconstruction by varying the number of observations and comparing their performance to classical priors (sparse and Tikhonov regularization) using pixel-based, structural, and downstream metrics. We make comparisons on low-dose chest wall computed tomography (CT) for fat mass quantification. First, we find that classical priors are superior to diffusion priors when the number of projections is "sufficient". Second, we find that diffusion priors can capture a large amount of detail with very few observations, significantly outperforming classical priors. However, they fall short of capturing all details, even with many observations. Finally, we find that the performance of diffusion priors plateaus after extremely few ( \approx 10-15) projections. Ultimately, our work highlights potential issues with diffusion-based sparse reconstruction and underscores the importance of further investigation, particularly in high-stakes clinical settings.
zh
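
作为对比基线之一的 Tikhonov 正则化求解 min_x ||Ax - y||^2 + λ||x||^2,其闭式解为 x = (AᵀA + λI)⁻¹Aᵀy。以下草图用随机矩阵代替真实的 CT 投影算子,仅演示稀疏观测下经典先验的重建流程,所有数值均为示意:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_projections = 256, 40   # 稀疏观测:40 次投影重建 256 维图像
A = rng.standard_normal((n_projections, n_pixels))   # 假想的前向(投影)算子
x_true = rng.standard_normal(n_pixels)
y = A @ x_true + 0.01 * rng.standard_normal(n_projections)

lam = 0.1  # 正则化强度(示意值)
# 闭式解:x = (A^T A + lam I)^{-1} A^T y
x_rec = np.linalg.solve(A.T @ A + lam * np.eye(n_pixels), A.T @ y)
print("relative error:", np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))
```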

[CV-61] Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images

【速读】:该论文旨在解决前列腺癌转移病灶在PET/CT扫描中的自动化检测与分割问题。解决方案的关键在于提出了一种新的损失函数——L1-weighted Dice Focal Loss (L1DFL),该函数通过利用L1范数对体素进行自适应加权,以应对不同分类难度的体素,从而提高病灶检测与分割的准确性。实验结果表明,L1DFL相较于Dice损失函数和Dice Focal Loss,在测试集上的表现至少提升了13%,证明了其优越性。

链接: https://arxiv.org/abs/2502.02756
作者: Obed Korshie Dzikunu,Shadab Ahamed,Amirhossein Toosi,Xiaoxiao Li,Arman Rahmim
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 7 figures, 1 table

点击查看摘要

Abstract:This study proposes a new loss function for deep neural networks, L1-weighted Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of voxels based on their classification difficulty, towards automated detection and segmentation of metastatic prostate cancer lesions in PET/CT scans. We obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with biochemical recurrence metastatic prostate cancer. We trained two 3D convolutional neural networks, Attention U-Net and SegResNet, and concatenated the PET and CT volumes channel-wise as input. The performance of our custom loss function was evaluated against the Dice and Dice Focal Loss functions. For clinical significance, we considered a detected region of interest (ROI) as a true positive if at least the voxel with the maximum standardized uptake value falls within the ROI. We assessed the models’ performance based on the number of lesions in an image, tumour volume, activity, and extent of spread. The L1DFL outperformed the comparative loss functions by at least 13% on the test set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were lower than those of L1DFL by at least 6% and 34%, respectively. The Dice Focal Loss yielded more false positives, whereas the Dice Loss was more sensitive to smaller volumes and struggled to segment larger lesions accurately. They also exhibited network-specific variations and yielded declines in segmentation accuracy with increased tumour spread. Our results demonstrate the potential of L1DFL to yield robust segmentation of metastatic prostate cancer lesions in PSMA PET/CT images. The results further highlight potential complexities arising from the variations in lesion characteristics that may influence automated prostate cancer tumour detection and segmentation. The code is publicly available at: this https URL.
zh
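
摘要只说明 L1DFL“利用 L1 范数按体素分类难度自适应加权”,未给出公式细节。下面是按这一思路写的一个假设性 PyTorch 草图:用 |p - y| 作为体素权重,叠加到 Dice 与 Focal 两项上。函数名与权重的组合方式均为笔者假设,具体形式以论文为准:

```python
import torch

def l1_weighted_dice_focal(pred, target, gamma=2.0, eps=1e-6):
    """示意版 L1 加权 Dice Focal 损失;pred 为 sigmoid 后概率,target 为二值标签。"""
    w = torch.abs(pred - target).detach()           # L1 范数度量体素难度,作自适应权重
    pt = torch.where(target > 0.5, pred, 1 - pred)  # 正确类别的预测概率
    focal = -(w * (1 - pt) ** gamma * torch.log(pt + eps)).mean()
    inter = (w * pred * target).sum()
    denom = (w * (pred + target)).sum()
    dice = 1 - (2 * inter + eps) / (denom + eps)
    return dice + focal

pred = torch.rand(2, 1, 8, 8, 8, requires_grad=True)
target = (torch.rand(2, 1, 8, 8, 8) > 0.9).float()
loss = l1_weighted_dice_focal(pred, target)
loss.backward()
print(loss.item())
```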

[CV-62] Muographic Image Upsampling with Machine Learning for Built Infrastructure Applications

【速读】:该论文旨在解决现有非破坏性评估方法在检测老化关键基础设施(如桥梁)方面存在的不足。关键解决方案在于开发了一种双模型深度学习方法:先用带梯度惩罚的条件Wasserstein生成对抗网络(cWGAN-GP)对欠采样的μ子成像图像进行预测性上采样,显著提升采集速度和图像质量;再用第二个经语义分割训练的cWGAN-GP模型,定量评估上采样对混凝土样本特征(钢筋网格与预应力管道)分割的影响,并有效缓解z平面模糊伪影问题。

链接: https://arxiv.org/abs/2502.02624
作者: William O’Donnell,David Mahon,Guangliang Yang,Simon Gardner
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The civil engineering industry faces a critical need for innovative non-destructive evaluation methods, particularly for ageing critical infrastructure, such as bridges, where current techniques fall short. Muography, a non-invasive imaging technique, constructs three-dimensional density maps by detecting interactions of naturally occurring cosmic-ray muons within the scanned volume. Cosmic-ray muons provide deep penetration and inherent safety due to their high momenta and natural source. However, the technology’s reliance on this source results in constrained muon flux, leading to prolonged acquisition times, noisy reconstructions and image interpretation challenges. To address these limitations, we developed a two-model deep learning approach. First, we employed a conditional Wasserstein generative adversarial network with gradient penalty (cWGAN-GP) to perform predictive upsampling of undersampled muography images. Using the structural similarity index measure (SSIM), 1-day sampled images matched the perceptual qualities of a 21-day image, while the peak signal-to-noise ratio (PSNR) indicated noise improvement equivalent to 31 days of sampling. A second cWGAN-GP model, trained for semantic segmentation, quantitatively assessed the upsampling model’s impact on concrete sample features. This model achieved segmentation of rebar grids and tendon ducts, with Dice-Sørensen accuracy coefficients of 0.8174 and 0.8663. Notably, it could mitigate or remove z-plane smearing artifacts caused by muography’s inverse imaging problem. Both models were trained on a comprehensive Geant4 Monte-Carlo simulation dataset reflecting realistic civil infrastructure scenarios. Our results demonstrate significant improvements in acquisition speed and image quality, marking a substantial step toward making muography more practical for reinforced concrete infrastructure monitoring applications.
zh
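
cWGAN-GP 的核心是梯度惩罚项:在真实样本与生成样本的随机插值点上,约束判别器(critic)梯度的 L2 范数趋近 1。以下为该惩罚项的标准实现草图,critic 仅为占位网络,与论文的具体结构无关:

```python
import torch

def gradient_penalty(critic, real, fake):
    """WGAN-GP 梯度惩罚:E[(||grad_x critic(x_interp)||_2 - 1)^2]。"""
    b = real.size(0)
    alpha = torch.rand(b, 1, 1, 1)                       # 每个样本一个插值系数
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0].view(b, -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 64, 1))
real, fake = torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64)
print(gradient_penalty(critic, real, fake).item())
```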

人工智能

[AI-0] A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)

链接: https://arxiv.org/abs/2502.03450
作者: Yiye Chen,Harpreet Sawhney,Nicholas Gydé,Yanan Jian,Jack Saunders,Patricio Vela,Ben Lundell
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Scene graphs have emerged as a structured and serializable environment representation for grounded spatial reasoning with Large Language Models (LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason framework for reasoning and planning with scene graphs. Our approach employs two cooperative, code-writing LLM agents: a (1) Reasoner for task planning and information query generation, and a (2) Retriever for extracting corresponding graph information following the queries. Two agents collaborate iteratively, enabling sequential reasoning and adaptive attention to graph information. Unlike prior works, both agents are prompted only with the scene graph schema rather than the full graph data, which reduces hallucination by limiting input tokens, and drives the Reasoner to generate reasoning traces. Following the trace, the Retriever programmatically queries the scene graph data based on the schema understanding, allowing dynamic and global attention on the graph that enhances alignment between reasoning and retrieval. Through experiments in multiple simulation environments, we show that our framework surpasses existing LLM-based approaches in numerical Q&A and planning tasks, and can benefit from task-level few-shot examples, even in the absence of agent-level demonstrations. Project code will be released.
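
下面用占位接口勾勒摘要所述的双智能体循环:Reasoner 只看到场景图 schema,生成推理与查询;Retriever 依查询从完整图数据中取回信息再回填。call_llm、execute_query 以及 schema 的写法均为笔者假设的示意,并非论文代码:

```python
def call_llm(prompt: str) -> str:
    """占位:实际应调用大语言模型,产出推理轨迹与图查询。"""
    return "query: rooms['kitchen'].objects"

def execute_query(query: str, graph: dict):
    """占位:实际实现应解析并执行 Reasoner 生成的查询代码。"""
    return graph["rooms"]["kitchen"]["objects"]

def reason_while_retrieve(schema: str, question: str, graph: dict, max_steps: int = 3) -> str:
    context = ""
    for _ in range(max_steps):
        # Reasoner:只基于 schema、问题与已检索信息生成下一步查询(而非整张图)
        query = call_llm(f"Schema:\n{schema}\nQ: {question}\nContext:{context}")
        # Retriever:按查询从真实场景图数据中抽取信息,回填给 Reasoner
        context += f"\n{query} -> {execute_query(query, graph)}"
    return call_llm(f"Give the final answer.\nContext:{context}")

graph = {"rooms": {"kitchen": {"objects": ["stove", "sink"], "connects_to": ["hallway"]}}}
schema = "rooms: {<name>: {objects: list[str], connects_to: list[str]}}"
print(reason_while_retrieve(schema, "How many objects are in the kitchen?", graph))
```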

[AI-1] BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving

链接: https://arxiv.org/abs/2502.03438
作者: Ran Xin,Chenguang Xi,Jie Yang,Feng Chen,Hang Wu,Xia Xiao,Yifan Sun,Shen Zheng,Kai Shen
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating proof search spaces. While the existing approaches primarily rely on value functions and Monte Carlo Tree Search (MCTS), the potential of simpler methods like Best-First Search (BFS) remains underexplored. This paper investigates whether BFS can achieve competitive performance in large-scale theorem proving tasks. We present BFS-Prover, a scalable expert iteration framework, featuring three key innovations. First, we implement strategic data filtering at each expert iteration round, excluding problems solvable via beam search node expansion to focus on harder cases. Second, we improve the sample efficiency of BFS through Direct Preference Optimization (DPO) applied to state-tactic pairs automatically annotated with compiler error feedback, refining the LLM’s policy to prioritize productive expansions. Third, we employ length normalization in BFS to encourage exploration of deeper proof paths. BFS-Prover achieves a score of 71.31 on the MiniF2F test set and therefore challenges the perceived necessity of complex tree search methods, demonstrating that BFS can achieve competitive performance when properly scaled.
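
摘要中的两个机制(最佳优先搜索与长度归一化)可以用一个小草图说明:每条证明路径按累积对数概率除以路径长度的 alpha 次幂打分,得分最高者先扩展;alpha < 1 会抬高较深路径的得分,从而鼓励深入探索。expand、is_proved 为假设接口,数值均为示意:

```python
import heapq
import math

def best_first_search(init_state, expand, is_proved, alpha=0.5, budget=1000):
    """expand(state) -> [(tactic, next_state, logprob), ...],由 LLM 策略给出(假设接口)。"""
    heap = [(0.0, 0, init_state, [])]   # (-score, 去重序号, 状态, 路径)
    tie = 0
    while heap and budget > 0:
        _, _, state, path = heapq.heappop(heap)
        budget -= 1
        if is_proved(state):
            return path
        for tactic, nxt, lp in expand(state):
            new_path = path + [(tactic, lp)]
            total_lp = sum(l for _, l in new_path)
            score = total_lp / (len(new_path) ** alpha)  # 长度归一化打分
            tie += 1
            heapq.heappush(heap, (-score, tie, nxt, new_path))
    return None

# 玩具示例:状态为整数,每步以概率 0.9 前进,目标是到达 3
path = best_first_search(0, lambda s: [("step", s + 1, math.log(0.9))], lambda s: s >= 3)
print(path)
```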

[AI-2] Lightweight Authenticated Task Offloading in 6G-Cloud Vehicular Twin Networks

链接: https://arxiv.org/abs/2502.03403
作者: Sarah Al-Shareeda,Fusun Ozguner,Keith Redmill,Trung Q. Duong,Berk Canberk
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 6 pages, 3 figures, IEEE Wireless Communications and Networking Conference (WCNC2025), Milan, Italy, 24-27 March 2025

点击查看摘要

Abstract:Task offloading management in 6G vehicular networks is crucial for maintaining network efficiency, particularly as vehicles generate substantial data. Integrating secure communication through authentication introduces additional computational and communication overhead, significantly impacting offloading efficiency and latency. This paper presents a unified framework incorporating lightweight Identity-Based Cryptographic (IBC) authentication into task offloading within cloud-based 6G Vehicular Twin Networks (VTNs). Utilizing Proximal Policy Optimization (PPO) in Deep Reinforcement Learning (DRL), our approach optimizes authenticated offloading decisions to minimize latency and enhance resource allocation. Performance evaluation under varying network sizes, task sizes, and data rates reveals that IBC authentication can reduce offloading efficiency by up to 50% due to the added overhead. Besides, increasing network size and task size can further reduce offloading efficiency by up to 91.7%. As a countermeasure, increasing the transmission data rate can improve the offloading performance by as much as 63%, even in the presence of authentication overhead. The code for the simulations and experiments detailed in this paper is available on GitHub for further reference and reproducibility [1].

[AI-3] Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin

链接: https://arxiv.org/abs/2502.03396
作者: Sarah Al-Shareeda,Yasar Celik,Bilge Bilgili,Ahmed Al-Dubai,Berk Canberk
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 8 pages, 8 figures, 5th IEEE Middle East North Africa COMMunications Conference (MENACOMM’25), Lebanon Feb 20-23, 2025

点击查看摘要

Abstract:Creating a Digital Twin (DT) for Healthcare Intelligent Transportation Systems (HITS) is a hot research trend focusing on enhancing HITS management, particularly in emergencies where ambulance vehicles must arrive at the crash scene on time and tracking their real-time location is crucial to the medical authorities. Despite the claim of real-time representation, a temporal misalignment persists between the physical and virtual domains, leading to discrepancies in the ambulance’s location representation. This study proposes integrating AI predictive models, specifically Support Vector Regression (SVR) and Deep Neural Networks (DNN), within a constructed mock DT data pipeline framework to anticipate the medical vehicle’s next location in the virtual world. These models align virtual representations with their physical counterparts, i.e., metaphorically offsetting the synchronization delay between the two worlds. Trained meticulously on a historical geospatial dataset, SVR and DNN exhibit exceptional prediction accuracy in MATLAB and Python environments. Through various testing scenarios, we visually demonstrate the efficacy of our methodology, showcasing SVR and DNN’s key role in significantly reducing the witnessed gap within the HITS’s DT. This transformative approach enhances real-time synchronization in emergency HITS by approximately 88% to 93%.
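
以下草图演示“用 SVR 根据最近 k 个历史位置预测下一时刻经纬度”的思路,用于补偿物理世界与虚拟孪生之间的同步延迟;轨迹数据为合成示例,窗口长度、核函数等参数均为示意,并非论文配置:

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# 合成一条带噪声的经纬度轨迹
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
track = np.stack([41.0 + 0.010 * t + 0.001 * rng.standard_normal(500),
                  29.0 + 0.008 * t + 0.001 * rng.standard_normal(500)], axis=1)

k = 5  # 滑动窗口:用最近 5 个位置预测下一个位置
X = np.stack([track[i:i + k].ravel() for i in range(len(track) - k)])
y = track[k:]

model = MultiOutputRegressor(SVR(kernel="rbf", C=10.0))
model.fit(X[:400], y[:400])
pred = model.predict(X[400:])
print("MAE:", np.abs(pred - y[400:]).mean())
```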

[AI-4] Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications

链接: https://arxiv.org/abs/2502.03395
作者: Issar Arab,Rodrigo Benitez
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series forecasting is essential for operational intelligence in the hospitality industry, and particularly challenging in large-scale, distributed systems. This study evaluates the performance of statistical, machine learning (ML), deep learning, and foundation models in forecasting hourly sales over a 14-day horizon using real-world data from a network of thousands of restaurants across Germany. The forecasting solution includes features such as weather conditions, calendar events, and time-of-day patterns. Results demonstrate the strong performance of ML-based meta-models and highlight the emerging potential of foundation models like Chronos and TimesFM, which deliver competitive performance with minimal feature engineering, leveraging only the pre-trained model (zero-shot inference). Additionally, a hybrid PySpark-Pandas approach proves to be a robust solution for achieving horizontal scalability in large-scale deployments.

[AI-5] Transformers and Their Roles as Time Series Foundation Models

链接: https://arxiv.org/abs/2502.03383
作者: Dennis Wu,Yihan He,Yuan Cao,Jianqing Fan,Han Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 34 Pages, 2 Figures

点击查看摘要

Abstract:We give a comprehensive analysis of transformers as time series foundation models, focusing on their approximation and generalization capabilities. First, we demonstrate that there exist transformers that fit an autoregressive model on input univariate time series via gradient descent. We then analyze MOIRAI, a multivariate time series foundation model capable of handling an arbitrary number of covariates. We prove that it is capable of automatically fitting autoregressive models with an arbitrary number of covariates, offering insights into its design and empirical success. For generalization, we establish bounds for pretraining when the data satisfies Dobrushin’s condition. Experiments support our theoretical findings, highlighting the efficacy of transformers as time series foundation models.

[AI-6] Learning from Active Human Involvement through Proxy Value Propagation NEURIPS2023

链接: https://arxiv.org/abs/2502.03369
作者: Zhenghao Peng,Wenjie Mo,Chenda Duan,Quanyi Li,Bolei Zhou
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: NeurIPS 2023 Spotlight. Project page: this https URL

点击查看摘要

Abstract:Learning from active human involvement enables the human subject to actively intervene and demonstrate to the AI agent during training. The interaction and corrective feedback from humans bring safety and AI alignment to the learning process. In this work, we propose a new reward-free active human involvement method called Proxy Value Propagation for policy optimization. Our key insight is that a proxy value function can be designed to express human intents, wherein state-action pairs in the human demonstration are labeled with high values, while those agents’ actions that are intervened receive low values. Through the TD-learning framework, labeled values of demonstrated state-action pairs are further propagated to other unlabeled data generated from agents’ exploration. The proxy value function thus induces a policy that faithfully emulates human behaviors. Human-in-the-loop experiments show the generality and efficiency of our method. With minimal modification to existing reinforcement learning algorithms, our method can learn to solve continuous and discrete control tasks with various human control devices, including the challenging task of driving in Grand Theft Auto V. Demo video and code are available at: this https URL
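
论文核心是一个无奖励的代理价值函数:人类示范的 (s, a) 标注高价值,被人类干预(纠正)的动作标注低价值,其余数据经 TD 自举把标注价值传播出去。下面给出一个表格版的极简示意;论文实际面向连续控制并使用神经网络,此处的取值与更新形式均为笔者的简化假设:

```python
import numpy as np

V_DEMO, V_INTERVENED = 1.0, -1.0        # 代理价值标注(示意值)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def pvp_update(s, a, s_next, human_demo=False, intervened=False):
    if human_demo:            # 人类示范:直接回归到高代理价值
        target = V_DEMO
    elif intervened:          # 被干预的智能体动作:回归到低代理价值
        target = V_INTERVENED
    else:                     # 普通探索数据:无奖励的 TD 自举,传播标注价值
        target = gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

pvp_update(0, 1, 2, human_demo=True)
pvp_update(0, 2, 3, intervened=True)
pvp_update(2, 0, 4)
print(Q[0])
```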

[AI-7] PalimpChat: Declarative and Interactive AI analytics

链接: https://arxiv.org/abs/2502.03368
作者: Chunwei Liu,Gerardo Vitagliano,Brandon Rose,Matt Prinz,David Andrew Samson,Michael Cafarella
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Thanks to the advances in generative architectures and large language models, data scientists can now code pipelines of machine-learning operations to process large collections of unstructured data. Recent progress has seen the rise of declarative AI frameworks (e.g., Palimpzest, Lotus, and DocETL) to build optimized and increasingly complex pipelines, but these systems often remain accessible only to expert programmers. In this demonstration, we present PalimpChat, a chat-based interface to Palimpzest that bridges this gap by letting users create and run sophisticated AI pipelines through natural language alone. By integrating Archytas, a ReAct-based reasoning agent, and Palimpzest’s suite of relational and LLM-based operators, PalimpChat provides a practical illustration of how a chat interface can make declarative AI frameworks truly accessible to non-experts. Our demo system is publicly available online. At SIGMOD’25, participants can explore three real-world scenarios–scientific discovery, legal discovery, and real estate search–or apply PalimpChat to their own datasets. In this paper, we focus on how PalimpChat, supported by the Palimpzest optimizer, simplifies complex AI workflows such as extracting and analyzing biomedical data.

[AI-8] Robust Autonomy Emerges from Self-Play

链接: https://arxiv.org/abs/2502.03349
作者: Marco Cusumano-Towner,David Hafner,Alex Hertzberg,Brody Huval,Aleksei Petrenko,Eugene Vinitsky,Erik Wijmans,Taylor Killian,Stuart Bowers,Ozan Sener,Philipp Krähenbühl,Vladlen Koltun
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Self-play has powered breakthroughs in two-player and multi-player games. Here we show that self-play is a surprisingly effective strategy in another domain. We show that robust and naturalistic driving emerges entirely from self-play in simulation at unprecedented scale – 1.6 billion km of driving. This is enabled by Gigaflow, a batched simulator that can synthesize and train on 42 years of subjective driving experience per hour on a single 8-GPU node. The resulting policy achieves state-of-the-art performance on three independent autonomous driving benchmarks. The policy outperforms the prior state of the art when tested on recorded real-world scenarios, amidst human drivers, without ever seeing human data during training. The policy is realistic when assessed against human references and achieves unprecedented robustness, averaging 17.5 years of continuous driving between incidents in simulation.

[AI-9] Simplifying Formal Proof-Generating Models with ChatGPT and Basic Searching Techniques

链接: https://arxiv.org/abs/2502.03321
作者: Sangjun Han,Taeil Hur,Youngmi Hur,Kathy Sangkyung Lee,Myungyoon Lee,Hyojae Lim
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The challenge of formal proof generation has a rich history, but with modern techniques, we may finally be at the stage of making actual progress in real-life mathematical problems. This paper explores the integration of ChatGPT and basic searching techniques to simplify generating formal proofs, with a particular focus on the miniF2F dataset. We demonstrate how combining a large language model like ChatGPT with a formal language such as Lean, which has the added advantage of being verifiable, enhances the efficiency and accessibility of formal proof generation. Despite its simplicity, our best-performing Lean-based model surpasses all known benchmarks with a 31.15% pass rate. We extend our experiments to include other datasets and employ alternative language models, showcasing our models’ comparable performance in diverse settings and allowing for a more nuanced analysis of our results. Our findings offer insights into AI-assisted formal proof generation, suggesting a promising direction for future research in formal mathematical proof.

[AI-10] STEMS: Spatial-Temporal Mapping Tool For Spiking Neural Networks

链接: https://arxiv.org/abs/2502.03287
作者: Sherif Eissa,Sander Stuijk,Floran De Putter,Andrea Nardi-Dei,Federico Corradi,Henk Corporaal
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 24 pages, 23 figures, under review at IEEE TC

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are promising bio-inspired third-generation neural networks. Recent research has trained deep SNN models with accuracy on par with Artificial Neural Networks (ANNs). Although the event-driven and sparse nature of SNNs show potential for more energy efficient computation than ANNs, SNN neurons have internal states which evolve over time. Keeping track of SNN states can significantly increase data movement and storage requirements, potentially losing its advantages with respect to ANNs. This paper investigates the energy effects of having neuron states, and how it is influenced by the chosen mapping to realistic hardware architectures with advanced memory hierarchies. Therefore, we develop STEMS, a mapping design space exploration tool for SNNs. STEMS models SNN’s stateful behavior and explores intra-layer and inter-layer mapping optimizations to minimize data movement, considering both spatial and temporal SNN dimensions. Using STEMS, we show up to 12x reduction in off-chip data movement and 5x reduction in energy (on top of intra-layer optimizations), on two event-based vision SNN benchmarks. Finally, neuron states may not be needed for all SNN layers. By optimizing neuron states for one of our benchmarks, we show 20x reduction in neuron states and 1.4x better performance without accuracy loss.

[AI-11] A Scalable Approach to Probabilistic Neuro-Symbolic Verification

链接: https://arxiv.org/abs/2502.03274
作者: Vasileios Manginas,Nikolaos Manginas,Edward Stevinson,Sherwin Varghese,Nikos Katzouris,Georgios Paliouras,Alessio Lomuscio
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neuro-Symbolic Artificial Intelligence (NeSy AI) has emerged as a promising direction for integrating neural learning with symbolic reasoning. In the probabilistic variant of such systems, a neural network first extracts a set of symbols from sub-symbolic input, which are then used by a symbolic component to reason in a probabilistic manner towards answering a query. In this work, we address the problem of formally verifying the robustness of such NeSy probabilistic reasoning systems, therefore paving the way for their safe deployment in critical domains. We analyze the complexity of solving this problem exactly, and show that it is \mathrm{NP}^{\#\mathrm{P}}-hard. To overcome this issue, we propose the first approach for approximate, relaxation-based verification of probabilistic NeSy systems. We demonstrate experimentally that the proposed method scales exponentially better than solver-based solutions and apply our technique to a real-world autonomous driving dataset, where we verify a safety property under large input dimensionalities and network sizes.

[AI-12] he Other Side of the Coin: Unveiling the Downsides of Model Aggregation in Federated Learning from a Layer-peeled Perspective

链接: https://arxiv.org/abs/2502.03231
作者: Guogang Zhu,Xuefeng Liu,Jianwei Niu,Shaojie Tang,Xinghao Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In federated learning (FL), model aggregation is a critical step by which multiple clients share their knowledge with one another. However, it is also widely recognized that the aggregated model, when sent back to each client, performs poorly on local data until after several rounds of local training. This temporary performance drop can potentially slow down the convergence of the FL model. Most research in FL regards this performance drop as an inherent cost of knowledge sharing among clients and does not give it special attention. While some studies directly focus on designing techniques to alleviate the issue, an in-depth investigation of the reasons behind this performance drop has yet to be conducted. To address this gap, we conduct a layer-peeled analysis of model aggregation across various datasets and model architectures. Our findings reveal that the performance drop can be attributed to two major consequences of the aggregation process: (1) it disrupts feature variability suppression in deep neural networks (DNNs), and (2) it weakens the coupling between features and subsequent classifiers. Based on these findings, we propose several simple yet effective strategies to mitigate the negative impacts of model aggregation while still enjoying the benefit it brings. To the best of our knowledge, our work is the first to conduct a layer-peeled analysis of model aggregation, potentially paving the way for the development of more effective FL algorithms.
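
为便于理解“逐层(layer-peeled)”视角下聚合如何作用于每一层参数,下面给出标准 FedAvg 逐层加权平均的最小草图(数据为随机示例);正是这种逐层平均会扰动各客户端本地学到的特征结构:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """client_weights: 每个客户端的逐层参数列表;client_sizes: 各自的本地样本数。"""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[l] * (n / total) for w, n in zip(client_weights, client_sizes))
        for l in range(n_layers)   # 逐层做样本量加权平均
    ]

rng = np.random.default_rng(0)
clients = [[rng.standard_normal((4, 4)), rng.standard_normal(4)] for _ in range(3)]
global_model = fedavg(clients, client_sizes=[100, 50, 150])
print(global_model[0].shape, global_model[1].shape)
```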

[AI-13] A Unified and General Humanoid Whole-Body Controller for Fine-Grained Locomotion

链接: https://arxiv.org/abs/2502.03206
作者: Yufei Xue,Wentao Dong,Minghuan Liu,Weinan Zhang,Jiangmiao Pang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: The first two authors contribute equally. Project page: this https URL

点击查看摘要

Abstract:Locomotion is a fundamental skill for humanoid robots. However, most existing works made locomotion a single, tedious, unextendable, and passive movement. This limits the kinematic capabilities of humanoid robots. In contrast, humans possess versatile athletic abilities-running, jumping, hopping, and finely adjusting walking parameters such as frequency, and foot height. In this paper, we investigate solutions to bring such versatility into humanoid locomotion and thereby propose HUGWBC: a unified and general humanoid whole-body controller for fine-grained locomotion. By designing a general command space in the aspect of tasks and behaviors, along with advanced techniques like symmetrical loss and intervention training for learning a whole-body humanoid controlling policy in simulation, HugWBC enables real-world humanoid robots to produce various natural gaits, including walking (running), jumping, standing, and hopping, with customizable parameters such as frequency, foot swing height, further combined with different body height, waist rotation, and body pitch, all in one single policy. Beyond locomotion, HUGWBC also supports real-time interventions from external upper-body controllers like teleoperation, enabling loco-manipulation while maintaining precise control under any locomotive behavior. Our experiments validate the high tracking accuracy and robustness of HUGWBC with/without upper-body intervention for all commands, and we further provide an in-depth analysis of how the various commands affect humanoid movement and offer insights into the relationships between these commands. To our knowledge, HugWBC is the first humanoid whole-body controller that supports such fine-grained locomotion behaviors with high robustness and flexibility.

[AI-14] CORTEX: A Cost-Sensitive Rule and Tree Extraction Method

链接: https://arxiv.org/abs/2502.03200
作者: Marija Kopanja,Miloš Savić,Luca Longo
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tree-based and rule-based machine learning models play pivotal roles in explainable artificial intelligence (XAI) due to their unique ability to provide explanations in the form of tree or rule sets that are easily understandable and interpretable, making them essential for applications in which trust in model decisions is necessary. These transparent models are typically used in surrogate modeling, a post-hoc XAI approach for explaining the logic of black-box models, enabling users to comprehend and trust complex predictive systems while maintaining competitive performance. This study proposes the Cost-Sensitive Rule and Tree Extraction (CORTEX) method, a novel rule-based XAI algorithm grounded in the multi-class cost-sensitive decision tree (CSDT) method. The original version of the CSDT is extended to classification problems with more than two classes by inducing the concept of an n-dimensional class-dependent cost matrix. The performance of CORTEX as a rule-extractor XAI method is compared to other post-hoc tree and rule extraction methods across several datasets with different numbers of classes. Several quantitative evaluation metrics are employed to assess the explainability of generated rule sets. Our findings demonstrate that CORTEX is competitive with other tree-based methods and can be superior to other rule-based methods across different datasets. The extracted rule sets suggest the advantages of using the CORTEX method over other methods by producing smaller rule sets with shorter rules on average across datasets with a diverse number of classes. Overall, the results underscore the potential of CORTEX as a powerful XAI tool for scenarios that require the generation of clear, human-understandable rules while maintaining good predictive performance.
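
CSDT 在树的构造中引入类别相关代价矩阵。下面的草图用“按类别平均误分代价给样本加权”来近似这一思想,并用 export_text 导出可读规则集;这只是一个示意性近似,CORTEX/CSDT 的真实分裂准则以论文为准,数据与代价数值均为虚构:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

cost_matrix = np.array([   # cost[i, j]:把真实类 i 误判为类 j 的代价(示意)
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [8.0, 2.0, 0.0],       # 第 2 类被误判的代价最高
])

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
y = rng.integers(0, 3, size=300)

# 简化处理:以“该类被误分类的平均代价”作为样本权重
class_risk = cost_matrix.sum(axis=1) / (cost_matrix.shape[1] - 1)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y, sample_weight=class_risk[y])
print(export_text(tree, feature_names=[f"f{i}" for i in range(4)]))
```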

[AI-15] Gotham Dataset 2025: A Reproducible Large-Scale IoT Network Dataset for Intrusion Detection and Security Research

链接: https://arxiv.org/abs/2502.03134
作者: Othmane Belarbi,Theodoros Spyridopoulos,Eirini Anthi,Omer Rana,Pietro Carnelli,Aftab Khan
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 16 pages, 7 figures, 4 tables. Submitted at the Data in Brief journal

点击查看摘要

Abstract:In this paper, a dataset of IoT network traffic is presented. Our dataset was generated by utilising the Gotham testbed, an emulated large-scale Internet of Things (IoT) network designed to provide a realistic and heterogeneous environment for network security research. The testbed includes 78 emulated IoT devices operating on various protocols, including MQTT, CoAP, and RTSP. Network traffic was captured in Packet Capture (PCAP) format using tcpdump, and both benign and malicious traffic were recorded. Malicious traffic was generated through scripted attacks, covering a variety of attack types, such as Denial of Service (DoS), Telnet Brute Force, Network Scanning, CoAP Amplification, and various stages of Command and Control (C&C) communication. The data were subsequently processed in Python for feature extraction using the Tshark tool, and the resulting data was converted to Comma Separated Values (CSV) format and labelled. The data repository includes the raw network traffic in PCAP format and the processed labelled data in CSV format. Our dataset was collected in a distributed manner, where network traffic was captured separately for each IoT device at the interface between the IoT gateway and the device. With its diverse traffic patterns and attack scenarios, this dataset provides a valuable resource for developing Intrusion Detection Systems and security mechanisms tailored to complex, large-scale IoT environments. The dataset is publicly available at Zenodo.

[AI-16] Metis: A Foundation Speech Generation Model with Masked Generative Pre-training

链接: https://arxiv.org/abs/2502.03128
作者: Yuancheng Wang,Jiachen Zheng,Junan Zhang,Xueyao Zhang,Huan Liao,Zhizheng Wu
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, 1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. 2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. 3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are available at this https URL.
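
掩码生成式预训练的目标可以用一个极简草图说明:随机掩码一部分离散语音 token,只在被掩码位置上计算重建交叉熵。词表大小、掩码比例与网络规模均为示意值,与 Metis 的真实配置无关:

```python
import torch
import torch.nn as nn

vocab, d_model, seq_len = 1024, 256, 128
MASK_ID = vocab  # 额外的 [MASK] token

embed = nn.Embedding(vocab + 1, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (8, seq_len))   # 假设已由 SSL 模型离散化得到
mask = torch.rand(8, seq_len) < 0.5              # 随机掩码 50% 的位置
inputs = tokens.masked_fill(mask, MASK_ID)

logits = head(encoder(embed(inputs)))
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # 只在掩码位置重建
loss.backward()
print(loss.item())
```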

[AI-17] Disentanglement in Difference: Directly Learning Semantically Disentangled Representations by Maximizing Inter-Factor Differences

链接: https://arxiv.org/abs/2502.03123
作者: Xingshen Zhang,Shuangrong Liu,Xintao Lu,Chaoran Pang,Lin Wang,Bo Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this study, Disentanglement in Difference (DiD) is proposed to address the inherent inconsistency between the statistical independence of latent variables and the goal of semantic disentanglement in disentanglement representation learning. Conventional disentanglement methods achieve disentangled representations by improving statistical independence among latent variables. However, the statistical independence of latent variables does not necessarily imply that they are semantically unrelated; thus, improving statistical independence does not always enhance disentanglement performance. To address the above issue, DiD is proposed to directly learn semantic differences rather than the statistical independence of latent variables. In DiD, a Difference Encoder is designed to measure the semantic differences; a contrastive loss function is established to facilitate inter-dimensional comparison. Both of them allow the model to directly differentiate and disentangle distinct semantic factors, thereby resolving the inconsistency between statistical independence and semantic disentanglement. Experimental results on the dSprites and 3DShapes datasets demonstrate that the proposed DiD outperforms existing mainstream methods across various disentanglement metrics.
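
论文未在摘要中给出损失的具体形式。以下按“同一潜变量维度变化产生的差异表示应相互靠近、不同维度的差异表示应相互远离”这一思想,给出一个多正样本 InfoNCE 风格的假设性草图;接口、变量名与温度取值均为笔者假设:

```python
import torch
import torch.nn.functional as F

def inter_factor_contrastive(diff_emb, dim_labels, tau=0.1):
    """diff_emb: (N, d) 差异编码器输出;dim_labels: (N,) 每个差异所对应变化的维度编号。"""
    z = F.normalize(diff_emb, dim=1)
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    logits = (z @ z.t() / tau).masked_fill(eye, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = (dim_labels[:, None] == dim_labels[None, :]) & ~eye   # 同维度 => 正样本对
    log_prob = log_prob.masked_fill(~pos, 0.0)
    return -(log_prob.sum(1) / pos.sum(1).clamp(min=1)).mean()

diff_emb = torch.randn(16, 32, requires_grad=True)
dim_labels = torch.randint(0, 4, (16,))
print(inter_factor_contrastive(diff_emb, dim_labels).item())
```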

[AI-18] At the Mahakumbh, Faith Met Tragedy: Computational Analysis of Stampede Patterns Using Machine Learning and NLP

链接: https://arxiv.org/abs/2502.03120
作者: Abhinav Pratap
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注: 6 pages, 4 figures, 3 tables

点击查看摘要

Abstract:This study employs machine learning, historical analysis, and natural language processing (NLP) to examine recurring lethal stampedes at India's mass religious gatherings, focusing on the 2025 Mahakumbh tragedy in Prayagraj (48+ deaths) and its 1954 predecessor (700+ casualties). Through computational modeling of crowd dynamics and administrative records, it investigates how systemic vulnerabilities contribute to these disasters. Temporal trend analysis identifies persistent choke points, with narrow riverbank access routes linked to 92% of past stampede sites and lethal crowd densities (eight or more persons per square meter) recurring during spiritually significant moments like Mauni Amavasya. NLP analysis of seven decades of inquiry reports reveals cyclical administrative failures, where VIP route prioritization diverted safety resources in both 1954 and 2025, exacerbating fatalities. Statistical modeling demonstrates how ritual urgency overrides risk perception, leading to panic propagation patterns that mirror historical incidents. Findings support the Institutional Amnesia Theory, highlighting how disaster responses remain reactionary rather than preventive. By correlating archival patterns with computational crowd behavior analysis, this study frames stampedes as a collision of infrastructure limitations, socio-spiritual urgency, and governance inertia, challenging disaster discourse to address how spiritual economies normalize preventable mortality.

[AI-19] Bellman Error Centering

链接: https://arxiv.org/abs/2502.03104
作者: Xingguo Chen,Yu Gong,Shangdong Yang,Wenhao Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper revisits the recently proposed reward centering algorithms, including simple reward centering (SRC) and value-based reward centering (VRC), and points out that SRC is indeed reward centering, while VRC is essentially Bellman error centering (BEC). Based on BEC, we provide the centered fixpoint for tabular value functions, as well as the centered TD fixpoint for linear value function approximation. We design the on-policy CTD algorithm and the off-policy CTDC algorithm, and prove the convergence of both algorithms. Finally, we experimentally validate the stability of our proposed algorithms. Bellman error centering facilitates the extension to various reinforcement learning algorithms.
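
基于摘要,可以把 on-policy 的 CTD 理解为:用同一个中心化后的 TD 误差同时更新值函数与中心估计 r_bar。下面是一个表格版草图;步长与折扣取值仅为示意,算法的准确形式以论文为准:

```python
import numpy as np

n_states = 5
V = np.zeros(n_states)
r_bar = 0.0                       # Bellman 误差中心(平均奖励)的运行估计
alpha, beta, gamma = 0.1, 0.01, 0.99

def ctd_update(s, r, s_next):
    global r_bar
    delta = r - r_bar + gamma * V[s_next] - V[s]   # 中心化后的 TD 误差
    V[s] += alpha * delta
    r_bar += beta * delta                          # 用同一误差更新中心估计
    return delta

rng = np.random.default_rng(0)
s = 0
for _ in range(100):
    s_next = int(rng.integers(n_states))
    ctd_update(s, rng.normal(1.0, 0.1), s_next)
    s = s_next
print(V, r_bar)
```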

[AI-20] E-3SFC: Communication-Efficient Federated Learning with Double-way Features Synthesizing

链接: https://arxiv.org/abs/2502.03092
作者: Yuhao Zhou,Yuxin Tian,Mingjia Shi,Yuanxi Li,Yanan Sun,Qing Ye,Jiancheng Lv
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by TNNLS. arXiv admin note: text overlap with arXiv:2302.13562

点击查看摘要

Abstract:The exponential growth in model sizes has significantly increased the communication burden in Federated Learning (FL). Existing methods to alleviate this burden by transmitting compressed gradients often face high compression errors, which slow down the model’s convergence. To simultaneously achieve high compression effectiveness and lower compression errors, we study the gradient compression problem from a novel perspective. Specifically, we propose a systematical algorithm termed Extended Single-Step Synthetic Features Compressing (E-3SFC), which consists of three sub-components, i.e., the Single-Step Synthetic Features Compressor (3SFC), a double-way compression algorithm, and a communication budget scheduler. First, we regard the process of gradient computation of a model as decompressing gradients from corresponding inputs, while the inverse process is considered as compressing the gradients. Based on this, we introduce a novel gradient compression method termed 3SFC, which utilizes the model itself as a decompressor, leveraging training priors such as model weights and objective functions. 3SFC compresses raw gradients into tiny synthetic features in a single-step simulation, incorporating error feedback to minimize overall compression errors. To further reduce communication overhead, 3SFC is extended to E-3SFC, allowing double-way compression and dynamic communication budget scheduling. Our theoretical analysis under both strongly convex and non-convex conditions demonstrates that 3SFC achieves linear and sub-linear convergence rates with aggregation noise. Extensive experiments across six datasets and six models reveal that 3SFC outperforms state-of-the-art methods by up to 13.4% while reducing communication costs by 111.6 times. These findings suggest that 3SFC can significantly enhance communication efficiency in FL without compromising model performance.

[AI-21] Implementing Large Quantum Boltzmann Machines as Generative AI Models for Dataset Balancing

链接: https://arxiv.org/abs/2502.03086
作者: Salvatore Sinno,Markus Bertl,Arati Sahoo,Bhavika Bhalgamiya,Thomas Groß,Nicholas Chancellor
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Quantum Physics (quant-ph)
*备注: accapted at IEEE International Conference on Next Generation Information System Engineering

点击查看摘要

Abstract:This study explores the implementation of large Quantum Restricted Boltzmann Machines (QRBMs), a key advancement in Quantum Machine Learning (QML), as generative models on D-Wave’s Pegasus quantum hardware to address dataset imbalance in Intrusion Detection Systems (IDS). By leveraging Pegasus’s enhanced connectivity and computational capabilities, a QRBM with 120 visible and 120 hidden units was successfully embedded, surpassing the limitations of default embedding tools. The QRBM synthesized over 1.6 million attack samples, achieving a balanced dataset of over 4.2 million records. Comparative evaluations with traditional balancing methods, such as SMOTE and RandomOversampler, revealed that QRBMs produced higher-quality synthetic samples, significantly improving detection rates, precision, recall, and F1 score across diverse classifiers. The study underscores the scalability and efficiency of QRBMs, completing balancing tasks in milliseconds. These findings highlight the transformative potential of QML and QRBMs as next-generation tools in data preprocessing, offering robust solutions for complex computational challenges in modern information systems.

[AI-22] Kozax: Flexible and Scalable Genetic Programming in JAX

链接: https://arxiv.org/abs/2502.03047
作者: Sigur de Vries,Sander W. Keemink,Marcel A. J. van Gerven
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 5 figures, 3 tables, 1 algorithm, 10 pages

点击查看摘要

Abstract:Genetic programming is an optimization algorithm inspired by natural selection which automatically evolves the structure of computer programs. The resulting computer programs are interpretable and efficient compared to black-box models with fixed structure. The fitness evaluation in genetic programming suffers from high computational requirements, limiting the performance on difficult problems. To reduce the runtime, many implementations of genetic programming require a specific data format, making the applicability limited to specific problem classes. Consequently, there is no efficient genetic programming framework that is usable for a wide range of tasks. To this end, we developed Kozax, a genetic programming framework that evolves symbolic expressions for arbitrary problems. We implemented Kozax using JAX, a framework for high-performance and scalable machine learning, which allows the fitness evaluation to scale efficiently to large populations or datasets on GPU. Furthermore, Kozax offers constant optimization, custom operator definition and simultaneous evolution of multiple trees. We demonstrate successful applications of Kozax to discover equations of natural laws, recover equations of hidden dynamic variables and evolve a control policy. Overall, Kozax provides a general, fast, and scalable library to optimize white-box solutions in the realm of scientific computing.

[AI-23] The Cake that is Intelligence and Who Gets to Bake it: An AI Analogy and its Implications for Participation

链接: https://arxiv.org/abs/2502.03038
作者: Martin Mundt,Anaelia Ovalle,Felix Friedrich,Pranav Agrawal,Subarnaduti Paul,Manuel Brack,Kristian Kersting,William Agnew
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In a widely popular analogy by Turing Award Laureate Yann LeCun, machine intelligence has been compared to cake - where unsupervised learning forms the base, supervised learning adds the icing, and reinforcement learning is the cherry on top. We expand this ‘cake that is intelligence’ analogy from a simple structural metaphor to the full life-cycle of AI systems, extending it to sourcing of ingredients (data), conception of recipes (instructions), the baking process (training), and the tasting and selling of the cake (evaluation and distribution). Leveraging our re-conceptualization, we describe each step’s entailed social ramifications and how they are bounded by statistical assumptions within machine learning. Whereas these technical foundations and social impacts are deeply intertwined, they are often studied in isolation, creating barriers that restrict meaningful participation. Our re-conceptualization paves the way to bridge this gap by mapping where technical foundations interact with social outcomes, highlighting opportunities for cross-disciplinary dialogue. Finally, we conclude with actionable recommendations at each stage of the metaphorical AI cake’s life-cycle, empowering prospective AI practitioners, users, and researchers, with increased awareness and ability to engage in broader AI discourse.

[AI-24] xai_evals: A Framework for Evaluating Post-Hoc Local Explanation Methods

链接: https://arxiv.org/abs/2502.03014
作者: Pratinav Seth,Yashwardhan Rathore,Neeraj Kumar Singh,Chintan Chitroda,Vinay Kumar Sankarapu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The growing complexity of machine learning and deep learning models has led to an increased reliance on opaque “black box” systems, making it difficult to understand the rationale behind predictions. This lack of transparency is particularly challenging in high-stakes applications where interpretability is as important as accuracy. Post-hoc explanation methods are commonly used to interpret these models, but they are seldom rigorously evaluated, raising concerns about their reliability. The Python package xai_evals addresses this by providing a comprehensive framework for generating, benchmarking, and evaluating explanation methods across both tabular and image data modalities. It integrates popular techniques like SHAP, LIME, Grad-CAM, Integrated Gradients (IG), and Backtrace, while supporting evaluation metrics such as faithfulness, sensitivity, and robustness. xai_evals enhances the interpretability of machine learning models, fostering transparency and trust in AI systems. The library is open-sourced at this https URL .
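
As an illustration of what such a framework measures, here is a minimal deletion-style faithfulness metric in plain Python (a generic sketch, not the xai_evals API; the model and attributions are toy stand-ins):

```python
# Deletion-style faithfulness: if an explanation is faithful, removing the
# features it ranks highest should change the prediction the most.
import numpy as np

def faithfulness(model, x, attributions, k=3, baseline=0.0):
    top = np.argsort(-np.abs(attributions))[:k]  # most important features
    x_masked = x.copy()
    x_masked[top] = baseline
    return model(x) - model(x_masked)            # larger drop = more faithful

model = lambda x: float(x @ np.array([0.5, -0.2, 0.9, 0.0, 0.1]))
x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
attrs = np.array([0.5, -0.4, -0.9, 0.0, 0.3])    # e.g., from SHAP or LIME
print(faithfulness(model, x, attrs))
```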

[AI-25] FedMobileAgent: Training Mobile Agents Using Decentralized Self-Sourced Data from Diverse Users

链接: https://arxiv.org/abs/2502.02982
作者: Wenhao Wang,Zijie Yu,William Liu,Rui Ye,Tian Jin,Siheng Chen,Yanfeng Wang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancement of mobile agents has opened new opportunities for automating tasks on mobile devices. Training these agents requires large-scale, high-quality data, which is costly to collect with human labor. Given the vast number of mobile phone users worldwide, if automated data collection from them is feasible, the resulting data volume and the subsequently trained mobile agents could reach unprecedented levels. Nevertheless, two major challenges arise: (1) extracting high-level and low-level user instructions without involving humans, and (2) utilizing distributed data from diverse users while preserving privacy. To tackle these challenges, we propose FedMobileAgent, a collaborative framework that trains mobile agents using self-sourced data from diverse users. Specifically, it includes two techniques. First, we propose Auto-Annotation, which enables the automatic collection of high-quality datasets during users’ routine phone usage with minimal cost. Second, we introduce adapted aggregation to improve federated training of mobile agents on non-IID user data, by incorporating both episode- and step-level distributions. In distributed settings, FedMobileAgent achieves performance comparable to centralized human-annotated models at less than 0.02% of the cost, highlighting its potential for real-world applications.
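
A minimal sketch of the adapted-aggregation idea (the interpolation between episode- and step-level counts below is an illustrative assumption, not the paper's exact formula):

```python
# Federated averaging with weights mixing episode- and step-level data
# distributions, instead of a single sample count as in plain FedAvg.
import numpy as np

def adapted_aggregate(client_params, n_episodes, n_steps, alpha=0.5):
    """Average client parameter vectors with weights interpolating between
    episode-level and step-level distributions."""
    ep = np.asarray(n_episodes, dtype=float)
    st = np.asarray(n_steps, dtype=float)
    w = alpha * ep / ep.sum() + (1 - alpha) * st / st.sum()
    return sum(wi * p for wi, p in zip(w, client_params))

clients = [np.random.randn(4) for _ in range(3)]
print(adapted_aggregate(clients, n_episodes=[10, 5, 20], n_steps=[300, 900, 400]))
```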

[AI-26] TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics ICLR2025

链接: https://arxiv.org/abs/2502.02975
作者: Lu Yi,Jie Peng,Yanping Zheng,Fengran Mo,Zhewei Wei,Yuhang Ye,Yue Zixuan,Zengfeng Huang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: published at ICLR 2025

点击查看摘要

Abstract:Future link prediction is a fundamental challenge in various real-world dynamic systems. To address this, numerous temporal graph neural networks (temporal GNNs) and benchmark datasets have been developed. However, these datasets often feature excessive repeated edges and lack complex sequential dynamics, a key characteristic inherent in many real-world applications such as recommender systems and "Who-To-Follow" on social networks. This oversight has led existing methods to inadvertently downplay the importance of learning sequential dynamics, focusing primarily on predicting repeated edges. In this study, we demonstrate that existing methods, such as GraphMixer and DyGFormer, are inherently incapable of learning simple sequential dynamics, such as "a user who has followed OpenAI and Anthropic is more likely to follow AI at Meta next." Motivated by this issue, we introduce the Temporal Graph Benchmark with Sequential Dynamics (TGB-Seq), a new benchmark carefully curated to minimize repeated edges, challenging models to learn sequential dynamics and generalize to unseen edges. TGB-Seq comprises large real-world datasets spanning diverse domains, including e-commerce interactions, movie ratings, business reviews, social networks, citation networks and web link networks. Benchmarking experiments reveal that current methods usually suffer significant performance degradation and incur substantial training costs on TGB-Seq, posing new challenges and opportunities for future research. TGB-Seq datasets, leaderboards, and example codes are available at this https URL.
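
A toy illustration of the statistic motivating the benchmark: the fraction of temporal edges that repeat an earlier (src, dst) pair, which is what existing datasets let models exploit:

```python
# Fraction of edges in a time-ordered edge list that repeat an earlier pair;
# high values let models score well by memorizing repeats rather than
# learning sequential dynamics.
def repeat_edge_ratio(edges):
    """edges: time-ordered list of (src, dst) pairs."""
    seen, repeats = set(), 0
    for e in edges:
        if e in seen:
            repeats += 1
        seen.add(e)
    return repeats / len(edges)

print(repeat_edge_ratio([(1, 2), (1, 3), (1, 2), (2, 3), (1, 2)]))  # 0.4
```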

[AI-27] FACTER: Fairness-Aware Conformal Thresholding and Prompt Engineering for Enabling Fair LLM -Based Recommender Systems

链接: https://arxiv.org/abs/2502.02966
作者: Arya Fayyazi,Mehdi Kamal,Massoud Pedram
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose FACTER, a fairness-aware framework for LLM-based recommendation systems that integrates conformal prediction with dynamic prompt engineering. By introducing an adaptive semantic variance threshold and a violation-triggered mechanism, FACTER automatically tightens fairness constraints whenever biased patterns emerge. We further develop an adversarial prompt generator that leverages historical violations to reduce repeated demographic biases without retraining the LLM. Empirical results on MovieLens and Amazon show that FACTER substantially reduces fairness violations (up to 95.5%) while maintaining strong recommendation accuracy, revealing semantic variance as a potent proxy of bias.
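
A minimal sketch of the split-conformal calibration underlying such a threshold (standard conformal quantile; FACTER's adaptive semantic-variance threshold is more elaborate):

```python
# Split conformal prediction: the violation threshold is the finite-sample
# corrected (1 - alpha) quantile of held-out nonconformity scores.
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.1):
    """Return the (1 - alpha) empirical quantile with the conformal
    finite-sample correction."""
    n = len(calibration_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(calibration_scores, min(q, 1.0))

scores = np.abs(np.random.randn(500))  # e.g., semantic variance across groups
tau = conformal_threshold(scores, alpha=0.1)
print("tighten fairness constraints when a new score exceeds", tau)
```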

[AI-28] (Neural-Symbolic) Machine Learning for Inconsistency Measurement

链接: https://arxiv.org/abs/2502.02963
作者: Sven Weinzierl,Carl Cora
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We present machine-learning-based approaches for determining the degree of inconsistency – which is a numerical value – for propositional logic knowledge bases. Specifically, we present regression- and neural-based models that learn to predict the values that the inconsistency measures I_MI and I_at would assign to propositional logic knowledge bases. Our main motivation is that computing these values conventionally can be hard complexity-wise. As an important addition, we use specific postulates, that is, properties, of the underlying inconsistency measures to infer symbolic rules, which we combine with the learning-based models in the form of constraints. We perform various experiments and show that a) predicting the degree values is feasible in many situations, and b) including the symbolic constraints deduced from the rationality postulates increases the prediction quality.

[AI-29] Large Language Model Guided Self-Debugging Code Generation

链接: https://arxiv.org/abs/2502.02928
作者: Muntasir Adnan,Zhiwei Xu,Carlos C. N. Kuhn
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Automated code generation is gaining significant importance in intelligent computer programming and system deployment. However, current approaches often face challenges in computational efficiency and lack robust mechanisms for code parsing and error correction. In this work, we propose a novel framework, PyCapsule, with a simple yet effective two-agent pipeline and efficient self-debugging modules for Python code generation. PyCapsule features sophisticated prompt inference, iterative error handling, and case testing, ensuring high generation stability, safety, and correctness. Empirically, PyCapsule achieves up to 5.7% improvement of success rate on HumanEval, 10.3% on HumanEval-ET, and 24.4% on BigCodeBench compared to state-of-the-art methods. We also observe a decrease in normalized success rate given more self-debugging attempts, potentially affected by limited and noisy error feedback in retention. PyCapsule demonstrates broader impacts on advancing lightweight and efficient code generation for artificial intelligence systems.
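
A minimal sketch of the generate-test-feedback loop that self-debugging pipelines like this rely on (the generate_code and run_tests helpers are hypothetical placeholders, not PyCapsule's implementation):

```python
# Self-debugging skeleton: generate code, run tests, and feed the error log
# back into the next generation attempt until the tests pass or the budget
# is exhausted.
def self_debug(task, generate_code, run_tests, max_attempts=3):
    feedback = ""
    for attempt in range(max_attempts):
        code = generate_code(task, feedback)   # LLM call (placeholder)
        ok, error_log = run_tests(code)        # execute test cases (placeholder)
        if ok:
            return code
        feedback = f"Attempt {attempt + 1} failed:\n{error_log}"
    return code  # best effort after the attempt budget runs out
```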

[AI-30] TopoCL: Topological Contrastive Learning for Time Series

链接: https://arxiv.org/abs/2502.02924
作者: Namwoo Kim,Hyungryul Baik,Yoonjin Yoon
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to TNNLS (under review)

点击查看摘要

Abstract:Universal time series representation learning is challenging but valuable in real-world applications such as classification, anomaly detection, and forecasting. Recently, contrastive learning (CL) has been actively explored to tackle time series representation. However, a key challenge is that the data augmentation process in CL can distort seasonal patterns or temporal dependencies, inevitably leading to a loss of semantic information. To address this challenge, we propose Topological Contrastive Learning for time series (TopoCL). TopoCL mitigates such information loss by incorporating persistent homology, which captures the topological characteristics of data that remain invariant under transformations. In this paper, we treat the temporal and topological properties of time series data as distinct modalities. Specifically, we compute persistent homology to construct topological features of time series data, representing them in persistence diagrams. We then design a neural network to encode these persistence diagrams. Our approach jointly optimizes CL within the time modality and time-topology correspondence, promoting a comprehensive understanding of both temporal semantics and topological properties of time series. We conduct extensive experiments on four downstream tasks: classification, anomaly detection, forecasting, and transfer learning. The results demonstrate that TopoCL achieves state-of-the-art performance.
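
A minimal sketch of the contrastive objective aligning the two modalities (standard InfoNCE; TopoCL's exact time-topology correspondence loss may differ):

```python
# InfoNCE over paired embeddings: the time-series encoding and the
# persistence-diagram encoding of the same series form a positive pair.
import torch
import torch.nn.functional as F

def info_nce(z_time, z_topo, tau=0.1):
    z1 = F.normalize(z_time, dim=1)
    z2 = F.normalize(z_topo, dim=1)
    logits = z1 @ z2.T / tau               # scaled cosine similarities
    labels = torch.arange(z1.size(0))      # i-th row matches i-th column
    return F.cross_entropy(logits, labels)

print(info_nce(torch.randn(8, 32), torch.randn(8, 32)))
```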

[AI-31] Adaptive Budget Optimization for Multichannel Advertising Using Combinatorial Bandits

链接: https://arxiv.org/abs/2502.02920
作者: Briti Gangopadhyay,Zhao Wang,Alberto Silvio Chiappa,Shingo Takamatsu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Effective budget allocation is crucial for optimizing the performance of digital advertising campaigns. However, the development of practical budget allocation algorithms remains limited, primarily due to the lack of public datasets and comprehensive simulation environments capable of verifying the intricacies of real-world advertising. While multi-armed bandit (MAB) algorithms have been extensively studied, their efficacy diminishes in non-stationary environments where quick adaptation to changing market dynamics is essential. In this paper, we advance the field of budget allocation in digital advertising by introducing three key contributions. First, we develop a simulation environment designed to mimic multichannel advertising campaigns over extended time horizons, incorporating logged real-world data. Second, we propose an enhanced combinatorial bandit budget allocation strategy that leverages a saturating mean function and a targeted exploration mechanism with change-point detection. This approach dynamically adapts to changing market conditions, improving allocation efficiency by filtering target regions based on domain knowledge. Finally, we present both theoretical analysis and empirical results, demonstrating that our method consistently outperforms baseline strategies, achieving higher rewards and lower regret across multiple real-world campaigns.
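
A minimal sketch of UCB-style allocation with a saturating reward curve (the saturating form and exploration bonus here are illustrative assumptions, not the paper's exact strategy):

```python
# UCB over advertising channels whose expected reward saturates with spend,
# capturing the diminishing returns a saturating mean function models.
import numpy as np

def saturating_mean(budget, scale, rate):
    return scale * (1.0 - np.exp(-rate * budget))  # diminishing returns

def ucb_allocate(means, pulls, t, c=1.0):
    if (pulls == 0).any():
        return int(np.argmin(pulls))               # try each channel once first
    bonus = c * np.sqrt(np.log(t + 1) / pulls)
    return int(np.argmax(means + bonus))           # channel to fund next

est_means, pulls = np.zeros(3), np.zeros(3)
for t in range(100):                               # one budget unit per round
    arm = ucb_allocate(est_means, pulls, t)
    reward = saturating_mean(1.0, scale=[1, 2, 1.5][arm], rate=0.8) \
             + 0.1 * np.random.randn()
    pulls[arm] += 1
    est_means[arm] += (reward - est_means[arm]) / pulls[arm]
print(pulls)
```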

[AI-32] Interactive Symbolic Regression through Offline Reinforcement Learning: A Co-Design Framework

链接: https://arxiv.org/abs/2502.02917
作者: Yuan Tian,Wenqi Zhou,Michele Viscione,Hao Dong,David Kammer,Olga Fink
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
*备注: arXiv admin note: text overlap with arXiv:2402.05306

点击查看摘要

Abstract:Symbolic Regression (SR) holds great potential for uncovering underlying mathematical and physical relationships from observed data. However, the vast combinatorial space of possible expressions poses significant challenges for both online search methods and pre-trained transformer models. Additionally, current state-of-the-art approaches typically do not consider the integration of domain experts’ prior knowledge and do not support iterative interactions with the model during the equation discovery process. To address these challenges, we propose the Symbolic Q-network (Sym-Q), an advanced interactive framework for large-scale symbolic regression. Unlike previous large-scale transformer-based SR approaches, Sym-Q leverages reinforcement learning without relying on a transformer-based decoder. This formulation allows the agent to learn through offline reinforcement learning using any type of tree encoder, enabling more efficient training and inference. Furthermore, we propose a co-design mechanism, where the reinforcement learning-based Sym-Q facilitates effective interaction with domain experts at any stage of the equation discovery process. Users can dynamically modify generated nodes of the expression, collaborating with the agent to tailor the mathematical expression to best fit the problem and align with the assumed physical laws, particularly when there is prior partial knowledge of the expected behavior. Our experiments demonstrate that the pre-trained Sym-Q surpasses existing SR algorithms on the challenging SSDNC benchmark. Moreover, we experimentally show on real-world cases that its performance can be further enhanced by the interactive co-design mechanism, with Sym-Q achieving greater performance gains than other state-of-the-art models. Our reproducible code is available at this https URL.

[AI-33] MobiCLR: Mobility Time Series Contrastive Learning for Urban Region Representations

链接: https://arxiv.org/abs/2502.02912
作者: Namwoo Kim,Takahiro Yabe,Chanyoung Park,Yoonjin Yoon
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to Information Sciences (under review)

点击查看摘要

Abstract:Recently, learning effective representations of urban regions has gained significant attention as a key approach to understanding urban dynamics and advancing smarter cities. Existing approaches have demonstrated the potential of leveraging mobility data to generate latent representations, providing valuable insights into the intrinsic characteristics of urban areas. However, incorporating the temporal dynamics and detailed semantics inherent in human mobility patterns remains underexplored. To address this gap, we propose a novel urban region representation learning model, Mobility Time Series Contrastive Learning for Urban Region Representations (MobiCLR), designed to capture semantically meaningful embeddings from inflow and outflow mobility patterns. MobiCLR uses contrastive learning to enhance the discriminative power of its representations, applying an instance-wise contrastive loss to capture distinct flow-specific characteristics. Additionally, we develop a regularizer to align output features with these flow-specific representations, enabling a more comprehensive understanding of mobility dynamics. To validate our model, we conduct extensive experiments in Chicago, New York, and Washington, D.C. to predict income, educational attainment, and social vulnerability. The results demonstrate that our model outperforms state-of-the-art models.

[AI-34] Policy Abstraction and Nash Refinement in Tree-Exploiting PSRO

链接: https://arxiv.org/abs/2502.02901
作者: Christine Konicki,Mithun Chakraborty,Michael P. Wellman
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Policy Space Response Oracles (PSRO) interleaves empirical game-theoretic analysis with deep reinforcement learning (DRL) to solve games too complex for traditional analytic methods. Tree-exploiting PSRO (TE-PSRO) is a variant of this approach that iteratively builds a coarsened empirical game model in extensive form using data obtained from querying a simulator that represents a detailed description of the game. We make two main methodological advances to TE-PSRO that enhance its applicability to complex games of imperfect information. First, we introduce a scalable representation for the empirical game tree where edges correspond to implicit policies learned through DRL. These policies cover conditions in the underlying game abstracted in the game model, supporting sustainable growth of the tree over epochs. Second, we leverage extensive form in the empirical model by employing refined Nash equilibria to direct strategy exploration. To enable this, we give a modular and scalable algorithm based on generalized backward induction for computing a subgame perfect equilibrium (SPE) in an imperfect-information game. We experimentally evaluate our approach on a suite of games including an alternating-offer bargaining game with outside offers; our results demonstrate that TE-PSRO converges toward equilibrium faster when new strategies are generated based on SPE rather than Nash equilibrium, and with reasonable time/memory requirements for the growing empirical model.
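
A minimal sketch of backward induction for a subgame perfect equilibrium (perfect-information tree for clarity; the paper's generalized algorithm handles imperfect information):

```python
# Backward induction: at each decision node, the acting player picks the
# child whose (recursively computed) payoff vector is best for them.
def backward_induction(node):
    """node: ('leaf', payoffs) or ('move', player, [children])."""
    if node[0] == 'leaf':
        return node[1], []
    _, player, children = node
    results = [backward_induction(ch) for ch in children]
    best = max(range(len(results)), key=lambda i: results[i][0][player])
    payoffs, plan = results[best]
    return payoffs, [best] + plan

tree = ('move', 0, [
    ('move', 1, [('leaf', (3, 1)), ('leaf', (0, 0))]),
    ('leaf', (2, 2)),
])
print(backward_induction(tree))  # SPE payoffs and on-path action choices
```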

[AI-35] SensorChat: Answering Qualitative and Quantitative Questions during Long-Term Multimodal Sensor Interactions

链接: https://arxiv.org/abs/2502.02883
作者: Xiaofan Yu,Lanxiang Hu,Benjamin Reichman,Dylan Chu,Rushil Chandrupatla,Xiyuan Zhang,Larry Heck,Tajana Rosing
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Under review

点击查看摘要

Abstract:Natural language interaction with sensing systems is crucial for enabling all users to comprehend sensor data and its impact on their everyday lives. However, existing systems, which typically operate in a Question Answering (QA) manner, are significantly limited in terms of the duration and complexity of sensor data they can handle. In this work, we introduce SensorChat, the first end-to-end QA system designed for long-term sensor monitoring with multimodal and high-dimensional data including time series. SensorChat effectively answers both qualitative (requiring high-level reasoning) and quantitative (requiring accurate responses derived from sensor data) questions in real-world scenarios. To achieve this, SensorChat uses an innovative three-stage pipeline that includes question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) for intuitive human interactions and to guide the sensor data query process. Unlike existing multimodal LLMs, SensorChat incorporates an explicit query stage to precisely extract factual information from long-duration sensor data. We implement SensorChat and demonstrate its capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves up to 26% higher answer accuracy than state-of-the-art systems on quantitative questions. Additionally, a user study with eight volunteers highlights SensorChat’s effectiveness in handling qualitative and open-ended questions.

[AI-36] Vertical Federated Learning for Failure-Cause Identification in Disaggregated Microwave Networks

链接: https://arxiv.org/abs/2502.02874
作者: Fatih Temiz,Memedhe Ibrahimi,Francesco Musumeci,Claudio Passera,Massimo Tornatore
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 6 pages, 7 figure, IEEE ICC 2025

点击查看摘要

Abstract:Machine Learning (ML) has proven to be a promising solution to provide novel scalable and efficient fault management solutions in modern 5G-and-beyond communication networks. In the context of microwave networks, ML-based solutions have received significant attention. However, current solutions can only be applied to monolithic scenarios in which a single entity (e.g., an operator) manages the entire network. As current network architectures move towards disaggregated communication platforms in which multiple operators and vendors collaborate to achieve cost-efficient and reliable network management, new ML-based approaches for fault management must tackle the challenges of sharing business-critical information due to potential conflicts of interest. In this study, we explore the application of Federated Learning in disaggregated microwave networks for failure-cause identification using a real microwave hardware failure dataset. In particular, we investigate the application of two Vertical Federated Learning (VFL) approaches, namely Split Neural Networks (SplitNNs) and Federated Learning based on Gradient Boosting Decision Trees (FedTree), in different multi-vendor deployment scenarios, and we compare them to a centralized scenario where data is managed by a single entity. Our experimental results show that VFL-based scenarios can achieve F1-Scores consistently within at most a 1% gap with respect to a centralized scenario, regardless of the deployment strategies or model types, while also ensuring minimal leakage of sensitive data.
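
A minimal PyTorch sketch of the SplitNN idea (dimensions and layer sizes are illustrative): each vendor keeps a bottom model over its own features, and only intermediate embeddings cross the trust boundary:

```python
# Vertical split learning: vendor-local bottom models produce embeddings,
# which are concatenated and classified by a shared top model; gradients
# flow back through both bottom models without exposing raw features.
import torch
import torch.nn as nn

bottom_a = nn.Linear(8, 4)   # vendor A's private features -> embedding
bottom_b = nn.Linear(6, 4)   # vendor B's private features -> embedding
top = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 5))  # 5 failure causes

xa, xb = torch.randn(32, 8), torch.randn(32, 6)
y = torch.randint(0, 5, (32,))

logits = top(torch.cat([bottom_a(xa), bottom_b(xb)], dim=1))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()              # gradients reach both vendors' bottom models
```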

[AI-37] OmniRL: In-Context Reinforcement Learning by Large-Scale Meta-Training in Randomized Worlds

链接: https://arxiv.org/abs/2502.02869
作者: Fan Wang,Pengtao Shao,Yiming Zhang,Bo Yu,Shaoshan Liu,Ning Ding,Yang Cao,Yu Kang,Haifeng Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:We introduce OmniRL, a highly generalizable in-context reinforcement learning (ICRL) model that is meta-trained on hundreds of thousands of diverse tasks. These tasks are procedurally generated by randomizing state transitions and rewards within Markov Decision Processes. To facilitate this extensive meta-training, we propose two key innovations: 1. An efficient data synthesis pipeline for ICRL, which leverages the interaction histories of diverse behavior policies; and 2. A novel modeling framework that integrates both imitation learning and reinforcement learning (RL) within the context, by incorporating prior knowledge. For the first time, we demonstrate that in-context learning (ICL) alone, without any gradient-based fine-tuning, can successfully tackle unseen Gymnasium tasks through imitation learning, online RL, or offline RL. Additionally, we show that achieving generalized ICRL capabilities (unlike task-identification-oriented few-shot learning) critically depends on long trajectories generated by varied tasks and diverse behavior policies. By emphasizing the potential of ICL and departing from pre-training focused on acquiring specific skills, we further underscore the significance of meta-training aimed at cultivating the ability of ICL itself.

[AI-38] A Systematic Approach for Assessing Large Language Models' Test Case Generation Capability

链接: https://arxiv.org/abs/2502.02866
作者: Hung-Fu Chang,Mohammad Shokrolah Shirazi
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Software testing ensures the quality and reliability of software products, but manual test case creation is labor-intensive. With the rise of large language models (LLMs), there is growing interest in unit test creation with LLMs. However, effective assessment of LLM-generated test cases is limited by the lack of standardized benchmarks that comprehensively cover diverse programming scenarios. To address the assessment of LLMs' test case generation ability and the lack of datasets for evaluation, we propose the Generated Benchmark from Control-Flow Structure and Variable Usage Composition (GBCV) approach, which systematically generates programs used for evaluating LLMs' test generation capabilities. By leveraging basic control-flow structures and variable usage, GBCV provides a flexible framework to create a spectrum of programs ranging from simple to complex. Because GPT-4o and GPT-3-Turbo are publicly accessible models, we use GBCV to assess their performance, reflecting the use case of a typical real-world user. Our findings indicate that GPT-4o performs better on complex program structures, while all models effectively detect boundary values in simple conditions but face challenges with arithmetic computations. This study highlights the strengths and limitations of LLMs in test generation, provides a benchmark framework, and suggests directions for future improvement.

[AI-39] OceanChat: The Effect of Virtual Conversational AI Agents on Sustainable Attitude and Behavior Change

链接: https://arxiv.org/abs/2502.02863
作者: Pat Pataranutaporn,Alexander Doudkin,Pattie Maes
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 21 pages, 18 figures, 2 tables

点击查看摘要

Abstract:Marine ecosystems face unprecedented threats from climate change and plastic pollution, yet traditional environmental education often struggles to translate awareness into sustained behavioral change. This paper presents OceanChat, an interactive system leveraging large language models to create conversational AI agents represented as animated marine creatures – specifically a beluga whale, a jellyfish, and a seahorse – designed to promote pro-environmental behavior (PEB) and foster awareness through personalized dialogue. Through a between-subjects experiment (N=900), we compared three conditions: (1) Static Scientific Information, providing conventional environmental education through text and images; (2) Static Character Narrative, featuring first-person storytelling from 3D-rendered marine creatures; and (3) Conversational Character Narrative, enabling real-time dialogue with AI-powered marine characters. Our analysis revealed that the Conversational Character Narrative condition significantly increased behavioral intentions and sustainable choice preferences compared to static approaches. The beluga whale character demonstrated consistently stronger emotional engagement across multiple measures, including perceived anthropomorphism and empathy. However, impacts on deeper measures like climate policy support and psychological distance were limited, highlighting the complexity of shifting entrenched beliefs. Our work extends research on sustainability interfaces facilitating PEB and offers design principles for creating emotionally resonant, context-aware AI characters. By balancing anthropomorphism with species authenticity, OceanChat demonstrates how interactive narratives can bridge the gap between environmental knowledge and real-world behavior change.

[AI-40] Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning ICML2025

链接: https://arxiv.org/abs/2502.02844
作者: Sunwoo Lee,Jaebak Hwang,Yonghyeon Jo,Seungyul Han
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
*备注: 8 pages main, 21 pages appendix with reference. Submitted to ICML 2025

点击查看摘要

Abstract:Traditional robust methods in multi-agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack-Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering system-wide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL.

[AI-41] Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks ICML2025

链接: https://arxiv.org/abs/2502.02834
作者: Jeongmo Kim,Yisak Park,Minung Kim,Seungyul Han
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages main paper, 19 pages appendices with reference, Submitted to ICML 2025

点击查看摘要

Abstract:Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments.

[AI-42] Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization

链接: https://arxiv.org/abs/2502.02810
作者: Chanhui Lee,Yuheon Song,YongJun Jeong,Hanbum Ko,Rodrigo Hormazabal,Sehui Han,Kyunghoon Bae,Sungbin Lim,Sungwoong Kim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have motivated the development of general LLMs for molecular tasks. While several studies have demonstrated that fine-tuned LLMs can achieve impressive benchmark performances, they are far from genuine generalist molecular LLMs due to a lack of fundamental understanding of molecular structure. Specifically, when given molecular task instructions, LLMs trained with naive next-token prediction training assign similar likelihood scores to both original and negatively corrupted molecules, revealing their lack of molecular structure understanding, which is crucial for reliable and general molecular LLMs. To overcome this limitation and obtain a true generalist molecular LLM, we introduce a novel multi-modal training method based on thorough multi-modal instruction tuning as well as a molecular structure preference optimization between chosen and rejected graphs. On various molecular benchmarks, the proposed generalist molecular LLM, called Mol-LLM, achieves state-of-the-art performance among generalist LLMs on most tasks, while surpassing or matching state-of-the-art specialist LLMs. Moreover, Mol-LLM also shows superior generalization performance in reaction prediction tasks, demonstrating the effect of molecular structure understanding on generalization.
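
A minimal sketch of a DPO-style preference objective over chosen and rejected inputs (a standard formulation; whether Mol-LLM uses exactly this form is an assumption based on the abstract's chosen-vs-rejected-graph description):

```python
# DPO-style preference loss: push the policy to score the original molecule
# above its negatively corrupted counterpart, relative to a reference model.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()

# toy tensors standing in for sequence log-likelihoods
lp_c, lp_r = torch.tensor([-3.0]), torch.tensor([-3.2])
ref_c, ref_r = torch.tensor([-3.1]), torch.tensor([-3.1])
print(preference_loss(lp_c, lp_r, ref_c, ref_r))
```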

[AI-43] Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting

链接: https://arxiv.org/abs/2502.02797
作者: Sunny Sanyal,Hayden Prairie,Rudrajit Das,Ali Kavis,Sujay Sanghavi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 49 pages, 4 figures, 12 tables. Code available at this https URL

点击查看摘要

Abstract:Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as “catastrophic forgetting”. This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a sample weighting scheme for the fine-tuning data solely based on the pre-trained model’s losses. Specifically, we upweight the easy samples on which the pre-trained model’s loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a 0.8% drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving 5.4% more accuracy on the pre-training datasets. Our code is publicly available at this https URL .
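
A minimal sketch of loss-based sample weighting (the exponential form below is an illustrative choice, not necessarily the paper's exact scheme):

```python
# Weight fine-tuning samples by the frozen pre-trained model's loss:
# low loss (easy sample) -> high weight, limiting drift from the pre-trained model.
import numpy as np

def sample_weights(pretrained_losses, temperature=1.0):
    w = np.exp(-np.asarray(pretrained_losses) / temperature)
    return w / w.sum()

losses = [0.2, 1.5, 0.4, 3.0]      # per-sample loss under the frozen model
print(sample_weights(losses))       # easy samples dominate the weighted average
```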

[AI-44] Inducing Diversity in Differentiable Search Indexing

链接: https://arxiv.org/abs/2502.02788
作者: Abhijeet Phatak,Jayant Sachdev,Sean D Rosario,Swati Kirti,Chittaranjan Tripathy
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Differentiable Search Indexing (DSI) is a recent paradigm for information retrieval which uses a transformer-based neural network architecture as the document index to simplify the retrieval process. A differentiable index has many advantages enabling modifications, updates or extensions to the index. In this work, we explore balancing relevance and novel information content (diversity) for training DSI systems inspired by Maximal Marginal Relevance (MMR), and show the benefits of our approach over naive DSI training. We present quantitative and qualitative evaluations of relevance and diversity measures obtained using our method on NQ320K and MSMARCO datasets in comparison to naive DSI. With our approach, it is possible to achieve diversity without any significant impact on relevance. Since we induce diversity while training DSI, the trained model has learned to diversify while being relevant. This obviates the need for a post-processing step to induce diversity in the recall set as typically performed using MMR. Our approach will be useful for Information Retrieval problems where both relevance and diversity are important, such as in sub-topic retrieval. Our work can also be easily extended to incremental DSI settings, which would enable fast updates to the index while retrieving a diverse recall set.
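
A minimal sketch of classical Maximal Marginal Relevance (MMR), the relevance/diversity trade-off that inspires this training objective:

```python
# MMR: greedily pick documents maximizing
# lam * relevance(query, doc) - (1 - lam) * max similarity to already-selected docs.
import numpy as np

def mmr(query, docs, lam=0.7, k=3):
    sim = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    selected, rest = [], list(range(len(docs)))
    while rest and len(selected) < k:
        def score(i):
            redundancy = max((sim(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * sim(query, docs[i]) - (1 - lam) * redundancy
        best = max(rest, key=score)
        selected.append(best)
        rest.remove(best)
    return selected

docs = np.random.randn(10, 16)
print(mmr(np.random.randn(16), docs))
```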

[AI-45] Classroom Simulacra: Building Contextual Student Generative Agents in Online Education for Learning Behavioral Simulation

链接: https://arxiv.org/abs/2502.02780
作者: Songlin Xu,Hao-Ning Wen,Hongyi Pan,Dallas Dominguez,Dongyin Hu,Xinyu Zhang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 26 pages

点击查看摘要

Abstract:Student simulation supports educators to improve teaching by interacting with virtual students. However, most existing approaches ignore the modulation effects of course materials because of two challenges: the lack of datasets with granularly annotated course materials, and the limitation of existing simulation models in processing extremely long textual data. To solve the challenges, we first run a 6-week education workshop with N = 60 students to collect fine-grained data using a custom-built online education system, which logs students’ learning behaviors as they interact with lecture materials over time. Second, we propose a transferable iterative reflection (TIR) module that augments both prompting-based and finetuning-based large language models (LLMs) for simulating learning behaviors. Our comprehensive experiments show that TIR enables the LLMs to perform more accurate student simulation than classical deep learning models, even with limited demonstration data. Our TIR approach better captures the granular dynamism of learning performance and inter-student correlations in classrooms, paving the way towards a "digital twin" for online education.

[AI-46] Cross-Modality Embedding of Force and Language for Natural Human-Robot Communication

链接: https://arxiv.org/abs/2502.02772
作者: Ravi Tejwani,Karl Velazquez,John Payne,Paolo Bonato,Harry Asada
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Under review in RSS 2025

点击查看摘要

Abstract:A method for cross-modality embedding of force profile and words is presented for synergistic coordination of verbal and haptic communication. When two people carry a large, heavy object together, they coordinate through verbal communication about the intended movements and physical forces applied to the object. This natural integration of verbal and physical cues enables effective coordination. Similarly, human-robot interaction could achieve this level of coordination by integrating verbal and haptic communication modalities. This paper presents a framework for embedding words and force profiles in a unified manner, so that the two communication modalities can be integrated and coordinated in a way that is effective and synergistic. Here, it will be shown that, although language and physical force profiles are deemed completely different, the two can be embedded in a unified latent space and proximity between the two can be quantified. In this latent space, a force profile and words can a) supplement each other, b) integrate the individual effects, and c) substitute in an exchangeable manner. First, the need for cross-modality embedding is addressed, and the basic architecture and key building block technologies are presented. Methods for data collection and implementation challenges will be addressed, followed by experimental results and discussions.

[AI-47] Planning with affordances: Integrating learned affordance models and symbolic planning

链接: https://arxiv.org/abs/2502.02768
作者: Rajesh Mangannavar
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Intelligent agents working in real-world environments must be able to learn about the environment and its capabilities, which enable them to take actions to change the state of the world to complete a complex multi-step task in a photorealistic environment. Learning about the environment is especially important for performing various multi-step tasks without having to redefine an agent’s action set for different tasks or environment settings. In our work, we augment an existing task and motion planning framework with learned affordance models of objects in the world to enable planning and executing multi-step tasks using learned models. Each task can be seen as changing the current state of the world to a given goal state. The affordance models provide us with what actions are possible and how to perform those actions in any given state. A symbolic planning algorithm uses this information and the starting and goal state to create a feasible plan to reach the desired goal state to complete a given task. We demonstrate our approach in a virtual 3D photorealistic environment, AI2-Thor, and evaluate it on real-world tasks. Our results show that our agent quickly learns how to interact with the environment and is well prepared to perform tasks such as “Moving an object out of the way to reach the desired location.”

[AI-48] PatchPilot: A Stable and Cost-Efficient Agentic Patching Framework

链接: https://arxiv.org/abs/2502.02747
作者: Hongwei Li,Yuheng Tang,Shiqi Wang,Wenbo Guo
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Recent research builds various patching agents that combine large language models (LLMs) with non-ML tools and achieve promising results on the state-of-the-art (SOTA) software patching benchmark, SWE-Bench. Based on how they determine the patching workflow, existing patching agents can be categorized as agent-based planning methods, which rely on LLMs for planning, and human-based planning methods, which follow a pre-defined workflow. At a high level, agent-based planning methods achieve high patching performance but with a high cost and limited stability. Human-based planning methods, on the other hand, are more stable and efficient but have key workflow limitations that compromise their patching performance. In this paper, we propose PatchPilot, an agentic patcher that strikes a balance between patching efficacy, stability, and cost-efficiency. PatchPilot proposes a novel human-based planning workflow with five components: reproduction, localization, generation, validation, and refinement (where refinement is unique to PatchPilot). We introduce novel and customized designs to each component to optimize their effectiveness and efficiency. Through extensive experiments on the SWE-Bench benchmarks, PatchPilot shows superior performance compared to existing open-source methods while maintaining low cost (less than $1 per instance) and ensuring higher stability. We also conduct a detailed ablation study to validate the key designs in each component.

[AI-49] Vision-Language Model Dialog Games for Self-Improvement

链接: https://arxiv.org/abs/2502.02740
作者: Ksenia Konyushkova,Christos Kaplanis,Serkan Cabi,Misha Denil
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing demand for high-quality, diverse training data poses a significant bottleneck in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a novel and scalable self-improvement framework for VLMs. Our approach leverages self-play between two agents engaged in a goal-oriented play centered around image identification. By filtering for successful game interactions, we automatically curate a high-quality dataset of interleaved images and text. We demonstrate that fine-tuning on this synthetic data leads to performance gains on downstream tasks and generalises across datasets. Moreover, as the improvements in the model lead to better game play, this procedure can be applied iteratively. This work paves the way for self-improving VLMs, with potential applications in various real-world scenarios especially when the high-quality multimodal data is scarce.

[AI-50] Parameter Tracking in Federated Learning with Adaptive Optimization

链接: https://arxiv.org/abs/2502.02727
作者: Evan Chen,Jianing Zhang,Shiqiang Wang,Chaoyue Liu,Christopher Brinton
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:In Federated Learning (FL), model training performance is strongly impacted by data heterogeneity across clients. Gradient Tracking (GT) has recently emerged as a solution which mitigates this issue by introducing correction terms to local model updates. To date, GT has only been considered under Stochastic Gradient Descent (SGD)-based model training, while modern FL frameworks increasingly employ adaptive optimizers for improved convergence. In this work, we generalize the GT framework to a more flexible Parameter Tracking (PT) paradigm and propose two novel adaptive optimization algorithms, FAdamET and FAdamGT, that integrate PT into Adam-based FL. We provide a rigorous convergence analysis of these algorithms under non-convex settings. Our experimental results demonstrate that both proposed algorithms consistently outperform existing methods when evaluating total communication cost and total computation cost across varying levels of data heterogeneity, showing the effectiveness of correcting first-order information in federated adaptive optimization.
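
A minimal SGD-form sketch of the tracking correction (SCAFFOLD-style control variates; the paper's FAdamET/FAdamGT integrate the correction into Adam, which this sketch omits):

```python
# Local client step with a tracking correction: the raw local gradient is
# adjusted toward the global update direction to counter data heterogeneity.
import numpy as np

def tracked_local_step(params, grad, c_local, c_global, lr=0.1):
    return params - lr * (grad - c_local + c_global)

# toy usage: a biased local gradient steered by the server's average direction
params = np.zeros(4)
grad = np.array([1.0, 0.0, 0.0, 0.0])          # biased local gradient
c_local = np.array([0.5, 0.0, 0.0, 0.0])       # client's control variate
c_global = np.array([0.25, 0.25, 0.25, 0.25])  # server's average direction
print(tracked_local_step(params, grad, c_local, c_global))
```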

[AI-51] An Analysis of LLM Fine-Tuning and Few-Shot Learning for Flaky Test Detection and Classification

链接: https://arxiv.org/abs/2502.02715
作者: Riddhi More,Jeremy S. Bradbury
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Flaky tests exhibit non-deterministic behavior during execution and they may pass or fail without any changes to the program under test. Detecting and classifying these flaky tests is crucial for maintaining the robustness of automated test suites and ensuring the overall reliability and confidence in the testing. However, flaky test detection and classification is challenging due to the variability in test behavior, which can depend on environmental conditions and subtle code interactions. Large Language Models (LLMs) offer promising approaches to address this challenge, with fine-tuning and few-shot learning (FSL) emerging as viable techniques. With enough data, fine-tuning a pre-trained LLM can achieve high accuracy, making it suitable for organizations with more resources. Alternatively, we introduce FlakyXbert, an FSL approach that employs a Siamese network architecture to train efficiently with limited data. To understand the performance and cost differences between these two methods, we compare fine-tuning on larger datasets with FSL in scenarios restricted by smaller datasets. Our evaluation involves two existing flaky test datasets, FlakyCat and IDoFT. Our results suggest that while fine-tuning can achieve high accuracy, FSL provides a cost-effective approach with competitive accuracy, which is especially beneficial for organizations or projects with limited historical data available for training. These findings underscore the viability of both fine-tuning and FSL in flaky test detection and classification, with each suited to different organizational needs and resource availability.

[AI-52] Practically Effective Adjustment Variable Selection in Causal Inference

链接: https://arxiv.org/abs/2502.02701
作者: Atsushi Noda,Takashi Isozaki
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME)
*备注: 20 pages, 8 figures

点击查看摘要

Abstract:In the estimation of causal effects, one common method for removing the influence of confounders is to adjust the variables that satisfy the back-door criterion. However, it is not always possible to uniquely determine sets of such variables. Moreover, real-world data is almost always limited, which means it may be insufficient for statistical estimation. Therefore, we propose criteria for selecting variables from a list of candidate adjustment variables, along with an algorithm to prevent accuracy degradation in causal effect estimation. We initially focus on directed acyclic graphs (DAGs) and then outline specific steps for applying this method to completed partially directed acyclic graphs (CPDAGs). We also present and prove a theorem on the possibility of computing causal effects in CPDAGs. Finally, we demonstrate the practical utility of our method using both existing and artificial data.

[AI-53] Efficient Implementation of the Global Cardinality Constraint with Costs

链接: https://arxiv.org/abs/2502.02688
作者: Margaux Schmied,Jean-Charles Regin
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
*备注: Published at the 30th International Conference on Principles and Practice of Constraint Programming (CP 2024)

点击查看摘要

Abstract:The success of Constraint Programming relies partly on the global constraints and implementation of the associated filtering algorithms. Recently, new ideas emerged to improve these implementations in practice, especially regarding the all different constraint. In this paper, we consider the cardinality constraint with costs. The cardinality constraint is a generalization of the all different constraint that specifies the number of times each value must be taken by a given set of variables in a solution. The version with costs introduces an assignment cost and bounds the total sum of assignment costs. The arc consistency filtering algorithm of this constraint is difficult to use in practice, as it systematically searches for many shortest paths. We propose a new approach that works with upper bounds on shortest paths based on landmarks. This approach can be seen as a preprocessing. It is fast and avoids, in practice, a large number of explicit computations of shortest paths.
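
A minimal sketch of the landmark bound (undirected graphs for simplicity): by the triangle inequality, d(u, v) <= d(L, u) + d(L, v) for any landmark L, so precomputed single-source distances yield cheap per-query upper bounds without running a new shortest-path search:

```python
# Landmark-based upper bound on the shortest distance between two nodes,
# using distances precomputed from each landmark to every node.
def landmark_upper_bound(landmark_dists, u, v):
    """landmark_dists: list of dicts, each mapping node -> distance from one landmark."""
    return min(d[u] + d[v] for d in landmark_dists)

# toy usage on a path graph a - b - c - d with a single landmark at a
dist_from_a = {"a": 0, "b": 1, "c": 2, "d": 3}
print(landmark_upper_bound([dist_from_a], "b", "d"))  # 4 (true distance is 2)
```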

[AI-54] MedRAX: Medical Reasoning Agent for Chest X-ray

链接: https://arxiv.org/abs/2502.02673
作者: Adibvafa Fallahpour,Jun Ma,Alif Munim,Hongwei Lyu,Bo Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 11 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems. Data and code are publicly available at this https URL

[AI-55] Fully Autonomous AI Agents Should Not be Developed

链接: https://arxiv.org/abs/2502.02649
作者: Margaret Mitchell,Avijit Ghosh,Alexandra Sasha Luccioni,Giada Pistilli
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper argues that fully autonomous AI agents should not be developed. In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels and detail the ethical values at play in each, documenting trade-offs in potential benefits and risks. Our analysis reveals that risks to people increase with the autonomy of a system: The more control a user cedes to an AI agent, the more risks to people arise. Particularly concerning are safety risks, which affect human life and impact further values.

[AI-56] e-SimFT: Alignment of Generative Models with Simulation Feedback for Pareto-Front Design Exploration

链接: https://arxiv.org/abs/2502.02628
作者: Hyunmin Cheong,Mohammadmehdi Ataei,Amir Hosein Khasahmadi,Pradeep Kumar Jayaraman
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep generative models have recently shown success in solving complex engineering design problems where models predict solutions that address the design requirements specified as input. However, there remains a challenge in aligning such models for effective design exploration. For many design problems, finding a solution that meets all the requirements is infeasible. In such a case, engineers prefer to obtain a set of Pareto optimal solutions with respect to those requirements, but uniform sampling of generative models may not yield a useful Pareto front. To address this gap, we introduce a new framework for Pareto-front design exploration with simulation fine-tuned generative models. First, the framework adopts preference alignment methods developed for Large Language Models (LLMs) and showcases the first application in fine-tuning a generative model for engineering design. The important distinction here is that we use a simulator instead of humans to provide accurate and scalable feedback. Next, we propose epsilon-sampling, inspired by the epsilon-constraint method used for Pareto-front generation with classical optimization algorithms, to construct a high-quality Pareto front with the fine-tuned models. Our framework, named e-SimFT, is shown to produce better-quality Pareto fronts than existing multi-objective alignment methods.
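
A minimal sketch of the classical epsilon-constraint idea behind epsilon-sampling: sweep a bound on one objective while optimizing the other, tracing points along the Pareto front (here over a finite candidate set, for illustration):

```python
# Epsilon-constraint method: for each epsilon, minimize f1 subject to f2 <= epsilon.
# Sweeping epsilon traces out points along the Pareto front.
def epsilon_constraint_front(candidates, f1, f2, epsilons):
    front = []
    for eps in epsilons:
        feasible = [c for c in candidates if f2(c) <= eps]
        if feasible:
            front.append(min(feasible, key=f1))
    return front

cands = [x / 20.0 for x in range(21)]
front = epsilon_constraint_front(cands, f1=lambda x: (x - 1) ** 2,
                                 f2=lambda x: x ** 2,
                                 epsilons=[0.1, 0.4, 0.7, 1.0])
print(front)  # designs sweeping along the trade-off between the two objectives
```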

[AI-57] Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances

链接: https://arxiv.org/abs/2502.02623
作者: German Martinez Matilla,Jakub Marecek
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Sample complexity of bias estimation is a lower bound on the runtime of any bias detection method. Many regulatory frameworks require the bias to be tested for all subgroups, whose number grows exponentially with the number of protected attributes. Unless one wishes to run a bias detection with a doubly-exponential run-time, one should like to have polynomial complexity of bias detection for a single subgroup. At the same time, the reference data may be based on surveys, and thus come with non-trivial uncertainty. Here, we reformulate bias detection as a point-to-subspace problem on the space of measures and show that, for supremum norm, it can be subsampled efficiently. In particular, our probabilistically approximately correct (PAC) results are corroborated by tests on well-known instances. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST) Cite as: arXiv:2502.02623 [cs.LG] (or arXiv:2502.02623v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2502.02623 Focus to learn more arXiv-issued DOI via DataCite

[AI-58] PolarQuant: Quantizing KV Caches with Polar Transformation

链接: https://arxiv.org/abs/2502.02617
作者: Insu Han,Praneeth Kacham,Amin Karbasi,Vahab Mirrokni,Amir Zandieh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) require significant memory to store Key-Value (KV) embeddings in their KV cache, especially when handling long-range contexts. Quantization of these KV embeddings is a common technique to reduce memory consumption. This work introduces PolarQuant, a novel quantization method employing random preconditioning and polar transformation. Our method transforms the KV embeddings into polar coordinates using an efficient recursive algorithm and then quantizes the resulting angles. Our key insight is that, after random preconditioning, the angles in the polar representation exhibit a tightly bounded and highly concentrated distribution with an analytically computable form. This distribution eliminates the need for explicit normalization, a step required by traditional quantization methods, which introduces significant memory overhead because quantization parameters (e.g., zero point and scale) must be stored in full precision per data block. PolarQuant bypasses this normalization step, enabling substantial memory savings. The long-context evaluation demonstrates that PolarQuant compresses the KV cache by over 4.2x while achieving the best quality scores compared to the state-of-the-art methods.
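
A minimal sketch of the polar idea on coordinate pairs (PolarQuant's recursive transform and random preconditioning are omitted; the pairwise scheme below is an illustrative simplification):

```python
# Split a vector into (x, y) pairs, keep the radii, and quantize the angles
# uniformly; dequantization reconstructs coordinates from the coded angles.
import numpy as np

def polar_quantize(v, bits=4):
    x, y = v[0::2], v[1::2]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)                          # angles in (-pi, pi]
    levels = 2 ** bits
    q = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(np.int8)
    return r, q, levels

def polar_dequantize(r, q, levels):
    theta = q / (levels - 1) * 2 * np.pi - np.pi
    out = np.empty(2 * len(r))
    out[0::2], out[1::2] = r * np.cos(theta), r * np.sin(theta)
    return out

v = np.random.randn(8)
r, q, L = polar_quantize(v)
print(np.max(np.abs(v - polar_dequantize(r, q, L))))  # small angle-quantization error
```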

[AI-59] Reconstructing 3D Flow from 2D Data with Diffusion Transformer

链接: https://arxiv.org/abs/2502.02593
作者: Fan Lei
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Fluid flow is a widely encountered physical problem, crucial in various fields. Due to the highly nonlinear and chaotic nature of fluids, analyzing fluid-related problems is exceptionally challenging. Computational fluid dynamics (CFD) is the best tool for this analysis but involves significant computational resources, especially for 3D simulations, which are slow and resource-intensive. In experimental fluid dynamics, the cost of particle image velocimetry (PIV) increases with dimensionality. Reconstructing 3D flow fields from 2D PIV data could reduce costs and expand application scenarios. Here, we propose a Diffusion Transformer-based method for reconstructing 3D flow fields from 2D flow data. By embedding the positional information of 2D planes into the model, we enable the reconstruction of 3D flow fields from any combination of 2D slices, enhancing flexibility. We replace global attention with window and plane attention to reduce computational costs associated with higher dimensions without compromising performance. Our experiments demonstrate that our model can efficiently and accurately reconstruct 3D flow fields from 2D data, producing realistic results.

[AI-60] HadamRNN: Binary and Sparse Ternary Orthogonal RNNs

链接: https://arxiv.org/abs/2502.00047
作者: Armand Foucault(IMT, ANITI),Franck Mamalet(ANITI),François Malgouyres(IMT)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Binary and sparse ternary weights in neural networks enable faster computations and lighter representations, facilitating their use on edge devices with limited computational power. Meanwhile, vanilla RNNs are highly sensitive to changes in their recurrent weights, making the binarization and ternarization of these weights inherently challenging. To date, no method has successfully achieved binarization or ternarization of vanilla RNN weights. We present a new approach leveraging the properties of Hadamard matrices to parameterize a subset of binary and sparse ternary orthogonal matrices. This method enables the training of orthogonal RNNs (ORNNs) with binary and sparse ternary recurrent weights, effectively creating a specific class of binary and sparse ternary vanilla RNNs. The resulting ORNNs, called HadamRNN and lock-HadamRNN, are evaluated on benchmarks such as the copy task, permuted and sequential MNIST tasks, and IMDB dataset. Despite binarization or sparse ternarization, these RNNs maintain performance levels comparable to state-of-the-art full-precision models, highlighting the effectiveness of our approach. Notably, our approach is the first solution with binary recurrent weights capable of tackling the copy task over 1000 timesteps.
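The structural fact this approach builds on is easy to verify: a Hadamard matrix has entries in {-1, +1} and, scaled by 1/sqrt(n), is orthogonal. Below is a minimal check using the Sylvester construction; it illustrates the underlying property only, not the paper's full parameterization of the recurrent weights.

```python
import numpy as np

def sylvester_hadamard(k):
    """Sylvester construction: a 2**k x 2**k Hadamard matrix with +-1 entries."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

H = sylvester_hadamard(3)           # 8 x 8 matrix, entries in {-1, +1}
Q = H / np.sqrt(H.shape[0])         # scaled Hadamard matrix is orthogonal
print(np.allclose(Q @ Q.T, np.eye(8)))  # True: binary (up to scale) orthogonal recurrence
```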

[AI-61] AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement

链接: https://arxiv.org/abs/2501.15417
作者: Junan Zhang,Jing Yang,Zihao Fang,Yuancheng Wang,Zehua Zhang,Zhuo Wang,Fan Fan,Zhizheng Wu
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker’s timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at this https URL.

[AI-62] A Beam's Eye View to Fluence Maps 3D Network for Ultra Fast VMAT Radiotherapy Planning

链接: https://arxiv.org/abs/2502.03360
作者: Simon Arberet,Florin C. Ghesu,Riqiang Gao,Martin Kraus,Jonathan Sackett,Esa Kuusela,Ali Kamen
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Volumetric Modulated Arc Therapy (VMAT) revolutionizes cancer treatment by precisely delivering radiation while sparing healthy tissues. Fluence map generation, crucial in VMAT planning, traditionally involves complex, iterative, and thus time-consuming processes. These fluence maps are subsequently leveraged for leaf sequencing. The deep-learning approach presented in this article aims to expedite this by directly predicting fluence maps from patient data. We developed a 3D network which we trained in a supervised way using a combination of L1 and L2 losses and RT plans generated by Eclipse and from the REQUITE dataset, taking the RT dose map as input and the fluence maps computed from the corresponding RT plans as target. Our network jointly predicts the 180 fluence maps corresponding to the 180 control points (CP) of single-arc VMAT plans. In order to help the network, we pre-process the input dose by computing the projections of the 3D dose map to the beam's eye view (BEV) of the 180 CPs, in the same coordinate system as the fluence maps. We generated over 2000 VMAT plans using Eclipse to scale up the dataset size. Additionally, we evaluated various network architectures and analyzed the impact of increasing the dataset size. We measure performance in the 2D fluence map domain using image metrics (PSNR, SSIM), as well as in the 3D dose domain using the dose-volume histogram (DVH), on a validation dataset. Network inference, which does not include data loading and processing, takes less than 20 ms. Using our proposed 3D network architecture as well as increasing the dataset size using Eclipse improved the fluence map reconstruction performance by approximately 8 dB in PSNR compared to a U-Net architecture trained on the original REQUITE dataset. The resulting DVHs are very close to those of the input target dose.

[AI-63] Adaptive Variational Inference in Probabilistic Graphical Models: Beyond Bethe, Tree-Reweighted and Convex Free Energies UAI

链接: https://arxiv.org/abs/2502.03341
作者: Harald Leisenberger,Franz Pernkopf
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work has been submitted to the Conference on Uncertainty in Artificial Intelligence (UAI) 2025 for possible publication

点击查看摘要

Abstract:Variational inference in probabilistic graphical models aims to approximate fundamental quantities such as marginal distributions and the partition function. Popular approaches are the Bethe approximation, tree-reweighted, and other types of convex free energies. These approximations are efficient but can fail if the model is complex and highly interactive. In this work, we analyze two classes of approximations that include the above methods as special cases: first, if the model parameters are changed; and second, if the entropy approximation is changed. We discuss benefits and drawbacks of either approach, and deduce from this analysis how a free energy approximation should ideally be constructed. Based on our observations, we propose approximations that automatically adapt to a given model and demonstrate their effectiveness for a range of difficult problems.

[AI-64] Astromer 2

链接: https://arxiv.org/abs/2502.02717
作者: Cristobal Donoso-Oliva,Ignacio Becker,Pavlos Protopapas,Guillermo Cabrera-Vives,Martina Cádiz-Leyton,Daniel Moreno-Cartagena
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 17 figures

点击查看摘要

Abstract:Foundational models have emerged as a powerful paradigm in the deep learning field, leveraging their capacity to learn robust representations from large-scale datasets and transfer effectively to diverse downstream applications such as classification. In this paper, we present Astromer 2, a foundational model specifically designed for extracting light curve embeddings. We introduce Astromer 2 as an enhanced iteration of our self-supervised model for light curve analysis. This paper highlights the advantages of its pre-trained embeddings, compares its performance with that of its predecessor, Astromer 1, and provides a detailed empirical analysis of its capabilities, offering deeper insights into the model's representations. Astromer 2 is pretrained on 1.5 million single-band light curves from the MACHO survey using a self-supervised learning task that predicts randomly masked observations within sequences. Fine-tuning on a smaller labeled dataset allows us to assess its performance in classification tasks. The quality of the embeddings is measured by the F1 score of an MLP classifier trained on Astromer-generated embeddings. Our results demonstrate that Astromer 2 significantly outperforms Astromer 1 across all evaluated scenarios, including limited datasets of 20, 100, and 500 samples per class. The use of weighted per-sample embeddings, which integrate intermediate representations from Astromer's attention blocks, is particularly impactful. Notably, Astromer 2 achieves a 15% improvement in F1 score on the ATLAS dataset compared to prior models, showcasing robust generalization to new datasets. This enhanced performance, especially with minimal labeled data, underscores the potential of Astromer 2 for more efficient and scalable light curve analysis.
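As a rough sketch of the masked-observation pretraining objective described above, the loss below scores a model only on randomly masked positions of a light curve. The arrays and masking rate are illustrative assumptions; in the real setup the masked values are hidden from the model's input rather than compared after the fact.

```python
import numpy as np

def masked_mse(sequence, predictions, mask_frac=0.15, rng=None):
    """MSE restricted to randomly masked positions: the model is only
    rewarded for infilling the observations it could not see."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(len(sequence)) < mask_frac
    mask[0] = True  # ensure at least one masked position
    return np.mean((predictions[mask] - sequence[mask]) ** 2)

lc = np.sin(np.linspace(0, 10, 200))   # toy light curve
preds = lc + 0.05                      # stand-in for model outputs
print(masked_mse(lc, preds, rng=np.random.default_rng(2)))
```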

[AI-65] scBIT: Integrating Single-cell Transcriptomic Data into fMRI-based Prediction for Alzheimer's Disease Diagnosis

链接: https://arxiv.org/abs/2502.02630
作者: Yu-An Huang,Yao Hu,Yue-Chao Li,Xiyue Cao,Xinyuan Li,Kay Chen Tan,Zhu-Hong You,Zhi-An Huang
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 31 pages, 5 figures

点击查看摘要

Abstract:Functional MRI (fMRI) and single-cell transcriptomics are pivotal in Alzheimer's disease (AD) research, each providing unique insights into neural function and molecular mechanisms. However, integrating these complementary modalities remains largely unexplored. Here, we introduce scBIT, a novel method for enhancing AD prediction by combining fMRI with single-nucleus RNA (snRNA). scBIT leverages snRNA as an auxiliary modality, significantly improving fMRI-based prediction models and providing comprehensive interpretability. It employs a sampling strategy to segment snRNA data into cell-type-specific gene networks and utilizes a self-explainable graph neural network to extract critical subgraphs. Additionally, we use demographic and genetic similarities to pair snRNA and fMRI data across individuals, enabling robust cross-modal learning. Extensive experiments validate scBIT's effectiveness in revealing intricate brain region-gene associations and enhancing diagnostic prediction accuracy. By advancing brain imaging transcriptomics to the single-cell level, scBIT sheds new light on biomarker discovery in AD research. Experimental results show that incorporating snRNA data into the scBIT model significantly boosts accuracy, improving binary classification by 3.39% and five-class classification by 26.59%. The code is implemented in Python and has been released on GitHub (this https URL) and Zenodo (this https URL) with detailed instructions.

[AI-66] Graph Structure Learning for Tumor Microenvironment with Cell Type Annotation from non-spatial scRNA-seq data

链接: https://arxiv.org/abs/2502.02629
作者: Yu-An Huang,Yue-Chao Li,Hai-Ru You,Jie Pan,Xiyue Cao,Xinyuan Li,Zhi-An Huang,Zhu-Hong You
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 29 pages, 6 figures

点击查看摘要

Abstract:The exploration of cellular heterogeneity within the tumor microenvironment (TME) via single-cell RNA sequencing (scRNA-seq) is essential for understanding cancer progression and response to therapy. Current scRNA-seq approaches, however, lack spatial context and rely on incomplete datasets of ligand-receptor interactions (LRIs), limiting accurate cell type annotation and cell-cell communication (CCC) inference. This study addresses these challenges using a novel graph neural network (GNN) model that enhances cell type prediction and cell interaction analysis. Our study utilized a dataset consisting of 49,020 cells from 19 patients across three cancer types: Leukemia, Breast Invasive Carcinoma, and Colorectal Cancer. The proposed scGSL model demonstrated robust performance, achieving an average accuracy of 84.83%, precision of 86.23%, recall of 81.51%, and an F1 score of 80.92% across all datasets. These metrics represent a significant enhancement over existing methods, which typically exhibit lower performance metrics. Additionally, by reviewing existing literature on gene interactions within the TME, the scGSL model proves to robustly identify biologically meaningful gene interactions in an unsupervised manner, validated by significant expression differences in key gene pairs across various cancers. The source code and data used in this paper can be found in this https URL.

机器学习

[LG-0] Prediction of the Most Fire-Sensitive Point in Building Structures with Differentiable Agents for Thermal Simulators

链接: https://arxiv.org/abs/2502.03424
作者: Yuan Xinjie,Khalid M. Mosalam
类目: Machine Learning (cs.LG)
*备注: This paper is currently under review at Computer-Aided Civil and Infrastructure Engineering

点击查看摘要

Abstract:Fire safety is a critical area of research in civil and mechanical engineering, particularly in ensuring the structural stability of buildings during fire events. The Most Fire-Sensitive Point (MFSP) in a structure is the location where a fire would cause the greatest impact on structural stability. Accurate prediction of the MFSP is vital for streamlining structural assessments and optimizing the design process. This paper presents a novel framework for MFSP prediction using a neural network-based approach that integrates fire dynamics and finite element analysis through a differentiable agent model. The framework focuses on predicting the Maximum Interstory Drift Ratio (MIDR), a key indicator of structural performance under fire conditions. By leveraging the differentiable agent model, we efficiently generate labeled data for MFSP and directly train a predictor for this critical metric. To achieve this, we generated extensive simulation data encompassing structural and fire scenarios and employed graph neural networks to represent the building structures. Transfer learning was applied to optimize the training process, and an edge update mechanism was introduced to dynamically adjust edge attributes, reflecting property changes under fire conditions. The proposed model was rigorously evaluated on simulation data, demonstrating strong performance in accurately predicting both MIDR and MFSP, thus advancing fire safety analysis for building structures.

[LG-1] From Features to Transformers: Redefining Ranking for Scalable Impact

链接: https://arxiv.org/abs/2502.03417
作者: Fedor Borisyuk,Lars Hertel,Ganesh Parameswaran,Gaurav Srivastava,Sudarshan Srinivasa Ramanujam,Borja Ocejo,Peng Du,Andrei Akterskii,Neil Daftary,Shao Tang,Daqi Sun,Qiang Charles Xiao,Deepesh Nathani,Mohit Kothari,Yun Dai,Aman Gupta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present LiGR, a large-scale ranking framework developed at LinkedIn that brings state-of-the-art transformer-based modeling architectures into production. We introduce a modified transformer architecture that incorporates learned normalization and simultaneous set-wise attention to user history and ranked items. This architecture enables several breakthrough achievements, including: (1) the deprecation of most manually designed feature engineering, outperforming the prior state-of-the-art system using only a few features (compared to hundreds in the baseline), (2) validation of the scaling law for ranking systems, showing improved performance with larger models, more training data, and longer context sequences, and (3) simultaneous joint scoring of items in a set-wise manner, leading to automated improvements in diversity. To enable efficient serving of large ranking models, we describe techniques to scale inference effectively using single-pass processing of user history and set-wise attention. We also summarize key insights from various ablation studies and A/B tests, highlighting the most impactful technical approaches.

[LG-2] Deep Reinforcement Learning-Based Optimization of Second-Life Battery Utilization in Electric Vehicles Charging Stations

链接: https://arxiv.org/abs/2502.03412
作者: Rouzbeh Haghighi,Ali Hassan,Van-Hai Bui,Akhtar Hussain,Wencong Su
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 5 pages, 6 figures, Accepted, 2025 IEEE Power and Energy Society General Meeting (PESGM 2025), Austin, TX, USA

点击查看摘要

Abstract:The rapid rise in electric vehicle (EV) adoption presents significant challenges in managing the vast number of retired EV batteries. Research indicates that second-life batteries (SLBs) from EVs typically retain considerable residual capacity, offering extended utility. These batteries can be effectively repurposed for use in EV charging stations (EVCS), providing a cost-effective alternative to new batteries and reducing overall planning costs. Integrating battery energy storage systems (BESS) with SLBs into EVCS is a promising strategy to alleviate system overload. However, efficient operation of EVCS with integrated BESS is hindered by uncertainties such as fluctuating EV arrival and departure times and variable power prices from the grid. This paper presents a deep reinforcement learning-based (DRL) planning framework for EV charging stations with BESS, leveraging SLBs. We employ the advanced soft actor-critic (SAC) approach, training the model on a year’s worth of data to account for seasonal variations, including weekdays and holidays. A tailored reward function enables effective offline training, allowing real-time optimization of EVCS operations under uncertainty.

[LG-3] Detecting Strategic Deception Using Linear Probes WWW

链接: https://arxiv.org/abs/2502.03407
作者: Nicholas Goldowsky-Dill,Bilal Chughtai,Stefan Heimersheim,Marius Hobbhahn
类目: Machine Learning (cs.LG)
*备注: Website: this http URL Code: this http URL

点击查看摘要

Abstract:AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while its internal reasoning is misaligned. We thus evaluate if linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer et al., 2023) and purposely underperforming on safety evaluations (Benton et al., 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1% false positive rate on chat data not related to deception, our probe catches 95-99% of the deceptive responses. Overall, we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed at this http URL and our code at this http URL.
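A linear probe of this kind reduces to logistic regression on model activations plus a decision threshold calibrated to a target false-positive rate. The sketch below reproduces that recipe on synthetic data; the Gaussian clusters are an assumption standing in for real honest/deceptive activations, and none of this is the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
d = 64
honest = rng.normal(0.0, 1.0, (500, d))
deceptive = rng.normal(0.3, 1.0, (500, d))   # synthetic stand-in for activations
X = np.vstack([honest, deceptive])
y = np.array([0] * 500 + [1] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)
scores = probe.decision_function(X)
print("AUROC:", roc_auc_score(y, scores))

# Pick the decision threshold that yields a 1% false-positive rate,
# mirroring the calibration described in the abstract.
fpr, tpr, thresholds = roc_curve(y, scores)
idx = np.searchsorted(fpr, 0.01)
print("recall at 1% FPR:", tpr[idx])
```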

[LG-4] CAPE: Covariate-Adjusted Pre-Training for Epidemic Time Series Forecasting

链接: https://arxiv.org/abs/2502.03393
作者: Zewen Liu,Juntong Ni,Max S. Y. Lau,Wei Jin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate forecasting of epidemic infection trajectories is crucial for safeguarding public health. However, limited data availability during emerging outbreaks and the complex interaction between environmental factors and disease dynamics present significant challenges for effective forecasting. In response, we introduce CAPE, a novel epidemic pre-training framework designed to harness extensive disease datasets from diverse regions and integrate environmental factors directly into the modeling process for more informed decision-making on downstream diseases. Based on a covariate adjustment framework, CAPE utilizes pre-training combined with hierarchical environment contrasting to identify universal patterns across diseases while estimating latent environmental influences. We have compiled a diverse collection of epidemic time series datasets and validated the effectiveness of CAPE under various evaluation scenarios, including full-shot, few-shot, zero-shot, cross-location, and cross-disease settings, where it outperforms the leading baseline by an average of 9.9% in full-shot and 14.3% in zero-shot settings. The code will be released upon acceptance.

[LG-5] Explain Yourself Briefly! Self-Explaining Neural Networks with Concise Sufficient Reasons ICLR2025

链接: https://arxiv.org/abs/2502.03391
作者: Shahaf Bassan,Shlomit Gur,Ron Eliav
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: To appear in ICLR 2025

点击查看摘要

Abstract:Minimal sufficient reasons represent a prevalent form of explanation - the smallest subset of input features which, when held constant at their corresponding values, ensure that the prediction remains unchanged. Previous post-hoc methods attempt to obtain such explanations but face two main limitations: (1) Obtaining these subsets poses a computational challenge, leading most scalable methods to converge towards suboptimal, less meaningful subsets; (2) These methods heavily rely on sampling out-of-distribution input assignments, potentially resulting in counterintuitive behaviors. To tackle these limitations, we propose in this work a self-supervised training approach, which we term sufficient subset training (SST). Using SST, we train models to generate concise sufficient reasons for their predictions as an integral part of their output. Our results indicate that our framework produces succinct and faithful subsets substantially more efficiently than competing post-hoc methods, while maintaining comparable predictive performance.

[LG-6] A Structured Reasoning Framework for Unbalanced Data Classification Using Probabilistic Models

链接: https://arxiv.org/abs/2502.03386
作者: Junliang Du,Shiyu Dou,Bohuan Yang,Jiacheng Hu,Tai An
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies a Markov network model for unbalanced data, aiming to address the classification bias and poor minority-class recognition that traditional machine learning models exhibit under uneven class distributions. By constructing a joint probability distribution and conditional dependencies, the model achieves global modeling and optimized reasoning over sample categories. The study introduces marginal probability estimation and weighted loss optimization strategies, combined with regularization constraints and structured reasoning methods, effectively improving the generalization ability and robustness of the model. In the experiments, a real credit card fraud detection dataset was selected and the model was compared against logistic regression, support vector machines, random forests, and XGBoost. The results show that the Markov network performs well on metrics such as weighted accuracy, F1 score, and AUC-ROC, significantly outperforming traditional classification models and demonstrating strong decision-making ability and applicability in unbalanced data scenarios. Future research can focus on efficient model training, structural optimization, and deep learning integration in large-scale unbalanced data environments, and promote wide application in practical settings such as financial risk control, medical diagnosis, and intelligent monitoring.
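The weighted loss strategy mentioned above is a standard device for imbalanced classification and can be sketched in a few lines. The inverse-frequency weights and toy fraud labels below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, class_weights):
    """Cross-entropy where each sample is weighted by its class weight,
    so errors on the rare class cost proportionally more."""
    w = np.where(y_true == 1, class_weights[1], class_weights[0])
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)
    return np.mean(w * -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([0] * 95 + [1] * 5)                   # 95:5 imbalance, e.g. fraud labels
p = np.full(100, 0.05)                             # a lazy model predicting the base rate
inv_freq = {0: 100 / (2 * 95), 1: 100 / (2 * 5)}   # inverse-frequency weighting
print(weighted_log_loss(y, p, inv_freq))
```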

[LG-7] Energy-Efficient Flying LoRa Gateways: A Multi-Agent Reinforcement Learning Approach

链接: https://arxiv.org/abs/2502.03377
作者: Abdullahi Isa Ahmed,El Mehdi Amhoud
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures

点击查看摘要

Abstract:With the rapid development of next-generation Internet of Things (NG-IoT) networks, the increasing number of connected devices has led to a surge in power consumption. This rise in energy demand poses significant challenges to resource availability and raises sustainability concerns for large-scale IoT deployments. Efficient energy utilization in communication networks, particularly for power-constrained IoT devices, has thus become a critical area of research. In this paper, we deployed flying LoRa gateways (GWs) mounted on unmanned aerial vehicles (UAVs) to collect data from LoRa end devices (EDs) and transmit it to a central server. Our primary objective is to maximize the global system energy efficiency (EE) of wireless LoRa networks by joint optimization of transmission power (TP), spreading factor (SF), bandwidth (W), and ED association. To solve this challenging problem, we model the problem as a partially observable Markov decision process (POMDP), where each flying LoRa GW acts as a learning agent using a cooperative Multi-Agent Reinforcement Learning (MARL) approach under centralized training and decentralized execution (CTDE). Simulation results demonstrate that our proposed method, based on the multi-agent proximal policy optimization (MAPPO) algorithm, significantly improves the global system EE and surpasses the conventional MARL schemes.

[LG-8] SyMANTIC: An Efficient Symbolic Regression Method for Interpretable and Parsimonious Model Discovery in Science and Beyond

链接: https://arxiv.org/abs/2502.03367
作者: Madhav R. Muthyala,Farshud Sorourifar,You Peng,Joel A. Paulson
类目: Machine Learning (cs.LG)
*备注: Main and SI compiled into the PDF. Main: 48 pages, 7 figures; SI: 29 pages, 2 figures

点击查看摘要

Abstract:Symbolic regression (SR) is an emerging branch of machine learning focused on discovering simple and interpretable mathematical expressions from data. Although a wide variety of SR methods have been developed, they often face challenges such as high computational cost, poor scalability with respect to the number of input dimensions, fragility to noise, and an inability to balance accuracy and complexity. This work introduces SyMANTIC, a novel SR algorithm that addresses these challenges. SyMANTIC efficiently identifies (potentially several) low-dimensional descriptors from a large set of candidates (from ~10^5 to ~10^10 or more) through a unique combination of mutual information-based feature selection, adaptive feature expansion, and recursively applied ℓ0-based sparse regression. In addition, it employs an information-theoretic measure to produce an approximate set of Pareto-optimal equations, each offering the best-found accuracy for a given complexity. Furthermore, our open-source implementation of SyMANTIC, built on the PyTorch ecosystem, facilitates easy installation and GPU acceleration. We demonstrate the effectiveness of SyMANTIC across a range of problems, including synthetic examples, scientific benchmarks, real-world material property predictions, and chaotic dynamical system identification from small datasets. Extensive comparisons show that SyMANTIC uncovers similar or more accurate models at a fraction of the cost of existing SR methods.
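The first stage, mutual information-based screening of candidate descriptors, can be sketched with scikit-learn. The synthetic target below is an assumption chosen so that features 3 and 7 are the informative ones; this illustrates the screening idea only, not SyMANTIC's full pipeline.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                # 50 candidate descriptors
y = 2.0 * X[:, 3] ** 2 + X[:, 7] + 0.1 * rng.normal(size=300)

# Rank candidates by estimated mutual information with the target;
# nonlinear dependence (feature 3) is captured, unlike plain correlation.
mi = mutual_info_regression(X, y, random_state=0)
top = np.argsort(mi)[::-1][:5]
print("top candidate features by mutual information:", top)  # should surface 3 and 7
```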

[LG-9] Rethinking Approximate Gaussian Inference in Classification

链接: https://arxiv.org/abs/2502.03366
作者: Bálint Mucsányi,Nathaël Da Costa,Philipp Hennig
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 29 pages, 15 figures

点击查看摘要

Abstract:In classification tasks, softmax functions are ubiquitously used as output activations to produce predictive probabilities. Such outputs only capture aleatoric uncertainty. To capture epistemic uncertainty, approximate Gaussian inference methods have been proposed, which output Gaussian distributions over the logit space. Predictives are then obtained as the expectations of the Gaussian distributions pushed forward through the softmax. However, such softmax Gaussian integrals cannot be solved analytically, and Monte Carlo (MC) approximations can be costly and noisy. We propose a simple change in the learning objective which allows the exact computation of predictives and enjoys improved training dynamics, with no runtime or memory overhead. This framework is compatible with a family of output activation functions that includes the softmax, as well as element-wise normCDF and sigmoid. Moreover, it allows for approximating the Gaussian pushforwards with Dirichlet distributions by analytic moment matching. We evaluate our approach combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Code is available at this https URL.
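The Monte Carlo baseline the paper improves on is straightforward to write down: sample logits from the Gaussian, push each sample through the softmax, and average. Here is a small self-contained sketch; the mean and variance values are arbitrary assumptions for illustration.

```python
import numpy as np

def mc_softmax_predictive(mu, sigma, n_samples=10_000, rng=None):
    """Monte Carlo estimate of E[softmax(z)] for z ~ N(mu, diag(sigma**2)).
    This integral has no closed form, which is the cost the paper avoids."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = mu + sigma * rng.normal(size=(n_samples, len(mu)))
    e = np.exp(z - z.max(axis=1, keepdims=True))   # numerically stable softmax
    return (e / e.sum(axis=1, keepdims=True)).mean(axis=0)

mu = np.array([2.0, 1.0, 0.0])
sigma = np.array([1.0, 1.0, 1.0])
print(mc_softmax_predictive(mu, sigma))  # softer than softmax(mu): uncertainty spreads mass
```

Each predictive estimate needs thousands of samples to be stable, which illustrates the cost and noise the proposed objective is designed to avoid.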

[LG-10] A Match Made in Heaven? Matching Test Cases and Vulnerabilities With the VUTECO Approach

链接: https://arxiv.org/abs/2502.03365
作者: Emanuele Iannone,Quang-Cuong Bui,Riccardo Scandariato
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This work was partially supported by EU-funded project Sec4AI4Sec (grant no. 101120393)

点击查看摘要

Abstract:Software vulnerabilities are commonly detected via static analysis, penetration testing, and fuzzing. They can also be found by running unit tests - so-called vulnerability-witnessing tests - that stimulate the security-sensitive behavior with crafted inputs. Developing such tests is difficult and time-consuming; thus, automated data-driven approaches could help developers intercept vulnerabilities earlier. However, training and validating such approaches require a lot of data, which is currently scarce. This paper introduces VUTECO, a deep learning-based approach for collecting instances of vulnerability-witnessing tests from Java repositories. VUTECO carries out two tasks: (1) the “Finding” task to determine whether a test case is security-related, and (2) the “Matching” task to relate a test case to the exact vulnerability it is witnessing. VUTECO successfully addresses the Finding task, achieving perfect precision and 0.83 F0.5 score on validated test cases in VUL4J and returning 102 out of 145 (70%) correct security-related test cases from 244 open-source Java projects. Despite showing sufficiently good performance for the Matching task - i.e., 0.86 precision and 0.68 F0.5 score - VUTECO failed to retrieve any valid match in the wild. Nevertheless, we observed that in almost all of the matches, the test case was still security-related despite being matched to the wrong vulnerability. In the end, VUTECO can help find vulnerability-witnessing tests, though the matching with the right vulnerability is yet to be solved; the findings obtained lay the stepping stone for future research on the matter.

[LG-11] Scaling laws in wearable human activity recognition

链接: https://arxiv.org/abs/2502.03364
作者: Tom Hoddes,Alex Bijamov,Saket Joshi,Daniel Roggen,Ali Etemad,Robert Harle,David Racz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many deep architectures and self-supervised pre-training techniques have been proposed for human activity recognition (HAR) from wearable multimodal sensors. Scaling laws have the potential to help move towards more principled design by linking model capacity with pre-training data volume. Yet, scaling laws have not been established for HAR to the same extent as in language and vision. By conducting an exhaustive grid search on both the amount of pre-training data and Transformer architectures, we establish the first known scaling laws for HAR. We show that pre-training loss scales with a power law relationship to the amount of data and parameter count, and that increasing the number of users in a dataset results in a steeper improvement in performance than increasing data per user, indicating that diversity of pre-training data is important, which contrasts with some previously reported findings in self-supervised HAR. We show that these scaling laws translate to downstream performance improvements on three HAR benchmark datasets covering postures, modes of locomotion, and activities of daily living: UCI HAR, WISDM Phone, and WISDM Watch. Finally, we suggest some previously published works should be revisited in light of these scaling laws with more adequate model capacities.
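A scaling law of this kind is typically summarized by fitting a saturating power law, loss(N) = a * N^(-b) + c, to (data volume, pre-training loss) pairs. Below is a hedged sketch on synthetic points; the constants are made up for illustration and are not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c   # the standard scaling-law ansatz

n = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6])   # pre-training set sizes
loss = 5.0 * n ** (-0.25) + 0.4                # synthetic "measured" losses
loss += np.random.default_rng(0).normal(0, 0.005, n.size)

(a, b, c), _ = curve_fit(power_law, n, loss, p0=(1.0, 0.2, 0.1))
print(f"fitted exponent b = {b:.3f}")          # recovers ~0.25
```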

[LG-12] Interaction-Aware Gaussian Weighting for Clustered Federated Learning

链接: https://arxiv.org/abs/2502.03340
作者: Alessandro Licciardi,Davide Leo,Eros Faní,Barbara Caputo,Marco Ciccone
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) emerged as a decentralized paradigm to train models while preserving privacy. However, conventional FL struggles with data heterogeneity and class imbalance, which degrade model performance. Clustered FL balances personalization and decentralized training by grouping clients with analogous data distributions, enabling improved accuracy while adhering to privacy constraints. This approach effectively mitigates the adverse impact of heterogeneity in FL. In this work, we propose a novel clustered FL method, FedGWC (Federated Gaussian Weighting Clustering), which groups clients based on their data distribution, allowing training of a more robust and personalized model on the identified clusters. FedGWC identifies homogeneous clusters by transforming individual empirical losses to model client interactions with a Gaussian reward mechanism. Additionally, we introduce the Wasserstein Adjusted Score, a new clustering metric for FL to evaluate cluster cohesion with respect to the individual class distribution. Our experiments on benchmark datasets show that FedGWC outperforms existing FL algorithms in cluster quality and classification accuracy, validating the efficacy of our approach.

[LG-13] IRIS: An Immersive Robot Interaction System

链接: https://arxiv.org/abs/2502.03297
作者: Xinkai Jiang,Qihao Yuan,Enes Ulas Dincer,Hongyi Zhou,Ge Li,Xueyin Li,Julius Haag,Nicolas Schreiber,Kailai Li,Gerhard Neumann,Rudolf Lioutikov
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces IRIS, an immersive Robot Interaction System leveraging Extended Reality (XR), designed for robot data collection and interaction across multiple simulators, benchmarks, and real-world scenarios. While existing XR-based data collection systems provide efficient and intuitive solutions for large-scale data collection, they are often challenging to reproduce and reuse. This limitation arises because current systems are highly tailored to simulator-specific use cases and environments. IRIS is a novel, easily extendable framework that already supports multiple simulators, benchmarks, and even headsets. Furthermore, IRIS is able to include additional information from real-world sensors, such as point clouds captured through depth cameras. A unified scene specification is generated directly from simulators or real-world sensors and transmitted to XR headsets, creating identical scenes in XR. This specification allows IRIS to support any of the objects, assets, and robots provided by the simulators. In addition, IRIS introduces shared spatial anchors and a robust communication protocol that links simulations between multiple XR headsets. This feature enables multiple XR headsets to share a synchronized scene, facilitating collaborative and multi-user data collection. IRIS can be deployed on any device that supports the Unity Framework, encompassing the vast majority of commercially available headsets. In this work, IRIS was deployed and tested on the Meta Quest 3 and the HoloLens 2. IRIS showcased its versatility across a wide range of real-world and simulated scenarios, using current popular robot simulators such as MuJoCo, IsaacSim, CoppeliaSim, and Genesis. In addition, a user study evaluates IRIS on a data collection task for the LIBERO benchmark. The study shows that IRIS significantly outperforms the baseline in both objective and subjective metrics.

[LG-14] General Time-series Model for Universal Knowledge Representation of Multivariate Time-Series data

链接: https://arxiv.org/abs/2502.03264
作者: Cheng He,Xu Huang,Gangwei Jiang,Zhaoyi Li,Defu Lian,Hong Xie,Enhong Chen,Xijie Liang,Zengrong Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Universal knowledge representation is a central problem for multivariate time series (MTS) foundation models and yet remains open. This paper investigates this problem from first principles and makes four contributions. First, a new empirical finding is revealed: time series with different time granularities (or corresponding frequency resolutions) exhibit distinct joint distributions in the frequency domain. This implies a crucial aspect of learning universal knowledge, one that has been overlooked by previous studies. Second, a novel Fourier knowledge attention mechanism is proposed to enable learning time granularity-aware representations from both the temporal and frequency domains. Third, an autoregressive blank-infilling pre-training framework is incorporated into time series analysis for the first time, leading to a task-agnostic generative pre-training strategy. To this end, we develop the General Time-series Model (GTM), a unified MTS foundation model that addresses the limitation of contemporary time series models, which often require token-, pre-training-, or model-level customizations for downstream task adaptation. Fourth, extensive experiments show that GTM outperforms state-of-the-art (SOTA) methods across all generative tasks, including long-term forecasting, anomaly detection, and imputation.

[LG-15] RiemannGFM: Learning a Graph Foundation Model from Riemannian Geometry WWW25

链接: https://arxiv.org/abs/2502.03251
作者: Li Sun,Zhenhao Huang,Suyang Zhou,Qiqi Wan,Hao Peng,Philip Yu
类目: Machine Learning (cs.LG)
*备注: Accepted by WWW25

点击查看摘要

Abstract:The foundation model has heralded a new era in artificial intelligence, pretraining a single model to offer cross-domain transferability on different datasets. Graph neural networks excel at learning graph data, the omnipresent non-Euclidean structure, but often lack the generalization capacity. Hence, graph foundation models are drawing increasing attention, and recent efforts have been made to leverage Large Language Models. On the one hand, existing studies primarily focus on text-attributed graphs, while a wider range of real graphs do not contain fruitful textual attributes. On the other hand, the sequential graph description tailored for the Large Language Model neglects the structural complexity, which is a predominant characteristic of the graph. Such limitations motivate an important question: Can we go beyond Large Language Models, and pretrain a universal model to learn the structural knowledge for any graph? The answer in the language or vision domain is a shared vocabulary. We observe that there also exist shared substructures underlying the graph domain, thereby opening a new opportunity for graph foundation models with a structural vocabulary. The key innovation is the discovery of a simple yet effective structural vocabulary of trees and cycles, and we explore its inherent connection to Riemannian geometry. Herein, we present a universal pretraining model, RiemannGFM. Concretely, we first construct a novel product bundle to incorporate the diverse geometries of the vocabulary. Then, on this constructed space, we stack Riemannian layers where the structural vocabulary, regardless of the specific graph, is learned in a Riemannian manifold, offering cross-domain transferability. Extensive experiments show the effectiveness of RiemannGFM on a diversity of real graphs.

[LG-16] Calibrated Unsupervised Anomaly Detection in Multivariate Time-series using Reinforcement Learning

链接: https://arxiv.org/abs/2502.03245
作者: Saba Sanami,Amir G. Aghdam
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: This paper has been accepted for publication and presentation at the 2025 IEEE International systems Conference (SysCon)

点击查看摘要

Abstract:This paper investigates unsupervised anomaly detection in multivariate time-series data using reinforcement learning (RL) in the latent space of an autoencoder. A significant challenge is the limited availability of anomalous data, which often leads anomalies to be misclassified as normal events and thus increases false negatives. RL can help overcome this limitation by promoting exploration and balancing exploitation during training, effectively preventing overfitting. Wavelet analysis is also utilized to enhance anomaly detection, enabling time-series data decomposition into both time and frequency domains. This approach captures anomalies at multiple resolutions, with wavelet coefficients extracted to detect both sudden and subtle shifts in the data, thereby refining the anomaly detection process. We calibrate the decision boundary by generating synthetic anomalies and embedding a supervised framework within the model. This supervised element aids the unsupervised learning process by fine-tuning the decision boundary and increasing the model's capacity to distinguish between normal and anomalous patterns effectively.
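The wavelet step can be illustrated with the PyWavelets package: a multi-level discrete wavelet transform separates a series into coarse and detail bands, and a sudden spike stands out as an outlier in the finest detail coefficients. The signal and z-score check below are toy assumptions, not the paper's pipeline.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 1024)
signal = np.sin(2 * np.pi * t) + 0.05 * rng.normal(size=t.size)
signal[512] += 3.0                      # inject a sudden spike (anomaly)

# Multi-level discrete wavelet transform: detail coefficients localize
# abrupt changes in both time and frequency.
coeffs = pywt.wavedec(signal, "db4", level=4)
detail = coeffs[-1]                     # finest-scale detail coefficients
z = np.abs(detail - detail.mean()) / detail.std()
print("max |z| in finest detail band:", z.max())   # the spike stands out clearly
```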

[LG-17] Analysis of Value Iteration Through Absolute Probability Sequences

链接: https://arxiv.org/abs/2502.03244
作者: Arsenii Mustafin,Sebastien Colla,Alex Olshevsky,Ioannis Ch. Paschalidis
类目: Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:Value Iteration is a widely used algorithm for solving Markov Decision Processes (MDPs). While previous studies have extensively analyzed its convergence properties, they primarily focus on convergence with respect to the infinity norm. In this work, we use absolute probability sequences to develop a new line of analysis and examine the algorithm’s convergence in terms of the L^2 norm, offering a new perspective on its behavior and performance.
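For reference, here is a minimal value iteration on a random MDP, tracking the size of each update in both the infinity norm (the classical analysis) and the L^2 norm studied here. The MDP itself is a synthetic assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 20, 4, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, (S, A))

V = np.zeros(S)
for it in range(200):
    Q = R + gamma * P @ V                    # Q[s, a] = r(s, a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)                    # Bellman optimality update
    linf = np.abs(V_new - V).max()
    l2 = np.linalg.norm(V_new - V)
    V = V_new
    if it % 50 == 0:
        print(f"iter {it:3d}  ||dV||_inf={linf:.2e}  ||dV||_2={l2:.2e}")
```

Both norms shrink geometrically here; the paper's contribution is an analysis of the L^2 behavior via absolute probability sequences rather than the standard infinity-norm contraction argument.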

[LG-18] Pioneer: Physics-informed Riemannian Graph ODE for Entropy-increasing Dynamics AAAI25

链接: https://arxiv.org/abs/2502.03236
作者: Li Sun,Ziheng Zhang,Zixi Wang,Yujie Wang,Qiqi Wan,Hao Li,Hao Peng,Philip S. Yu
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI25

点击查看摘要

Abstract:Dynamic interacting system modeling is important for understanding and simulating real-world systems. The system is typically described as a graph, where multiple objects dynamically interact with each other and evolve over time. In recent years, graph Ordinary Differential Equation (ODE) models have received increasing research attention. While achieving encouraging results, existing solutions prioritize the traditional Euclidean space and neglect the intrinsic geometry of the system and physics laws, e.g., the principle of entropy increase. The limitations above motivate us to rethink system dynamics from a fresh perspective of Riemannian geometry, and pose a more realistic problem of physics-informed dynamic system modeling, considering the underlying geometry and physics laws for the first time. In this paper, we present a novel physics-informed Riemannian graph ODE for a wide range of entropy-increasing dynamic systems (termed Pioneer). In particular, we formulate a differential system on the Riemannian manifold, where a manifold-valued graph ODE is governed by the proposed constrained Ricci flow, and a manifold-preserving Gyro-transform aware of system geometry. Theoretically, we prove that the entropy of our formulation is non-decreasing, obeying physics laws. Empirical results show the superiority of Pioneer on real datasets.

[LG-19] Adversarial Dependence Minimization

链接: https://arxiv.org/abs/2502.03227
作者: Pierre-François De Plaen,Tinne Tuytelaars,Marc Proesmans,Luc Van Gool
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many machine learning techniques rely on minimizing the covariance between output feature dimensions to extract minimally redundant representations from data. However, these methods do not eliminate all dependencies/redundancies, as linearly uncorrelated variables can still exhibit nonlinear relationships. This work provides a differentiable and scalable algorithm for dependence minimization that goes beyond linear pairwise decorrelation. Our method employs an adversarial game where small networks identify dependencies among feature dimensions, while the encoder exploits this information to reduce dependencies. We provide empirical evidence of the algorithm’s convergence and demonstrate its utility in three applications: extending PCA to nonlinear decorrelation, improving the generalization of image classification methods, and preventing dimensional collapse in self-supervised representation learning.

[LG-20] SpaceGNN: Multi-Space Graph Neural Network for Node Anomaly Detection with Extremely Limited Labels

链接: https://arxiv.org/abs/2502.03201
作者: Xiangyu Dong,Xingyi Zhang,Lei Chen,Mingxuan Yuan,Sibo Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Node Anomaly Detection (NAD) has gained significant attention in the deep learning community due to its diverse applications in real-world scenarios. Existing NAD methods primarily embed graphs within a single Euclidean space, while overlooking the potential of non-Euclidean spaces. Besides, to address the prevalent issue of limited supervision in real NAD tasks, previous methods tend to leverage synthetic data to collect auxiliary information, which is not an effective solution as shown in our experiments. To overcome these challenges, we introduce a novel SpaceGNN model designed for NAD tasks with extremely limited labels. Specifically, we provide deeper insights into a task-relevant framework by empirically analyzing the benefits of different spaces for node representations, based on which, we design a Learnable Space Projection function that effectively encodes nodes into suitable spaces. Besides, we introduce the concept of weighted homogeneity, which we empirically and theoretically validate as an effective coefficient during information propagation. This concept inspires the design of the Distance Aware Propagation module. Furthermore, we propose the Multiple Space Ensemble module, which extracts comprehensive information for NAD under conditions of extremely limited supervision. Our findings indicate that this module is more beneficial than data augmentation techniques for NAD. Extensive experiments conducted on 9 real datasets confirm the superiority of SpaceGNN, which outperforms the best rival by an average of 8.55% in AUC and 4.31% in F1 scores. Our code is available at this https URL.

[LG-21] PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design

链接: https://arxiv.org/abs/2502.03159
作者: Yuchao Wu,Xiaofei Yu,Hao Chen,Yang Luo,Yeyu Tong,Yuzhe Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While large language models (LLMs) have shown remarkable potential in automating various tasks in digital chip design, the field of Photonic Integrated Circuits (PICs)-a promising solution to advanced chip designs-remains relatively unexplored in this context. The design of PICs is time-consuming and prone to errors due to the extensive and repetitive nature of code involved in photonic chip design. In this paper, we introduce PICBench, the first benchmarking and evaluation framework specifically designed to automate PIC design generation using LLMs, where the generated output takes the form of a netlist. Our benchmark consists of dozens of meticulously crafted PIC design problems, spanning from fundamental device designs to more complex circuit-level designs. It automatically evaluates both the syntax and functionality of generated PIC designs by comparing simulation outputs with expert-written solutions, leveraging an open-source simulator. We evaluate a range of existing LLMs, while also conducting comparative tests on various prompt engineering techniques to enhance LLM performance in automated PIC design. The results reveal the challenges and potential of LLMs in the PIC design domain, offering insights into the key areas that require further research and development to optimize automation in this field. Our benchmark and evaluation code is available at this https URL.

[LG-22] Symmetry-Aware Bayesian Flow Networks for Crystal Generation

链接: https://arxiv.org/abs/2502.03146
作者: Laura Ruple,Luca Torresi,Henrik Schopmans,Pascal Friederich
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:The discovery of new crystalline materials is essential to scientific and technological progress. However, traditional trial-and-error approaches are inefficient due to the vast search space. Recent advancements in machine learning have enabled generative models to predict new stable materials by incorporating structural symmetries and to condition the generation on desired properties. In this work, we introduce SymmBFN, a novel symmetry-aware Bayesian Flow Network (BFN) for crystalline material generation that accurately reproduces the distribution of space groups found in experimentally observed crystals. SymmBFN substantially improves efficiency, generating stable structures at least 50 times faster than the next-best method. Furthermore, we demonstrate its capability for property-conditioned generation, enabling the design of materials with tailored properties. Our findings establish BFNs as an effective tool for accelerating the discovery of crystalline materials.

[LG-23] Machine Learning-Driven Student Performance Prediction for Enhancing Tiered Instruction

链接: https://arxiv.org/abs/2502.03143
作者: Yawen Chen,Jiande Sun,Jinhui Wang,Liang Zhao,Xinmin Song,Linbo Zhai
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Student performance prediction is one of the most important subjects in educational data mining. As a modern technology, machine learning offers powerful capabilities in feature extraction and data modeling, providing essential support for diverse application scenarios, as evidenced by recent studies confirming its effectiveness in educational data mining. However, despite extensive prediction experiments, machine learning methods have not been effectively integrated into practical teaching strategies, hindering their application in modern education. In addition, using massive numbers of features as input variables for machine learning algorithms often leads to information redundancy, which can negatively impact prediction accuracy. Therefore, how to effectively use machine learning methods to predict student performance and integrate the prediction results with actual teaching scenarios is a worthy research subject. To this end, this study integrates the results of machine learning-based student performance prediction with tiered instruction, aiming to enhance student outcomes in the target course, which is significant for the application of educational data mining in contemporary teaching scenarios. Specifically, we collect original educational data and perform feature selection to reduce information redundancy. Then, the performance of five representative machine learning methods is analyzed and discussed, with Random Forest showing the best performance. Furthermore, based on the resulting classification of students, tiered instruction is applied accordingly, with different teaching objectives and contents set for each level of students. The comparison of teaching outcomes between the control and experimental classes, along with the analysis of questionnaire results, demonstrates the effectiveness of the proposed framework.

[LG-24] Underwater Soft Fin Flapping Motion with Deep Neural Network Based Surrogate Model

链接: https://arxiv.org/abs/2502.03135
作者: Yuya Hamamatsu,Pavlo Kupyn,Roza Gkliva,Asko Ristolainen,Maarja Kruusmaa
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted in IEEE International Conference on Soft Robotics 2025 (Robosoft)

点击查看摘要

Abstract:This study presents a novel framework for precise force control of fin-actuated underwater robots by integrating a deep neural network (DNN)-based surrogate model with reinforcement learning (RL). To address the complex interactions with the underwater environment and the high experimental costs, a DNN surrogate model acts as a simulator for enabling efficient training for the RL agent. Additionally, grid-switching control is applied to select optimized models for specific force reference ranges, improving control accuracy and stability. Experimental results show that the RL agent, trained in the surrogate simulation, generates complex thrust motions and achieves precise control of a real soft fin actuator. This approach provides an efficient control solution for fin-actuated robots in challenging underwater environments.

[LG-25] Double Distillation Network for Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2502.03125
作者: Yang Zhou,Siying Wang,Wenyu Chen,Ruoning Zhang,Zhitong Zhao,Zixuan Zhang
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-agent reinforcement learning typically employs a centralized training-decentralized execution (CTDE) framework to alleviate the non-stationarity in environment. However, the partial observability during execution may lead to cumulative gap errors gathered by agents, impairing the training of effective collaborative policies. To overcome this challenge, we introduce the Double Distillation Network (DDN), which incorporates two distillation modules aimed at enhancing robust coordination and facilitating the collaboration process under constrained information. The external distillation module uses a global guiding network and a local policy network, employing distillation to reconcile the gap between global training and local execution. In addition, the internal distillation module introduces intrinsic rewards, drawn from state information, to enhance the exploration capabilities of agents. Extensive experiments demonstrate that DDN significantly improves performance across multiple scenarios.

[LG-26] Multi-objective methods in Federated Learning: A survey and taxonomy

链接: https://arxiv.org/abs/2502.03108
作者: Maria Hartmann,Grégoire Danoy,Pascal Bouvry
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The Federated Learning paradigm facilitates effective distributed machine learning in settings where training data is decentralized across multiple clients. As the popularity of the strategy grows, increasingly complex real-world problems emerge, many of which require balancing conflicting demands such as fairness, utility, and resource consumption. Recent works have begun to recognise the use of a multi-objective perspective in answer to this challenge. However, this novel approach of combining federated methods with multi-objective optimisation has never been discussed in the broader context of both fields. In this work, we offer a first clear and systematic overview of the different ways the two fields can be integrated. We propose a first taxonomy on the use of multi-objective methods in connection with Federated Learning, providing a targeted survey of the state-of-the-art and proposing unambiguous labels to categorise contributions. Given the developing nature of this field, our taxonomy is designed to provide a solid basis for further research, capturing existing works while anticipating future additions. Finally, we outline open challenges and possible directions for further research.

[LG-27] Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms

链接: https://arxiv.org/abs/2502.03095
作者: Xuerui Su,Yue Wang,Jinhua Zhu,Mingyang Yi,Feng Xu,Zhiming Ma,Yuting Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of Large Language Models (LLMs), numerous Reinforcement Learning from Human Feedback (RLHF) algorithms have been introduced to improve model safety and alignment with human preferences. These algorithms can be divided into two main frameworks based on whether they require an explicit reward (or value) function for training: actor-critic-based Proximal Policy Optimization (PPO) and alignment-based Direct Preference Optimization (DPO). The mismatch between DPO and PPO, such as DPO’s use of a classification loss driven by human-preferred data, has raised confusion about whether DPO should be classified as a Reinforcement Learning (RL) algorithm. To address these ambiguities, we focus on three key aspects related to DPO, RL, and other RLHF algorithms: (1) the construction of the loss function; (2) the target distribution at which the algorithm converges; (3) the impact of key components within the loss function. Specifically, we first establish a unified framework named UDRRA connecting these algorithms based on the construction of their loss functions. Next, we uncover their target policy distributions within this framework. Finally, we investigate the critical components of DPO to understand their impact on the convergence rate. Our work provides a deeper understanding of the relationship between DPO, RL, and other RLHF algorithms, offering new insights for improving existing algorithms.
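For readers comparing the two families, the standard DPO loss (Rafailov et al., 2023) is easy to state: it is a logistic loss on the policy-versus-reference log-ratio margin between the preferred and dispreferred responses. A small numeric sketch with made-up log-probabilities follows; it illustrates the loss construction discussed above, not this paper's UDRRA framework.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * margin), where the margin
    is the policy's log-ratio advantage over the reference on (chosen, rejected)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-margin))))

# Toy numbers: the policy already prefers the chosen response slightly more
# than the reference does, so the loss sits just below log(2).
print(dpo_loss(np.array([-10.0]), np.array([-12.0]),
               np.array([-10.5]), np.array([-11.5])))
```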

[LG-28] Automatic Prompt Optimization Techniques: Exploring the Potential for Synthetic Data Generation

链接: https://arxiv.org/abs/2502.03078
作者: Nina Freise,Marius Heitlinger,Ruben Nuredini,Gerrit Meixner
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted for publication in the Proceedings of the 2025 HCI International Conference

点击查看摘要

Abstract:Artificial Intelligence (AI) advancement is heavily dependent on access to large-scale, high-quality training data. However, in specialized domains such as healthcare, data acquisition faces significant constraints due to privacy regulations, ethical considerations, and limited availability. While synthetic data generation offers a promising solution, conventional approaches typically require substantial real data for training generative models. The emergence of large-scale prompt-based models presents new opportunities for synthetic data generation without direct access to protected data. However, crafting effective prompts for domain-specific data generation remains challenging, and manual prompt engineering proves insufficient for achieving output with sufficient precision and authenticity. We review recent developments in automatic prompt optimization, following PRISMA guidelines. We analyze six peer-reviewed studies published between 2020 and 2024 that focus on automatic data-free prompt optimization methods. Our analysis reveals three approaches: feedback-driven, error-based, and control-theoretic. Although all approaches demonstrate promising capabilities in prompt refinement and adaptation, our findings suggest the need for an integrated framework that combines complementary optimization techniques to enhance synthetic data generation while minimizing manual intervention. We propose future research directions toward developing robust, iterative prompt optimization frameworks capable of improving the quality of synthetic data. This advancement can be particularly crucial for sensitive fields and in specialized domains where data access is restricted, potentially transforming how we approach synthetic data generation for AI development.

[LG-29] Optimizing Electric Vehicles Charging using Large Language Models and Graph Neural Networks

链接: https://arxiv.org/abs/2502.03067
作者: Stavros Orfanoudakis,Peter Palensky,Pedro P. Vergara
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Maintaining grid stability amid widespread electric vehicle (EV) adoption is vital for sustainable transportation. Traditional optimization methods and Reinforcement Learning (RL) approaches often struggle with the high dimensionality and dynamic nature of real-time EV charging, leading to sub-optimal solutions. To address these challenges, this study demonstrates that combining Large Language Models (LLMs), for sequence modeling, with Graph Neural Networks (GNNs), for relational information extraction, not only outperforms conventional EV smart charging methods, but also paves the way for entirely new research directions and innovative solutions.

[LG-30] Optimal Best Arm Identification with Post-Action Context

链接: https://arxiv.org/abs/2502.03061
作者: Mohammad Shahverdikondori,Amir Mohammad Abouei,Alireza Rezaeimoghadam,Negar Kiyavash
类目: Machine Learning (cs.LG)
*备注: 37 pages, 7 figures

点击查看摘要

Abstract:We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses scenarios in which the learner receives a post-action context in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of post-action context: (i) non-separator, where the reward depends on both the action and the context, and (ii) separator, where the reward depends solely on the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. For the separator setting, we propose a novel sampling rule called G-tracking, which uses the geometry of the context space to directly track the contexts rather than the actions. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.

[LG-31] Understanding and Enhancing the Transferability of Jailbreaking Attacks ICLR2025

链接: https://arxiv.org/abs/2502.03052
作者: Runqi Lin,Bo Han,Fengwang Li,Tongling Liu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted by ICLR 2025

点击查看摘要

Abstract:Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently. To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analysing their impact on the model’s intent perception. By incorporating adversarial sequences, these attacks can redirect the source LLM’s focus away from malicious-intent tokens in the original input, thereby obstructing the model’s intent recognition and eliciting harmful responses. Nevertheless, these adversarial sequences fail to mislead the target LLM’s intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding. Our analysis further reveals the inherent distributional dependency within the generated adversarial sequences, whose effectiveness stems from overfitting the source LLM’s parameters, resulting in limited transferability to target LLMs. To this end, we propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model’s focus across neutral-intent tokens in the original input, thus obscuring malicious-intent tokens without relying on overfitted adversarial sequences. Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs.

[LG-32] The Ensemble Kalman Update is an Empirical Matheron Update

链接: https://arxiv.org/abs/2502.03048
作者: Dan MacKinlay
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Ensemble Kalman Filter (EnKF) is a widely used method for data assimilation in high-dimensional systems. In this paper, we show that the ensemble update step of the EnKF is equivalent to an empirical version of the Matheron update popular in the study of Gaussian process regression. While this connection is simple, it seems not to be widely known: the literature on each technique is largely distinct, and connections between the methods are not exploited. This paper exists to provide an informal introduction to the connection, with the necessary definitions so that it is intelligible to as broad an audience as possible.
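
The equivalence is easy to see in a small numpy sketch, assuming a linear observation operator H and Gaussian observation noise (all shapes and values illustrative): the empirical Matheron update built from ensemble covariances is exactly the stochastic (perturbed-observation) EnKF update.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 4, 2                    # ensemble size, state dim, obs dim
H = rng.normal(size=(m, d))            # linear observation operator (assumed)
R = 0.1 * np.eye(m)                    # observation noise covariance
y_obs = np.array([1.0, -0.5])          # the observation (illustrative)

X = rng.normal(size=(n, d))                              # prior ensemble
eps = rng.multivariate_normal(np.zeros(m), R, size=n)    # perturbed-obs noise
Y = X @ H.T + eps                                        # predicted observations

# Empirical cross- and auto-covariances over the ensemble
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
K = (Xc.T @ Yc / (n - 1)) @ np.linalg.inv(Yc.T @ Yc / (n - 1))  # empirical gain

# Stochastic EnKF update of each member == empirical Matheron update:
#   x_i <- x_i + K (y_obs - (H x_i + eps_i))
X_post = X + (y_obs - Y) @ K.T
print(X_post.shape)  # (500, 4)
```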

[LG-33] RepLoRA: Reparameterizing Low-Rank Adaptation via the Perspective of Mixture of Experts

链接: https://arxiv.org/abs/2502.03044
作者: Tuan Truong,Chau Nguyen,Huy Nguyen,Minh Le,Trung Le,Nhat Ho
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) has emerged as a powerful method for fine-tuning large-scale foundation models. Despite its popularity, the theoretical understanding of LoRA has remained limited. This paper presents a theoretical analysis of LoRA by examining its connection to the Mixture of Experts models. Under this framework, we show that simple reparameterizations of the LoRA matrices can notably accelerate the low-rank matrix estimation process. In particular, we prove that reparameterization can reduce the data needed to achieve a desired estimation error from an exponential to a polynomial scale. Motivated by this insight, we propose Reparameterized Low-rank Adaptation (RepLoRA), which incorporates lightweight MLPs to reparameterize the LoRA matrices. Extensive experiments across multiple domains demonstrate that RepLoRA consistently outperforms vanilla LoRA. Notably, with limited data, RepLoRA surpasses LoRA by a margin of up to 40.0% and achieves LoRA’s performance with only 30.0% of the training data, highlighting both the theoretical and empirical robustness of our PEFT method.
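
As a rough illustration only: below is a LoRA layer whose low-rank factors pass through lightweight MLPs before forming the weight update. The exact placement and sizing of the reparameterization in RepLoRA may differ; this sketch just shows the general shape of the idea.

```python
import torch
import torch.nn as nn

class RepLoRALinear(nn.Module):
    """Sketch: LoRA factors are passed through lightweight MLPs before
    forming the low-rank update (placement/sizes here are assumptions)."""
    def __init__(self, d_in, d_out, r=8, hidden=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)      # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        # lightweight MLPs that reparameterize the LoRA factors
        self.mlp_a = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                   nn.Linear(hidden, d_in))
        self.mlp_b = nn.Sequential(nn.Linear(r, hidden), nn.ReLU(),
                                   nn.Linear(hidden, r))

    def forward(self, x):
        A_rep = self.mlp_a(self.A)                  # (r, d_in)
        B_rep = self.mlp_b(self.B)                  # (d_out, r)
        # NOTE: unlike vanilla LoRA, the update is not exactly zero at init,
        # since the MLP biases act on the zero-initialized B.
        return self.base(x) + x @ A_rep.T @ B_rep.T

layer = RepLoRALinear(64, 32)
print(layer(torch.randn(4, 64)).shape)              # torch.Size([4, 32])
```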

[LG-34] Large Language Models Are Universal Recommendation Learners

链接: https://arxiv.org/abs/2502.03041
作者: Junguang Jiang,Yanwen Huang,Bin Liu,Xiaoyu Kong,Ziru Xu,Han Zhu,Jian Xu,Bo Zheng
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In real-world recommender systems, different tasks are typically addressed using supervised learning on task-specific datasets with carefully designed model architectures. We demonstrate that large language models (LLMs) can function as universal recommendation learners, capable of handling multiple tasks within a unified input-output framework, eliminating the need for specialized model designs. To improve the recommendation performance of LLMs, we introduce a multimodal fusion module for item representation and a sequence-in-set-out approach for efficient candidate generation. When applied to industrial-scale data, our LLM achieves competitive results with expert models elaborately designed for different recommendation tasks. Furthermore, our analysis reveals that recommendation outcomes are highly sensitive to text input, highlighting the potential of prompt engineering in optimizing industrial-scale recommender systems.

[LG-35] Aggregate to Adapt: Node-Centric Aggregation for Multi-Source-Free Graph Domain Adaptation WWW-2025

链接: https://arxiv.org/abs/2502.03033
作者: Zhen Zhang,Bingsheng He
类目: Machine Learning (cs.LG)
*备注: Accepted by WWW-2025

点击查看摘要

Abstract:Unsupervised graph domain adaptation (UGDA) focuses on transferring knowledge from labeled source graph to unlabeled target graph under domain discrepancies. Most existing UGDA methods are designed to adapt information from a single source domain, which cannot effectively exploit the complementary knowledge from multiple source domains. Furthermore, their assumptions that the labeled source graphs are accessible throughout the training procedure might not be practical due to privacy, regulation, and storage concerns. In this paper, we investigate multi-source-free unsupervised graph domain adaptation, i.e., adapting knowledge from multiple source domains to an unlabeled target domain without utilizing labeled source graphs but relying solely on source pre-trained models. Unlike previous multi-source domain adaptation approaches that aggregate predictions at model level, we introduce a novel model named GraphATA which conducts adaptation at node granularity. Specifically, we parameterize each node with its own graph convolutional matrix by automatically aggregating weight matrices from multiple source models according to its local context, thus realizing dynamic adaptation over graph structured data. We also demonstrate the capability of GraphATA to generalize to both model-centric and layer-centric methods. Comprehensive experiments on various public datasets show that our GraphATA can consistently surpass recent state-of-the-art baselines with different gains.
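
A hedged sketch of the node-granularity aggregation described above: each node mixes the weight matrices of several source pre-trained models with context-dependent coefficients, yielding its own convolution matrix. The mechanism that produces the mixing coefficients in GraphATA is not reproduced here and is replaced by a placeholder softmax.

```python
import torch

def node_centric_aggregation(node_feats, source_weights, attn):
    """Each node gets its own convolution matrix as an attention-weighted
    combination of weight matrices from the source pre-trained models."""
    # attn: (num_nodes, num_sources) mixing coefficients from local context
    # source_weights: (num_sources, d_in, d_out)
    W_per_node = torch.einsum("ns,sio->nio", attn, source_weights)
    # apply each node's own matrix to its own feature vector
    return torch.einsum("ni,nio->no", node_feats, W_per_node)

num_nodes, num_sources, d = 5, 3, 8
x = torch.randn(num_nodes, d)
Ws = torch.randn(num_sources, d, d)                   # from source models
attn = torch.softmax(torch.randn(num_nodes, num_sources), dim=-1)  # placeholder
print(node_centric_aggregation(x, Ws, attn).shape)    # torch.Size([5, 8])
```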

[LG-36] On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

链接: https://arxiv.org/abs/2502.03029
作者: Nghiem T. Diep,Huy Nguyen,Chau Nguyen,Minh Le,Duy M. H. Nguyen,Daniel Sonntag,Mathias Niepert,Nhat Ho
类目: Machine Learning (cs.LG)
*备注: 43 pages, 5 tables, 6 figures

点击查看摘要

Abstract:The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
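
For readers unfamiliar with the mechanism under analysis, here is a minimal single-head sketch of zero-initialized attention (LLaMA-Adapter style, with linear prompts): because the gate starts at zero, the module reproduces vanilla attention exactly at initialization. This is a simplified assumption-laden illustration, not the paper's construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroInitPromptAttention(nn.Module):
    """Single-head sketch: learnable (linear) prompts contribute through a
    gate initialized at zero, so the module equals vanilla attention at init."""
    def __init__(self, dim, n_prompts=10):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))     # zero-initialized gating

    def forward(self, q, k, v):
        scale = k.shape[-1] ** 0.5
        out = F.softmax(q @ k.transpose(-2, -1) / scale, dim=-1) @ v
        # prompt branch (prompts reused as keys and values for simplicity)
        p_attn = F.softmax(q @ self.prompts.T / scale, dim=-1)
        return out + torch.tanh(self.gate) * (p_attn @ self.prompts)

x = torch.randn(32, 64)                    # toy sequence: 32 tokens, dim 64
attn = ZeroInitPromptAttention(dim=64)
out = attn(x, x, x)                        # == vanilla attention at init
```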

[LG-37] Parametric Scaling Law of Tuning Bias in Conformal Prediction

链接: https://arxiv.org/abs/2502.03023
作者: Hao Zeng,Kangdao Liu,Bingyi Jing,Hongxin Wei
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Conformal prediction is a popular framework of uncertainty quantification that constructs prediction sets with coverage guarantees. To uphold the exchangeability assumption, many conformal prediction methods necessitate an additional holdout set for parameter tuning. Yet, the impact of violating this principle on coverage remains underexplored, making it ambiguous in practical applications. In this work, we empirically find that the tuning bias, i.e., the coverage gap introduced by leveraging the same dataset for tuning and calibration, is negligible for simple parameter tuning in many conformal prediction methods. In particular, we observe a scaling law of the tuning bias: this bias increases with parameter space complexity and decreases with calibration set size. Formally, we establish a theoretical framework to quantify the tuning bias and provide rigorous proof for the scaling law of the tuning bias by deriving its upper bound. In the end, we discuss how to reduce the tuning bias, guided by the theories we developed.
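
A toy numpy experiment in the spirit of the scaling law (entirely synthetic, not the authors' setup): tuning a score-mixing weight on the same small set used for calibration tends to deflate coverage, and the gap shrinks as the calibration set grows or the parameter grid gets coarser.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

def conf_quantile(s, alpha):
    n = len(s)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(s, level, method="higher")

def score(E, w):                      # score family indexed by a tuning knob w
    return w * E[:, 0] + (1 - w) * E[:, 1]

E_cal = np.abs(rng.normal(size=(50, 2)))         # small calibration set
E_test = np.abs(rng.normal(size=(100000, 2)))
grid = np.linspace(0, 1, 101)                    # parameter-space "complexity"

# Tuning bias: pick w to shrink the quantile on the SAME set used to calibrate
w_star = min(grid, key=lambda w: conf_quantile(score(E_cal, w), alpha))
q = conf_quantile(score(E_cal, w_star), alpha)
print(f"target {1 - alpha:.2f}, realized {np.mean(score(E_test, w_star) <= q):.3f}")
# realized coverage typically dips below target; the gap shrinks with a
# larger calibration set or a smaller parameter grid
```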

[LG-38] An Augmented Backward-Corrected Projector Splitting Integrator for Dynamical Low-Rank Training

链接: https://arxiv.org/abs/2502.03006
作者: Jonas Kusch,Steffen Schotthöfer,Alexandra Walter
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Layer factorization has emerged as a widely used technique for training memory-efficient neural networks. However, layer factorization methods face several challenges, particularly a lack of robustness during the training process. To overcome this limitation, dynamical low-rank training methods have been developed, utilizing robust time integration techniques for low-rank matrix differential equations. Although these approaches facilitate efficient training, they still depend on computationally intensive QR and singular value decompositions of matrices with small rank. In this work, we introduce a novel low-rank training method that reduces the number of required QR decompositions. Our approach integrates an augmentation step into a projector-splitting scheme, ensuring convergence to a locally optimal solution. We provide a rigorous theoretical analysis of the proposed method and demonstrate its effectiveness across multiple benchmarks.

[LG-39] Conformal Uncertainty Indicator for Continual Test-Time Adaptation

链接: https://arxiv.org/abs/2502.02998
作者: Fan Lyu,Hanyu Zhao,Ziqi Shi,Ye Liu,Fuyuan Hu,Zhang Zhang,Liang Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual Test-Time Adaptation (CTTA) aims to adapt models to sequentially changing domains during testing, relying on pseudo-labels for self-adaptation. However, incorrect pseudo-labels can accumulate, leading to performance degradation. To address this, we propose a Conformal Uncertainty Indicator (CUI) for CTTA, leveraging Conformal Prediction (CP) to generate prediction sets that include the true label with a specified coverage probability. Since domain shifts can lower the coverage than expected, making CP unreliable, we dynamically compensate for the coverage by measuring both domain and data differences. Reliable pseudo-labels from CP are then selectively utilized to enhance adaptation. Experiments confirm that CUI effectively estimates uncertainty and improves adaptation performance across various existing CTTA methods.

[LG-40] Learning Efficient Flocking Control based on Gibbs Random Fields

链接: https://arxiv.org/abs/2502.02984
作者: Dengyu Zhang,Chenghao,Feng Xue,Qingrui Zhang
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 9 pages, 10 figures

点击查看摘要

Abstract:Flocking control is essential for multi-robot systems in diverse applications, yet achieving efficient flocking in congested environments poses challenges regarding computational burden, performance optimality, and motion safety. This paper addresses these challenges through a multi-agent reinforcement learning (MARL) framework built on Gibbs Random Fields (GRFs). With GRFs, a multi-robot system is represented by a set of random variables conforming to a joint probability distribution, thus offering a fresh perspective on flocking reward design. A decentralized training and execution mechanism, which enhances the scalability of MARL with respect to the number of robots, is realized using a GRF-based credit assignment method. An action attention module is introduced to implicitly anticipate the motion intentions of neighboring robots, consequently mitigating potential non-stationarity issues in MARL. The proposed framework enables learning an efficient distributed control policy for multi-robot systems in challenging environments with a success rate of around 99%, as demonstrated through thorough comparisons with state-of-the-art solutions in simulations and experiments. Ablation studies are also performed to validate the efficiency of different framework modules.

[LG-41] Label Anything: An Interpretable High-Fidelity and Prompt-Free Annotator ICRA2025

链接: https://arxiv.org/abs/2502.02972
作者: Wei-Bin Kou,Guangxu Zhu,Rongguang Ye,Shuai Wang,Ming Tang,Yik-Chung Wu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted by ICRA 2025

点击查看摘要

Abstract:Learning-based street scene semantic understanding in autonomous driving (AD) has advanced significantly in recent years, but the performance of AD models depends heavily on the quantity and quality of annotated training data. However, traditional manual labeling incurs a high cost to annotate the vast amount of data required for training robust models. To mitigate this labeling cost, we propose a Label Anything Model (denoted as LAM), serving as an interpretable, high-fidelity, and prompt-free data annotator. Specifically, we first incorporate a pretrained Vision Transformer (ViT) to extract latent features. On top of the ViT, we propose a semantic class adapter (SCA) and an optimization-oriented unrolling algorithm (OptOU), both with a rather small number of trainable parameters. SCA fuses the ViT-extracted features to consolidate the basis of the subsequent automatic annotation. OptOU consists of multiple cascading layers, each containing an optimization formulation that aligns its output with the ground truth as closely as possible, through which OptOU remains interpretable rather than acting as a learning-based black box. In addition, training SCA and OptOU requires only a single pre-annotated RGB seed image, owing to their small number of learnable parameters. Extensive experiments clearly demonstrate that the proposed LAM can generate high-fidelity annotations (almost 100% in mIoU) for multiple real-world datasets (i.e., Camvid, Cityscapes, and Apolloscapes) and the CARLA simulation dataset.

[LG-42] Membership Inference Attack Should Move On to Distributional Statistics for Distilled Generative Models

链接: https://arxiv.org/abs/2502.02970
作者: Muxing Li,Zesheng Ye,Yixuan Li,Andy Song,Guangquan Zhang,Feng Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Membership inference attacks (MIAs) determine whether certain data instances were used to train a model by exploiting the differences in how the model responds to seen versus unseen instances. This capability makes MIAs important in assessing privacy leakage within modern generative AI systems. However, this paper reveals an oversight in existing MIAs against distilled generative models: attackers can no longer detect a teacher model's training instances individually when targeting the distilled student model, as the student learns from the teacher-generated data rather than its original member data, preventing direct instance-level memorization. Nevertheless, we find that student-generated samples exhibit a significantly stronger distributional alignment with the teacher's member data than with non-member data. This leads us to posit that MIAs on distilled generative models should shift from instance-level to distribution-level statistics. We thereby introduce a set-based MIA framework that measures relative distributional discrepancies between student-generated data sets and potential member/non-member data sets. Empirically, distributional statistics reliably distinguish a teacher's member data from non-member data through the distilled model. Finally, we discuss scenarios in which our setup faces limitations.
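
To make "distribution-level statistics" concrete, here is a hedged sketch using a kernel MMD as the distributional discrepancy; the paper's actual statistic and feature space may well differ, and all data below are synthetic placeholders.

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
student_gen = rng.normal(0.0, 1.0, size=(500, 8))     # student-generated set
set_member = rng.normal(0.05, 1.0, size=(200, 8))     # near the teacher's data
set_nonmember = rng.normal(0.5, 1.2, size=(200, 8))

# Infer membership for the candidate set with the smaller discrepancy
gaps = {"member": rbf_mmd2(student_gen, set_member),
        "nonmember": rbf_mmd2(student_gen, set_nonmember)}
print(min(gaps, key=gaps.get))  # "member"
```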

[LG-43] Direct Distributional Optimization for Provable Alignment of Diffusion Models

链接: https://arxiv.org/abs/2502.02954
作者: Ryotaro Kawata,Kazusato Oko,Atsushi Nitanda,Taiji Suzuki
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel alignment method for diffusion models from a distribution optimization perspective while providing rigorous convergence guarantees. We first formulate the problem as a generic regularized loss minimization over probability distributions and directly optimize the distribution using the Dual Averaging method. Next, we enable sampling from the learned distribution by approximating its score function via Doob's h-transform technique. The proposed framework is supported by rigorous convergence guarantees and an end-to-end bound on the sampling error, which imply that when the original distribution's score is known accurately, the complexity of sampling from shifted distributions is independent of isoperimetric conditions. This framework is broadly applicable to general distribution optimization problems, including alignment tasks in Reinforcement Learning with Human Feedback (RLHF), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO). We empirically validate its performance on synthetic and image datasets using the DPO objective.

[LG-44] Behavioral Homophily in Social Media via Inverse Reinforcement Learning: A Reddit Case Study

链接: https://arxiv.org/abs/2502.02943
作者: Lanqin Yuan,Philipp J. Schneider,Marian-Andrei Rizoiu
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online communities play a critical role in shaping societal discourse and influencing collective behavior in the real world. The tendency for people to connect with others who share similar characteristics and views, known as homophily, plays a key role in the formation of echo chambers which further amplify polarization and division. Existing works examining homophily in online communities traditionally infer it using content- or adjacency-based approaches, such as constructing explicit interaction networks or performing topic analysis. These methods fall short for platforms where interaction networks cannot be easily constructed and fail to capture the complex nature of user interactions across the platform. This work introduces a novel approach for quantifying user homophily. We first use an Inverse Reinforcement Learning (IRL) framework to infer users' policies, then use these policies as a measure of behavioral homophily. We apply our method to Reddit, conducting a case study across 5.9 million interactions over six years, demonstrating how this approach uncovers distinct behavioral patterns and user roles that vary across different communities. We further validate our behavioral homophily measure against traditional content-based homophily, offering a powerful method for analyzing social media dynamics and their broader societal implications. We find, among other things, that users can behave very similarly (high behavioral homophily) when discussing entirely different topics like soccer vs e-sports (low topical homophily), and that there is an entire class of users on Reddit whose purpose seems to be to disagree with others.

[LG-45] Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization FAST NEURIPS2024

链接: https://arxiv.org/abs/2502.02941
作者: Yang Li,Jinpei Guo,Runzhong Wang,Hongyuan Zha,Junchi Yan
类目: Machine Learning (cs.LG)
*备注: Published at NeurIPS 2024, the implementation code is available at this https URL

点击查看摘要

Abstract:Diffusion models have recently advanced Combinatorial Optimization (CO) as a powerful backbone for neural solvers. However, their iterative sampling process requiring denoising across multiple noise levels incurs substantial overhead. We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which, for a given instance, minimizes the difference among samples originating from varying generative trajectories and time steps relative to the optimal solution. The proposed model enables fast single-step solution generation while retaining the option of multi-step sampling to trade for sampling quality, which offers a more effective and efficient alternative backbone for neural solvers. In addition, within the training-to-testing (T2T) framework, to bridge the gap between training on historical instances and solving new instances, we introduce a novel consistency-based gradient search scheme during the test stage, enabling more effective exploration of the solution space learned during training. It is achieved by updating the latent solution probabilities under objective gradient guidance during the alternation of noise injection and denoising steps. We refer to this model as Fast T2T. Extensive experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency, even outperforming LKH given limited time budgets. Notably, Fast T2T with merely one-step generation and one-step gradient search can mostly outperform the SOTA diffusion-based counterparts that require hundreds of steps, while achieving tens of times speedup.

[LG-46] Robust Reward Alignment in Hypothesis Space

链接: https://arxiv.org/abs/2502.02921
作者: Zhixian Xie,Haode Zhang,Yizhe Feng,Wanxin Jin
类目: Machine Learning (cs.LG)
*备注: 17 pages, including appendix

点击查看摘要

Abstract:Reward design for reinforcement learning and optimal control agents is challenging. Preference-based alignment addresses this by enabling agents to learn rewards from ranked trajectory pairs provided by humans. However, existing methods often suffer from poor robustness to unknown, false human preferences. In this work, we propose a robust and efficient reward alignment method based on a novel and geometrically interpretable perspective: hypothesis space batched cutting. Our method iteratively refines the reward hypothesis space through "cuts" based on batches of human preferences. Within each batch, human preferences, queried based on disagreement, are grouped using a voting function to determine the appropriate cut, ensuring a bounded human query complexity. To handle unknown erroneous preferences, we introduce a conservative cutting method within each batch, preventing erroneous human preferences from making overly aggressive cuts to the hypothesis space. This guarantees provable robustness against false preferences. We evaluate our method in a model predictive control setting across diverse tasks, including DM-Control, dexterous in-hand manipulation, and locomotion. The results demonstrate that our framework achieves comparable or superior performance to state-of-the-art methods in error-free settings while significantly outperforming existing methods when handling a high percentage of erroneous human preferences.

[LG-47] Privacy Token: Surprised to Find Out What You Accidentally Revealed

链接: https://arxiv.org/abs/2502.02913
作者: Jiayang Meng,Tao Huang,Xin Shi,Qingyu Huang,Chen Hou,Hong Chen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The widespread deployment of deep learning models in privacy-sensitive domains has amplified concerns regarding privacy risks, particularly those stemming from gradient leakage during training. Current privacy assessments primarily rely on post-training attack simulations. However, these methods are inherently reactive, unable to encompass all potential attack scenarios, and often based on idealized adversarial assumptions. These limitations underscore the need for proactive approaches to privacy risk assessment during the training process. To address this gap, we propose the concept of privacy tokens, which are derived directly from private gradients during training. Privacy tokens encapsulate gradient features and, when combined with data features, offer valuable insights into the extent of private information leakage from training data, enabling real-time measurement of privacy risks without relying on adversarial attack simulations. Additionally, we employ Mutual Information (MI) as a robust metric to quantify the relationship between training data and gradients, providing precise and continuous assessments of privacy leakage throughout the training process. Extensive experiments validate our framework, demonstrating the effectiveness of privacy tokens and MI in identifying and quantifying privacy risks. This proactive approach marks a significant advancement in privacy monitoring, promoting the safer deployment of deep learning models in sensitive applications.

[LG-48] DANDI: Diffusion as Normative Distribution for Deep Neural Network Input

链接: https://arxiv.org/abs/2502.02910
作者: Somin Kim,Shin Yoo
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: DeepTest 2025 Workshop

点击查看摘要

Abstract:Surprise Adequacy (SA) has been widely studied as a test adequacy metric that can effectively guide software engineers towards inputs that are more likely to reveal unexpected behaviour of Deep Neural Networks (DNNs). Intuitively, SA is an out-of-distribution metric that quantifies the dissimilarity between the given input and the training data: if a new input is very different from those seen during training, the DNN is more likely to behave unexpectedly against the input. While SA has been widely adopted as a test prioritization method, its major weakness is the fact that the computation of the metric requires access to the training dataset, which is often not allowed in real-world use cases. We present DANDI, a technique that generates a surrogate input distribution using Stable Diffusion to compute SA values without requiring the original training data. An empirical evaluation of DANDI applied to image classifiers for CIFAR-10 and ImageNet-1K shows that SA values computed against synthetic data are highly correlated with the values computed against the training data, with Spearman rank correlation values of 0.852 for ImageNet-1K and 0.881 for CIFAR-10. Further, we show that SA values computed by DANDI can prioritize inputs as effectively as those computed using the training data, when testing DNN models mutated by DeepMutation. We believe that DANDI can significantly improve the usability of SA for practical DNN testing.
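
For intuition, a minimal sketch of likelihood-based Surprise Adequacy over activation traces, with random vectors standing in for both the diffusion-generated surrogate set and the test activations (DANDI's Stable Diffusion pipeline is not reproduced):

```python
import numpy as np
from scipy.stats import gaussian_kde

def likelihood_sa(reference_acts, test_acts):
    """Likelihood-based SA: low density of a test activation trace under
    the reference set means high surprise."""
    kde = gaussian_kde(reference_acts.T)      # gaussian_kde expects (d, n)
    return -np.log(kde(test_acts.T) + 1e-12)

rng = np.random.default_rng(0)
surrogate = rng.normal(size=(1000, 4))        # stand-in for diffusion samples
in_dist = rng.normal(size=(5, 4))
far_ood = rng.normal(3.0, 1.0, size=(5, 4))
print(likelihood_sa(surrogate, in_dist).mean()
      < likelihood_sa(surrogate, far_ood).mean())  # True: OOD is more surprising
```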

[LG-49] COSMosFL: Ensemble of Small Language Models for Fault Localisation

链接: https://arxiv.org/abs/2502.02908
作者: Hyunjoon Cho,Sungmin Kang,Gabin An,Shin Yoo
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: LLM4Code 2025 Workshop

点击查看摘要

Abstract:LLMs are rapidly being adopted to build powerful tools and agents for software engineering, but most of them rely heavily on extremely large closed-source models. This, in turn, can hinder wider adoption due to security issues as well as financial cost and environmental impact. Recently, a number of open source Small Language Models (SLMs) are being released and gaining traction. While SLMs are smaller, more energy-efficient, and therefore easier to locally deploy, they tend to show worse performance when compared to larger closed LLMs. We present COSMos, a task-level LLM ensemble technique that uses a voting mechanism to provide a broader range of choice between SLMs and LLMs. We instantiate COSMos with an LLM-based Fault Localisation technique, AutoFL, and report the cost-benefit trade-off between LLM accuracy and various costs such as energy consumption, inference time, and the number of tokens used. An empirical evaluation using Defects4J shows that COSMos can build effective ensembles that can achieve Pareto-optimality in terms of FL accuracy and inference cost, when compared to individual models.
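
A toy sketch of the task-level voting idea, with hypothetical method names: each SLM contributes its top-k suspicious methods for fault localisation, and the ensemble ranks methods by vote count. COSMos's actual voting scheme and AutoFL integration are more involved.

```python
from collections import Counter

def ensemble_vote(rankings, top_k=2):
    """Each model votes for its top-k suspicious methods; the ensemble
    ranks methods by total vote count."""
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:top_k])
    return [method for method, _ in votes.most_common()]

# Hypothetical per-model fault-localisation rankings
model_outputs = [
    ["Foo.bar()", "Baz.qux()", "Foo.init()"],
    ["Baz.qux()", "Foo.bar()", "Util.log()"],
    ["Foo.bar()", "Util.log()", "Baz.qux()"],
]
print(ensemble_vote(model_outputs))  # ['Foo.bar()', 'Baz.qux()', 'Util.log()']
```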

[LG-50] Variations on the Expectation Due to Changes in the Probability Measure

链接: https://arxiv.org/abs/2502.02887
作者: Samir M. Perlaza,Gaetan Bisson
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: Submitted to the IEEE International Symposium on Information Theory (ISIT2025)

点击查看摘要

Abstract:Closed-form expressions are presented for the variation of the expectation of a given function due to changes in the probability measure used for the expectation. They unveil interesting connections with Gibbs probability measures, the mutual information, and the lautum information.

[LG-51] PH-VAE: A Polynomial Hierarchical Variational Autoencoder Towards Disentangled Representation Learning

链接: https://arxiv.org/abs/2502.02856
作者: Xi Chen,Shaofan Li
类目: Machine Learning (cs.LG)
*备注: 15 pages,14 figures

点击查看摘要

Abstract:The variational autoencoder (VAE) is a simple and efficient generative artificial intelligence method for modeling complex probability distributions of various types of data, such as images and texts. However, it suffers from several notable shortcomings, such as lack of interpretability in the latent variables, difficulties in tuning hyperparameters while training, producing blurry, unrealistic downstream outputs or loss of information due to how it calculates loss functions and recovers data distributions, overfitting, and the origin gravity effect for small data sets, among other issues. These and other limitations have caused unsatisfactory generation effects for data with complex distributions. In this work, we propose and develop a polynomial hierarchical variational autoencoder (PH-VAE), in which we use a polynomial hierarchical data format to generate or to reconstruct the data distributions. In doing so, we also propose a novel Polynomial Divergence in the loss function to replace or generalize the Kullback-Leibler (KL) divergence, which results in systematic and drastic improvements in both accuracy and reproducibility of the reconstructed distribution function as well as the quality of reconstructed data images, while keeping the dataset size the same but capturing fine resolution of the data. Moreover, we show that the proposed PH-VAE has some form of disentangled representation learning ability.

[LG-52] D3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation

链接: https://arxiv.org/abs/2502.02854
作者: Jiaqing Zhang,Mingjia Yin,Hao Wang,Yawen Li,Yuyang Ye,Xingyu Lou,Junping Du,Enhong Chen
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the era of data-centric AI, the focus of recommender systems has shifted from model-centric innovations to data-centric approaches. The success of modern AI models is built on large-scale datasets, but this also results in significant training costs. Dataset distillation has emerged as a key solution, condensing large datasets to accelerate model training while preserving model performance. However, condensing discrete and sequentially correlated user-item interactions, particularly with extensive item sets, presents considerable challenges. This paper introduces TD3, a novel Tucker Decomposition based Dataset Distillation method within a meta-learning framework, designed for sequential recommendation. TD3 distills a fully expressive synthetic sequence summary from the original data. To efficiently reduce computational complexity and extract refined latent patterns, Tucker decomposition decouples the summary into four factors: a synthetic user latent factor, a temporal dynamics latent factor, a shared item latent factor, and a relation core that models their interconnections. Additionally, a surrogate objective in bi-level optimization is proposed to align feature spaces extracted from models trained on both the original data and the synthetic sequence summary, beyond the naïve performance-matching approach. In the inner loop, an augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the outer loop. To accelerate the optimization process and address long dependencies, RaT-BPTT is employed for bi-level optimization. Experiments and analyses on multiple public datasets have confirmed the superiority and cross-architecture generalizability of the proposed designs. Codes are released at this https URL.
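
The four-factor structure can be seen in a few lines of numpy (shapes hypothetical): the relation core couples the user, temporal, and item factors, and the dense synthetic sequence summary is their Tucker reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_steps, n_items, r1, r2, r3 = 20, 10, 100, 4, 3, 8
U = rng.normal(size=(n_users, r1))   # synthetic user latent factor
T = rng.normal(size=(n_steps, r2))   # temporal dynamics latent factor
V = rng.normal(size=(n_items, r3))   # shared item latent factor
C = rng.normal(size=(r1, r2, r3))    # relation core

# Dense synthetic sequence summary: users x time steps x items
summary = np.einsum("ua,tb,ic,abc->uti", U, T, V, C)
print(summary.shape)  # (20, 10, 100)
```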

[LG-53] Rethinking Latent Representations in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation

链接: https://arxiv.org/abs/2502.02853
作者: Shuanghai Bai,Wanqi Zhou,Pengxiang Ding,Wei Zhao,Donglin Wang,Badong Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 20 pages, 11 figures

点击查看摘要

Abstract:Behavior Cloning (BC) is a widely adopted visual imitation learning method in robot manipulation. Current BC approaches often enhance generalization by leveraging large datasets and incorporating additional visual and textual modalities to capture more diverse information. However, these methods overlook whether the learned representations contain redundant information and lack a solid theoretical foundation to guide the learning process. To address these limitations, we adopt an information-theoretic perspective and introduce mutual information to quantify and mitigate redundancy in latent representations. Building on this, we incorporate the Information Bottleneck (IB) principle into BC, which extends the idea of reducing redundancy by providing a structured framework for compressing irrelevant information while preserving task-relevant features. This work presents the first comprehensive study on redundancy in latent representations across various methods, backbones, and experimental settings, while extending the generalizability of the IB to BC. Extensive experiments and analyses on the CortexBench and LIBERO benchmarks demonstrate significant performance improvements with IB, underscoring the importance of reducing input data redundancy and highlighting its practical value for more practical applications. Project Page: this https URL.
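
A minimal sketch of how an IB penalty can be attached to a behavior-cloning head via the usual variational bound, where the KL of a stochastic latent to a standard normal prior upper-bounds I(X; Z); the paper's architecture and exact objective are not reproduced, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class IBPolicy(nn.Module):
    """BC head with a stochastic latent; the KL to a standard normal prior
    is the usual variational upper bound on I(X; Z)."""
    def __init__(self, obs_dim, z_dim, act_dim):
        super().__init__()
        self.enc = nn.Linear(obs_dim, 2 * z_dim)
        self.dec = nn.Linear(z_dim, act_dim)

    def forward(self, obs):
        mu, logvar = self.enc(obs).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
        return self.dec(z), kl

policy = IBPolicy(obs_dim=32, z_dim=8, act_dim=4)
obs, expert_act = torch.randn(16, 32), torch.randn(16, 4)
pred, kl = policy(obs)
loss = ((pred - expert_act) ** 2).mean() + 1e-3 * kl   # BC loss + IB penalty
```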

[LG-54] Multimodal Brain-Computer Interfaces: AI-powered Decoding Methodologies

链接: https://arxiv.org/abs/2502.02830
作者: Siyang Li,Hongbin Wang,Xiaoqing Chen,Dongrui Wu
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices. This review highlights the core decoding algorithms that enable multimodal BCIs, including a dissection of their elements, a unified view of diversified approaches, and a comprehensive analysis of the present state of the field. We emphasize algorithmic advancements in cross-modality mapping and sequential modeling, in addition to classic multi-modality fusion, illustrating how these novel AI approaches enhance the decoding of brain data. The current literature on BCI applications in visual, speech, and affective decoding is comprehensively explored. Looking forward, we draw attention to the impact of emerging architectures like multimodal Transformers, and discuss challenges such as brain data heterogeneity and common errors. This review also serves as a bridge in this interdisciplinary field between experts with a neuroscience background and experts who study AI, aiming to provide a comprehensive understanding of AI-powered multimodal BCIs.

[LG-55] Slowing Learning by Erasing Simple Features

链接: https://arxiv.org/abs/2502.02820
作者: Lucia Quirke,Nora Belrose
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prior work suggests that neural networks tend to learn low-order moments of the data distribution first, before moving on to higher-order correlations. In this work, we derive a novel closed-form concept erasure method, QLEACE, which surgically removes all quadratically available information about a concept from a representation. Through comparisons with linear erasure (LEACE) and two approximate forms of quadratic erasure, we explore whether networks can still learn when low-order statistics are removed from image classification datasets. We find that while LEACE consistently slows learning, quadratic erasure can exhibit both positive and negative effects on learning speed depending on the choice of dataset, model architecture, and erasure method. Use of QLEACE consistently slows learning in feedforward architectures, but more sophisticated architectures learn to use injected higher order Shannon information about class labels. Its approximate variants avoid injecting information, but surprisingly act as data augmentation techniques on some datasets, enhancing learning speed compared to LEACE.

[LG-56] Accessible and Portable LLM Inference by Compiling Computational Graphs into SQL

链接: https://arxiv.org/abs/2502.02818
作者: Wenbo Sun,Qiming Guo,Wenlu Wang,Rihan Hai
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Serving large language models (LLMs) often demands specialized hardware, dedicated frameworks, and substantial development efforts, which restrict their accessibility, especially for edge devices and organizations with limited technical resources. We propose a novel compiler that translates LLM inference graphs into SQL queries, enabling relational databases, one of the most widely used and mature software systems globally, to serve as the runtime. By mapping neural operators such as matrix multiplication and attention into relational primitives like joins and aggregations, our approach leverages database capabilities, including disk-based data management and native caching. Supporting key transformer components, such as attention mechanisms and key-value caching, our system generates SQL pipelines for end-to-end LLM inference. Using the Llama3 family as a case study, we demonstrate up to 30x speedup in token generation for memory-constrained scenarios comparable to competitive CPU-based frameworks. Our work offers an accessible, portable, and efficient solution, facilitating the serving of LLMs across diverse deployment environments.
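
The core mapping is easy to demonstrate end-to-end with sqlite3: a matrix multiplication stored in coordinate form becomes a join plus aggregation. This is a minimal sketch of the idea, not the paper's compiler.

```python
import sqlite3

# Matrix multiply C = A @ B as a relational join + aggregation,
# with matrices stored in coordinate (row, col, val) form.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE A (row INT, col INT, val REAL);
CREATE TABLE B (row INT, col INT, val REAL);
""")
# A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
con.executemany("INSERT INTO A VALUES (?,?,?)",
                [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)])
con.executemany("INSERT INTO B VALUES (?,?,?)",
                [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)])
rows = con.execute("""
    SELECT A.row, B.col, SUM(A.val * B.val)
    FROM A JOIN B ON A.col = B.row
    GROUP BY A.row, B.col
""").fetchall()
print(rows)  # [(0, 0, 19.0), (0, 1, 22.0), (1, 0, 43.0), (1, 1, 50.0)]
```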

[LG-57] When Machine Learning Gets Personal: Understanding Fairness of Personalized Models ICML2025

链接: https://arxiv.org/abs/2502.02786
作者: Louisa Cornelis,Guillermo Bernárdez,Haewon Jeong,Nina Miolane
类目: Machine Learning (cs.LG)
*备注: 35 pages, 9 figures, submitted to ICML 2025

点击查看摘要

Abstract:Personalization in machine learning involves tailoring models to individual users by incorporating personal attributes such as demographic or medical data. While personalization can improve prediction accuracy, it may also amplify biases and reduce explainability. This work introduces a unified framework to evaluate the impact of personalization on both prediction accuracy and explanation quality across classification and regression tasks. We derive novel upper bounds for the number of personal attributes that can be used to reliably validate benefits of personalization. Our analysis uncovers key trade-offs. We show that regression models can potentially utilize more personal attributes than classification models. We also demonstrate that improvements in prediction accuracy due to personalization do not necessarily translate to enhanced explainability – underscoring the importance of evaluating both metrics when personalizing machine learning models in critical settings such as healthcare. Validated with a real-world dataset, this framework offers practical guidance for balancing accuracy, fairness, and interpretability in personalized models.

[LG-58] OpenSTARLab: Open Approach for Spatio-Temporal Agent Data Analysis in Soccer

链接: https://arxiv.org/abs/2502.02785
作者: Calvin Yeung,Kenjiro Ide,Taiga Someya,Keisuke Fujii
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sports analytics has become both more professional and sophisticated, driven by the growing availability of detailed performance data. This progress enables applications such as match outcome prediction, player scouting, and tactical analysis. In soccer, the effective utilization of event and tracking data is fundamental for capturing and analyzing the dynamics of the game. However, there are two primary challenges: the limited availability of event data, primarily restricted to top-tier teams and leagues, and the scarcity and high cost of tracking data, which complicates its integration with event data for comprehensive analysis. Here we propose OpenSTARLab, an open-source framework designed to democratize spatio-temporal agent data analysis in sports by addressing these key challenges. OpenSTARLab includes the Pre-processing Package that standardizes event and tracking data through Unified and Integrated Event Data and State-Action-Reward formats, the Event Modeling Package that implements deep learning-based event prediction, alongside the RLearn Package for reinforcement learning tasks. These technical components facilitate the handling of diverse data sources and support advanced analytical tasks, thereby enhancing the overall functionality and usability of the framework. To assess OpenSTARLab's effectiveness, we conducted several experimental evaluations. These demonstrate the superior performance of the event prediction model in terms of action and time prediction accuracy, while maintaining robust event simulation performance. Furthermore, reinforcement learning experiments reveal a trade-off between action accuracy and temporal-difference loss and provide comprehensive visualizations. Overall, OpenSTARLab serves as a robust platform for researchers and practitioners, enhancing innovation and collaboration in the field of soccer data analytics.

[LG-59] heoretical Guarantees for Low-Rank Compression of Deep Neural Networks

链接: https://arxiv.org/abs/2502.02766
作者: Shihao Zhang,Rayan Saab
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Deep neural networks have achieved state-of-the-art performance across numerous applications, but their high memory and computational demands present significant challenges, particularly in resource-constrained environments. Model compression techniques, such as low-rank approximation, offer a promising solution by reducing the size and complexity of these networks while only minimally sacrificing accuracy. In this paper, we develop an analytical framework for data-driven post-training low-rank compression. We prove three recovery theorems under progressively weaker assumptions about the approximate low-rank structure of activations, modeling deviations via noise. Our results represent a step toward explaining why data-driven low-rank compression methods outperform data-agnostic approaches and towards theoretically grounded compression algorithms that reduce inference costs while maintaining performance.
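
For reference, the basic (data-agnostic) low-rank compression the theory builds on is a truncated SVD of each weight matrix, stored as two thin factors; the paper's data-driven variants additionally account for activation structure. A minimal numpy sketch:

```python
import numpy as np

def low_rank_compress(W, r):
    """Truncated-SVD compression: W ~ L @ R with thin factors
    L (d_out x r) and R (r x d_in)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * S[:r], Vt[:r]

rng = np.random.default_rng(0)
# weight with a decaying spectrum, so low rank captures most of it
W = rng.normal(size=(256, 512)) @ np.diag(0.9 ** np.arange(512))
L, R = low_rank_compress(W, r=32)
err = np.linalg.norm(W - L @ R) / np.linalg.norm(W)
print(f"relative error at rank 32: {err:.3f}")
```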

[LG-60] LLM -USO: Large Language Model-based Universal Sizing Optimizer

链接: https://arxiv.org/abs/2502.02764
作者: Karthik Somayaji N.S,Peng Li
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The design of analog circuits is a cornerstone of integrated circuit (IC) development, requiring the optimization of complex, interconnected sub-structures such as amplifiers, comparators, and buffers. Traditionally, this process relies heavily on expert human knowledge to refine design objectives by carefully tuning sub-components while accounting for their interdependencies. Existing methods, such as Bayesian Optimization (BO), offer a mathematically driven approach for efficiently navigating large design spaces. However, these methods fall short in two critical areas compared to human expertise: (i) they lack the semantic understanding of the sizing solution space and its direct correlation with design objectives before optimization, and (ii) they fail to reuse knowledge gained from optimizing similar sub-structures across different circuits. To overcome these limitations, we propose the Large Language Model-based Universal Sizing Optimizer (LLM-USO), which introduces a novel method for knowledge representation to encode circuit design knowledge in a structured text format. This representation enables the systematic reuse of optimization insights for circuits with similar sub-structures. LLM-USO employs a hybrid framework that integrates BO with large language models (LLMs) and a learning summary module. This approach serves to: (i) infuse domain-specific knowledge into the BO process and (ii) facilitate knowledge transfer across circuits, mirroring the cognitive strategies of expert designers. Specifically, LLM-USO constructs a knowledge summary mechanism to distill and apply design insights from one circuit to related ones. It also incorporates a knowledge summary critiquing mechanism to ensure the accuracy and quality of the summaries and employs BO-guided suggestion filtering to identify optimal design points efficiently.

[LG-61] ReGNet: Reciprocal Space-Aware Long-Range Modeling and Multi-Property Prediction for Crystals

链接: https://arxiv.org/abs/2502.02748
作者: Jianan Nie,Peiyao Xiao,Kaiyi Ji,Peng Gao
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Predicting properties of crystals from their structures is a fundamental yet challenging task in materials science. Unlike molecules, crystal structures exhibit infinite periodic arrangements of atoms, requiring methods capable of capturing both local and global information effectively. However, most current works fall short of capturing long-range interactions within periodic structures. To address this limitation, we leverage reciprocal space to efficiently encode long-range interactions with learnable filters within Fourier transforms. We introduce Reciprocal Geometry Network (ReGNet), a novel architecture that integrates geometric GNNs and reciprocal blocks to model short-range and long-range interactions, respectively. Additionally, we introduce ReGNet-MT, a multi-task extension that employs mixture of experts (MoE) for multi-property prediction. Experimental results on the JARVIS and Materials Project benchmarks demonstrate that ReGNet achieves significant performance improvements. Moreover, ReGNet-MT attains state-of-the-art results on two bandgap properties due to positive transfer, while maintaining high computational efficiency. These findings highlight the potential of our model as a scalable and accurate solution for crystal property prediction. The code will be released upon paper acceptance.

[LG-62] LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing

链接: https://arxiv.org/abs/2502.02743
作者: Yang Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid advancement in large language models (LLMs) has brought forth a diverse range of models with varying capabilities that excel in different tasks and domains. However, selecting the optimal LLM for user queries often involves a challenging trade-off between accuracy and cost, a problem exacerbated by the diverse demands of individual queries. In this work, we present a novel framework that formulates the LLM selection process as a multi-armed bandit problem, enabling dynamic and intelligent routing of queries to the most appropriate model. Our approach incorporates a preference-conditioned dynamic routing mechanism, allowing users to specify their preferences at inference time, thereby offering a customizable balance between performance and cost. Additionally, our selection policy is designed to generalize to unseen LLMs, ensuring adaptability to new models as they emerge. Experimental results demonstrate that our method achieves significant improvements in both accuracy and cost-effectiveness across various LLM platforms, showcasing the potential of our framework to adaptively optimize LLM selection in real-world scenarios.

[LG-63] Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives

链接: https://arxiv.org/abs/2502.02723
作者: Qinsi Wang,Jinghan Ke,Masayoshi Tomizuka,Yiran Chen,Kurt Keutzer,Chenfeng Xu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We provide a new LLM-compression solution via SVD, unlocking new possibilities for LLM compression beyond quantization and pruning. We point out that the optimal use of SVD lies in truncating activations, rather than merely using activations as an optimization distance. Building on this principle, we address three critical challenges in SVD-based LLM compression: (1) How can we determine the optimal activation truncation position for each weight matrix in LLMs? (2) How can we efficiently reconstruct the weight matrices based on truncated activations? (3) How can we address the inherent "injection" nature that results in the information loss of the SVD? We propose Dobi-SVD, which establishes a new, principled approach to SVD-based LLM compression.

[LG-64] Beyond Topological Self-Explainable GNNs: A Formal Explainability Perspective

链接: https://arxiv.org/abs/2502.02719
作者: Steve Azzolin,Sagar Malhotra,Andrea Passerini,Stefano Teso
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-Explainable Graph Neural Networks (SE-GNNs) are popular explainable-by-design GNNs, but the properties and the limitations of their explanations are not well understood. Our first contribution fills this gap by formalizing the explanations extracted by SE-GNNs, referred to as Trivial Explanations (TEs), and comparing them to established notions of explanations, namely Prime Implicant (PI) and faithful explanations. Our analysis reveals that TEs match PI explanations for a restricted but significant family of tasks. In general, however, they can be less informative than PI explanations and are surprisingly misaligned with widely accepted notions of faithfulness. Although faithful and PI explanations are informative, they are intractable to find and we show that they can be prohibitively large. Motivated by this, we propose Dual-Channel GNNs that integrate a white-box rule extractor and a standard SE-GNN, adaptively combining both channels when the task benefits. Our experiments show that even a simple instantiation of Dual-Channel GNNs can recover succinct rules and perform on par or better than widely used SE-GNNs. Our code can be found in the supplementary material.

[LG-65] Rapidly Adapting Policies to the Real World via Simulation-Guided Fine-Tuning

链接: https://arxiv.org/abs/2502.02705
作者: Patrick Yin,Tyler Westenbroek,Simran Bagaria,Kevin Huang,Ching-an Cheng,Andrey Kobolov,Abhishek Gupta
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robot learning requires a considerable amount of high-quality data to realize the promise of generalization. However, large data sets are costly to collect in the real world. Physics simulators can cheaply generate vast data sets with broad coverage over states, actions, and environments. However, physics engines are fundamentally misspecified approximations to reality. This makes direct zero-shot transfer from simulation to reality challenging, especially in tasks where precise and force-sensitive manipulation is necessary. Thus, fine-tuning these policies with small real-world data sets is an appealing pathway for scaling robot learning. However, current reinforcement learning fine-tuning frameworks leverage general, unstructured exploration strategies which are too inefficient to make real-world adaptation practical. This paper introduces the Simulation-Guided Fine-tuning (SGFT) framework, which demonstrates how to extract structural priors from physics simulators to substantially accelerate real-world adaptation. Specifically, our approach uses a value function learned in simulation to guide real-world exploration. We demonstrate this approach across five real-world dexterous manipulation tasks where zero-shot sim-to-real transfer fails. We further demonstrate our framework substantially outperforms baseline fine-tuning methods, requiring up to an order of magnitude fewer real-world samples and succeeding at difficult tasks where prior approaches fail entirely. Last but not least, we provide theoretical justification for this new paradigm which underpins how SGFT can rapidly learn high-performance policies in the face of large sim-to-real dynamics gaps. Project webpage: this https URL

[LG-66] Scalable Higher Resolution Polar Sea Ice Classification and Freeboard Calculation from ICESat-2 ATL03 Data

链接: https://arxiv.org/abs/2502.02700
作者: Jurdana Masuma Iqrah,Younghyun Koo,Wei Wang,Hongjie Xie,Sushil K. Prasad
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:ICESat-2 (IS2) by NASA is an Earth-observing satellite that measures high-resolution surface elevation. The IS2 ATL07 and ATL10 sea ice elevation and freeboard products consist of 10m-200m segments, each aggregating 150 signal photons from the raw ATL03 (geolocated photon) data. These aggregated products can potentially overestimate local sea surface height, thus underestimating the calculations of freeboard (sea ice height above sea surface). To achieve a higher resolution of sea surface height and freeboard information, in this work we utilize a 2m window to resample the ATL03 data. Then, we classify these 2m segments into thick sea ice, thin ice, and open water using deep learning methods (Long short-term memory and Multi-layer perceptron models). To obtain labeled training data for our deep learning models, we use segmented Sentinel-2 (S2) multi-spectral imagery overlapping with IS2 tracks in space and time to auto-label IS2 data, followed by some manual corrections in the regions of transition between different ice/water types or cloudy regions. We employ a parallel workflow for this auto-labeling using PySpark to scale, and we achieve 9-fold data loading and 16.25-fold map-reduce speedup. To train our models, we employ a Horovod-based distributed deep-learning workflow on a DGX A100 8 GPU cluster, achieving a 7.25-fold speedup. Next, we calculate the local sea surface heights based on the open water segments. Finally, we scale the freeboard calculation using the derived local sea level and achieve 8.54-fold data loading and 15.7-fold map-reduce speedup. Compared with the ATL07 (local sea level) and ATL10 (freeboard) data products, our results show higher resolutions and accuracy (96.56%).

[LG-67] Pseudo-Physics-Informed Neural Operators: Enhancing Operator Learning from Limited Data

链接: https://arxiv.org/abs/2502.02682
作者: Keyan Chen,Yile Li,Da Long,Zhitong Xu,Wei Xing,Jacob Hochhalter,Shandian Zhe
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Neural operators have shown great potential in surrogate modeling. However, training a well-performing neural operator typically requires a substantial amount of data, which can pose a major challenge in complex applications. In such scenarios, detailed physical knowledge can be unavailable or difficult to obtain, and collecting extensive data is often prohibitively expensive. To mitigate this challenge, we propose the Pseudo Physics-Informed Neural Operator (PPI-NO) framework. PPI-NO constructs a surrogate physics system for the target system using partial differential equations (PDEs) derived from simple, rudimentary physics principles, such as basic differential operators. This surrogate system is coupled with a neural operator model, using an alternating update and learning process to iteratively enhance the model's predictive power. While the physics derived via PPI-NO may not mirror the ground-truth underlying physical laws – hence the term "pseudo physics" – this approach significantly improves the accuracy of standard operator learning models in data-scarce scenarios, which is evidenced by extensive evaluations across five benchmark tasks and a fatigue modeling application.

[LG-68] Recovering Imbalanced Clusters via Gradient-Based Projection Pursuit

链接: https://arxiv.org/abs/2502.02668
作者: Martin Eppert,Satyaki Mukherjee,Debarghya Ghoshdastidar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Projection Pursuit is a classic exploratory technique for finding interesting projections of a dataset. We propose a method for recovering projections containing either Imbalanced Clusters or a Bernoulli-Rademacher distribution using a gradient-based technique to optimize the projection index. As sample complexity is a major limiting factor in Projection Pursuit, we analyze our algorithm’s sample complexity within a Planted Vector setting where we can observe that Imbalanced Clusters can be recovered more easily than balanced ones. Additionally, we give a generalized result that works for a variety of data distributions and projection indices. We compare these results to computational lower bounds in the Low-Degree-Polynomial Framework. Finally, we experimentally evaluate our method’s applicability to real-world data using FashionMNIST and the Human Activity Recognition Dataset, where our algorithm outperforms others when only a few samples are available.
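
To make the "gradient-based technique to optimize the projection index" concrete, here is a hypothetical PyTorch sketch using a kurtosis-style index on a planted imbalanced cluster; the paper's exact index and optimizer may differ:

```python
import torch

torch.manual_seed(0)
# Synthetic data: a small planted cluster shifted along axis 0.
x = torch.randn(1000, 10)
x[:50, 0] += 4.0

w = torch.randn(10, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    u = w / w.norm()                 # constrain to the unit sphere
    z = x @ u
    z = (z - z.mean()) / z.std()
    loss = -((z ** 4).mean())        # maximize a kurtosis-style index
    loss.backward()
    opt.step()

u = (w / w.norm()).detach()
print("recovered direction, first coordinate:", round(float(u[0]), 2))
```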

[LG-69] Learning to Double Guess: An Active Perception Approach for Estimating the Center of Mass of Arbitrary Objects ICRA25

链接: https://arxiv.org/abs/2502.02663
作者: Shengmiao Jin,Yuchen Mo,Wenzhen Yuan
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted to ICRA 25; 7 pages, 5 figures

点击查看摘要

Abstract:Manipulating arbitrary objects in unstructured environments is a significant challenge in robotics, primarily due to difficulties in determining an object’s center of mass. This paper introduces U-GRAPH: Uncertainty-Guided Rotational Active Perception with Haptics, a novel framework to enhance the center of mass estimation using active perception. Traditional methods often rely on single interaction and are limited by the inherent inaccuracies of Force-Torque (F/T) sensors. Our approach circumvents these limitations by integrating a Bayesian Neural Network (BNN) to quantify uncertainty and guide the robotic system through multiple, information-rich interactions via grid search and a neural network that scores each action. We demonstrate the remarkable generalizability and transferability of our method with training on a small dataset with limited variation yet still perform well on unseen complex real-world objects.

[LG-70] Bayesian Parameter Shift Rule in Variational Quantum Eigensolvers

链接: https://arxiv.org/abs/2502.02625
作者: Samuele Pedrielli,Christopher J. Anders,Lena Funcke,Karl Jansen,Kim A. Nicoli,Shinichi Nakajima
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 8 pages, 5 figures. arXiv admin note: text overlap with arXiv:2502.01704

点击查看摘要

Abstract:Parameter shift rules (PSRs) are key techniques for efficient gradient estimation in variational quantum eigensolvers (VQEs). In this paper, we propose its Bayesian variant, where Gaussian processes with appropriate kernels are used to estimate the gradient of the VQE objective. Our Bayesian PSR offers flexible gradient estimation from observations at arbitrary locations with uncertainty information and reduces to the generalized PSR in special cases. In stochastic gradient descent (SGD), the flexibility of Bayesian PSR allows the reuse of observations in previous steps, which accelerates the optimization process. Furthermore, the accessibility to the posterior uncertainty, along with our proposed notion of gradient confident region (GradCoRe), enables us to minimize the observation costs in each SGD step. Our numerical experiments show that the VQE optimization with Bayesian PSR and GradCoRe significantly accelerates SGD and outperforms the state-of-the-art methods, including sequential minimal optimization.

[LG-71] Physically Interpretable Representation and Controlled Generation for Turbulence Data

链接: https://arxiv.org/abs/2502.02605
作者: Tiffany Fan,Murray Cutforth,Marta D’Elia,Alexandre Cortiella,Alireza Doostan,Eric Darve
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Computational Fluid Dynamics (CFD) plays a pivotal role in fluid mechanics, enabling precise simulations of fluid behavior through partial differential equations (PDEs). However, traditional CFD methods are resource-intensive, particularly for high-fidelity simulations of complex flows, which are further complicated by high dimensionality, inherent stochasticity, and limited data availability. This paper addresses these challenges by proposing a data-driven approach that leverages a Gaussian Mixture Variational Autoencoder (GMVAE) to encode high-dimensional scientific data into low-dimensional, physically meaningful representations. The GMVAE learns a structured latent space where data can be categorized based on physical properties such as the Reynolds number while maintaining global physical consistency. To assess the interpretability of the learned representations, we introduce a novel metric based on graph spectral theory, quantifying the smoothness of physical quantities along the latent manifold. We validate our approach using 2D Navier-Stokes simulations of flow past a cylinder over a range of Reynolds numbers. Our results demonstrate that the GMVAE provides improved clustering, meaningful latent structure, and robust generative capabilities compared to baseline dimensionality reduction methods. This framework offers a promising direction for data-driven turbulence modeling and broader applications in computational fluid dynamics and engineering systems.

[LG-72] Linearized Optimal Transport pyLOT Library: A Toolkit for Machine Learning on Point Clouds

链接: https://arxiv.org/abs/2502.03439
作者: Jun Linwu,Varun Khurana,Nicholas Karris,Alexander Cloninger
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Mathematical Software (cs.MS); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:The pyLOT library offers a Python implementation of linearized optimal transport (LOT) techniques and methods to use in downstream tasks. The pipeline embeds probability distributions into a Hilbert space via the Optimal Transport maps from a fixed reference distribution, and this linearization allows downstream tasks to be completed using off-the-shelf (linear) machine learning algorithms. We provide a case study of performing ML on 3D scans of lemur teeth, where the original questions of classification, clustering, dimension reduction, and data generation reduce to simple linear operations performed on the LOT-embedded representations.
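
Since the abstract does not spell out the pyLOT API, here is a hedged sketch of the underlying LOT embedding written directly with the POT library; the reference cloud and data are made up:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
reference = rng.standard_normal((64, 3))   # fixed reference point cloud
a = np.full(64, 1 / 64)                    # uniform reference weights

def lot_embed(cloud: np.ndarray) -> np.ndarray:
    b = np.full(len(cloud), 1 / len(cloud))
    M = ot.dist(reference, cloud)          # squared-Euclidean cost matrix
    plan = ot.emd(a, b, M)                 # exact OT plan
    t_map = (plan @ cloud) / a[:, None]    # barycentric projection (OT map)
    return (t_map - reference).ravel()     # fixed-length Hilbert embedding

clouds = [rng.standard_normal((100, 3)) + i for i in range(3)]
X = np.stack([lot_embed(c) for c in clouds])
print(X.shape)  # (3, 192): ready for off-the-shelf linear ML methods
```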

[LG-73] Taking a Big Step: Large Learning Rates in Denoising Score Matching Prevent Memorization

链接: https://arxiv.org/abs/2502.03435
作者: Yu-Han Wu,Pierre Marion,Gérard Biau,Claire Boyer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Denoising score matching plays a pivotal role in the performance of diffusion-based generative models. However, the empirical optimal score–the exact solution to the denoising score matching–leads to memorization, where generated samples replicate the training data. Yet, in practice, only a moderate degree of memorization is observed, even without explicit regularization. In this paper, we investigate this phenomenon by uncovering an implicit regularization mechanism driven by large learning rates. Specifically, we show that in the small-noise regime, the empirical optimal score exhibits high irregularity. We then prove that, when trained by stochastic gradient descent with a large enough learning rate, neural networks cannot stably converge to a local minimum with arbitrarily small excess risk. Consequently, the learned score cannot be arbitrarily close to the empirical optimal score, thereby mitigating memorization. To make the analysis tractable, we consider one-dimensional data and two-layer neural networks. Experiments validate the crucial role of the learning rate in preventing memorization, even beyond the one-dimensional setting.

[LG-74] Optimal Task Order for Continual Learning of Multiple Tasks

链接: https://arxiv.org/abs/2502.03350
作者: Ziyan Li,Naoki Hiratani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning of multiple tasks remains a major challenge for neural networks. Here, we investigate how task order influences continual learning and propose a strategy for optimizing it. Leveraging a linear teacher-student model with latent factors, we derive an analytical expression relating task similarity and ordering to learning performance. Our analysis reveals two principles that hold under a wide parameter range: (1) tasks should be arranged from the least representative to the most typical, and (2) adjacent tasks should be dissimilar. We validate these rules on both synthetic data and real-world image classification datasets (Fashion-MNIST, CIFAR-10, CIFAR-100), demonstrating consistent performance improvements in both multilayer perceptrons and convolutional neural networks. Our work thus presents a generalizable framework for task-order optimization in task-incremental continual learning.

[LG-75] A Mixture-Based Framework for Guiding Diffusion Models

链接: https://arxiv.org/abs/2502.03332
作者: Yazid Janati,Badr Moufad,Mehdi Abou El Qassime,Alain Durmus,Eric Moulines,Jimmy Olsson
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Denoising diffusion models have driven significant progress in the field of Bayesian inverse problems. Recent approaches use pre-trained diffusion models as priors to solve a wide range of such problems, only leveraging inference-time compute and thereby eliminating the need to retrain task-specific models on the same dataset. To approximate the posterior of a Bayesian inverse problem, a diffusion model samples from a sequence of intermediate posterior distributions, each with an intractable likelihood function. This work proposes a novel mixture approximation of these intermediate distributions. Since direct gradient-based sampling of these mixtures is infeasible due to intractable terms, we propose a practical method based on Gibbs sampling. We validate our approach through extensive experiments on image inverse problems, utilizing both pixel- and latent-space diffusion priors, as well as on source separation with an audio diffusion model. The code is available at this https URL

[LG-76] Is In-Context Universality Enough? MLPs are Also Universal In-Context

链接: https://arxiv.org/abs/2502.03327
作者: Anastasis Kratsios,Takashi Furuya
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA); Probability (math.PR)
*备注:

点击查看摘要

Abstract:The success of transformers is often linked to their ability to perform in-context learning. Recent work shows that transformers are universal in context, capable of approximating any real-valued continuous function of a context (a probability measure over $\mathcal{X} \subseteq \mathbb{R}^d$) and a query $x \in \mathcal{X}$. This raises the question: Does in-context universality explain their advantage over classical models? We answer this in the negative by proving that MLPs with trainable activation functions are also universal in-context. This suggests the transformer's success is likely due to other factors like inductive bias or training stability.

[LG-77] CARROT: A Cost Aware Rate Optimal Router

链接: https://arxiv.org/abs/2502.03261
作者: Seamus Somerstep,Felipe Maia Polo,Allysson Flavio Melo de Oliveira,Prattyush Mangal,Mírian Silva,Onkar Bhardwaj,Mikhail Yurochkin,Subha Maity
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:With the rapid growth in the number of Large Language Models (LLMs), there has been a recent interest in LLM routing, or directing queries to the cheapest LLM that can deliver a suitable response. Following this line of work, we introduce CARROT, a Cost AwaRe Rate Optimal rouTer that can select models based on any desired trade-off between performance and cost. Given a query, CARROT selects a model based on estimates of models’ cost and performance. Its simplicity lends CARROT computational efficiency, while our theoretical analysis demonstrates minimax rate-optimality in its routing performance. Alongside CARROT, we also introduce the Smart Price-aware Routing (SPROUT) dataset to facilitate routing on a wide spectrum of queries with the latest state-of-the-art LLMs. Using SPROUT and prior benchmarks such as Routerbench and open-LLM-leaderboard-v2 we empirically validate CARROT’s performance against several alternative routers.
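
As a toy illustration of cost-aware routing (not CARROT's estimator or its minimax-optimal rule), one can trade predicted quality against predicted price with a single penalty weight; all numbers below are made up:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    est_quality: float   # predicted task performance in [0, 1]
    est_cost: float      # predicted dollars per query

def route(cands: list[Candidate], lam: float) -> Candidate:
    """Trade off performance against cost with weight lam >= 0."""
    return max(cands, key=lambda c: c.est_quality - lam * c.est_cost)

models = [
    Candidate("small-llm", 0.70, 0.0002),
    Candidate("mid-llm",   0.82, 0.0020),
    Candidate("frontier",  0.90, 0.0300),
]
for lam in (0.0, 5.0, 200.0):
    print(lam, route(models, lam).name)
```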

[LG-78] From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning

链接: https://arxiv.org/abs/2502.03210
作者: Noa Rubin,Kirsten Fischer,Javed Lindner,David Dahmen,Inbar Seroussi,Zohar Ringel,Michael Krämer,Moritz Helias
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 6 figures

点击查看摘要

Abstract:Theoretically describing feature learning in neural networks is crucial for understanding their expressive power and inductive biases, motivating various approaches. Some approaches describe network behavior after training through a simple change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving complex directional changes to the kernel. While these approaches capture different facets of network behavior, their relationship and respective strengths across scaling regimes remain an open question. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these approaches. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network’s probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output of a linear network. However, even in this case, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.

[LG-79] SimSort: A Powerful Framework for Spike Sorting by Large-Scale Electrophysiology Simulation

链接: https://arxiv.org/abs/2502.03198
作者: Yimu Zhang,Dongqi Han,Yansen Wang,Yu Gu,Dongsheng Li
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spike sorting is an essential process in neural recording, which identifies and separates electrical signals from individual neurons recorded by electrodes in the brain, enabling researchers to study how specific neurons communicate and process information. Although there exist a number of spike sorting methods which have contributed to significant neuroscientific breakthroughs, many are heuristically designed, making it challenging to verify their correctness due to the difficulty of obtaining ground truth labels from real-world neural recordings. In this work, we explore a data-driven, deep learning-based approach. We begin by creating a large-scale dataset through electrophysiology simulations using biologically realistic computational models. We then present SimSort, a pretraining framework for spike sorting. Remarkably, when trained on our simulated dataset, SimSort demonstrates strong zero-shot generalization to real-world spike sorting tasks, significantly outperforming existing methods. Our findings underscore the potential of data-driven techniques to enhance the reliability and scalability of spike sorting in experimental neuroscience.

[LG-80] Signature Reconstruction from Randomized Signatures

链接: https://arxiv.org/abs/2502.03163
作者: Mie Glückstad,Nicola Muca Cirone,Josef Teichmann
类目: Classical Analysis and ODEs (math.CA); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 37 pages, 7 figures

点击查看摘要

Abstract:Controlled ordinary differential equations driven by continuous bounded variation curves can be considered a continuous time analogue of recurrent neural networks for the construction of expressive features of the input curves. We ask to what extent well-known signature features of such curves can be reconstructed from controlled ordinary differential equations with (untrained) random vector fields. The answer turns out to be algebraically involved, but essentially the number of signature features which can be reconstructed from the non-linear flow of the controlled ordinary differential equation is exponential in its hidden dimension, when the vector fields are chosen to be neural with depth two. Moreover, we characterize a general linear independence condition on arbitrary vector fields, under which the signature features up to some fixed order can always be reconstructed. Algebraically speaking, this complements in a quantitative manner several well-known results from the theory of Lie algebras of vector fields and puts them in a context of machine learning.

[LG-81] Fast Sampling of Cosmological Initial Conditions with Gaussian Neural Posterior Estimation

链接: https://arxiv.org/abs/2502.03139
作者: Oleg Savchenko,Guillermo Franco Abellán,Florian List,Noemi Anau Montel,Christoph Weniger
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 9 + 2 pages, 7 figures, 1 table. Comments welcome!

点击查看摘要

Abstract:Knowledge of the primordial matter density field from which the large-scale structure of the Universe emerged over cosmic time is of fundamental importance for cosmology. However, reconstructing these cosmological initial conditions from late-time observations is a notoriously difficult task, which requires advanced cosmological simulators and sophisticated statistical methods to explore a multi-million-dimensional parameter space. We show how simulation-based inference (SBI) can be used to tackle this problem and to obtain data-constrained realisations of the primordial dark matter density field in a simulation-efficient way with general non-differentiable simulators. Our method is applicable to full high-resolution dark matter $N$-body simulations and is based on modelling the posterior distribution of the constrained initial conditions to be Gaussian with a diagonal covariance matrix in Fourier space. As a result, we can generate thousands of posterior samples within seconds on a single GPU, orders of magnitude faster than existing methods, paving the way for sequential SBI for cosmological fields. Furthermore, we perform an analytical fit of the estimated dependence of the covariance on the wavenumber, effectively transforming any point-estimator of initial conditions into a fast sampler. We test the validity of our obtained samples by comparing them to the true values with summary statistics and performing a Bayesian consistency test.

[LG-82] Comparison of the Cox proportional hazards model and Random Survival Forest algorithm for predicting patient-specific survival probabilities in clinical trial data

链接: https://arxiv.org/abs/2502.03119
作者: Ricarda Graf,Susan Todd,M. Fazil Baksh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Cox proportional hazards model is often used for model development in data from randomized controlled trials (RCT) with time-to-event outcomes. Random survival forests (RSF) is a machine-learning algorithm known for its high predictive performance. We conduct a comprehensive neutral comparison study to compare the predictive performance of Cox regression and RSF in real-world as well as simulated data. Performance is compared using multiple performance measures according to recommendations for the comparison of prognostic prediction models. We found that while the RSF usually outperforms the Cox model when using the C index, Cox model predictions may be better calibrated. With respect to overall performance, the Cox model often exceeds the RSF in nonproportional hazards settings, while otherwise the RSF typically performs better especially for smaller sample sizes. Overall performance of the RSF is more affected by higher censoring rates, while overall performance of the Cox model suffers more from smaller sample sizes.
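
A minimal sketch of such a comparison with scikit-survival, on synthetic data standing in for the trial data (the paper's full protocol uses several performance measures beyond the C-index shown here, and a proper train/test split):

```python
import numpy as np
from sksurv.util import Surv
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
risk = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.25])
time = rng.exponential(np.exp(-risk))          # event times tied to risk
event = rng.random(300) < 0.7                  # crude ~30% censoring
y = Surv.from_arrays(event=event, time=time)

for model in (CoxPHSurvivalAnalysis(),
              RandomSurvivalForest(n_estimators=100, random_state=0)):
    model.fit(X, y)
    pred = model.predict(X)                    # higher = higher risk
    cidx = concordance_index_censored(event, time, pred)[0]
    print(type(model).__name__, round(cidx, 3))
```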

[LG-83] A Bayesian perspective on single-shot laser characterization

链接: https://arxiv.org/abs/2502.03100
作者: J. Esslinger,N. Weisse,C. Eberle,J. Schroeder,S. Howard,P. Norreys,S. Karsch,A. Döpp
类目: Optics (physics.optics); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
*备注:

点击查看摘要

Abstract:We introduce a Bayesian framework for measuring spatio-temporal couplings (STCs) in ultra-intense lasers that reconceptualizes what constitutes a ‘single-shot’ measurement. Moving beyond traditional distinctions between single- and multi-shot devices, our approach provides rigorous criteria for determining when measurements can truly resolve individual laser shots rather than statistical averages. This framework shows that single-shot capability is not an intrinsic device property but emerges from the relationship between measurement precision and inherent parameter variability. Implementing this approach with a new measurement device at the ATLAS-3000 petawatt laser, we provide the first quantitative uncertainty bounds on pulse front tilt and curvature. Notably, we observe that our Bayesian method reduces uncertainty by up to 60% compared to traditional approaches. Through this analysis, we reveal how the interplay between measurement precision and intrinsic system variability defines achievable resolution – insights that have direct implications for applications where precise control of laser-matter interaction is critical.

[LG-84] Time Series Anomaly Detection in the Frequency Domain with Statistical Reliability

链接: https://arxiv.org/abs/2502.03062
作者: Akifumi Yamada,Tomohiro Shiraishi,Shuichi Nishino,Teruyuki Katsuoka,Kouichi Taji,Ichiro Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Effective anomaly detection in complex systems requires identifying change points (CPs) in the frequency domain, as abnormalities often arise across multiple frequencies. This paper extends recent advancements in statistically significant CP detection, based on Selective Inference (SI), to the frequency domain. The proposed SI method quantifies the statistical significance of detected CPs in the frequency domain using $p$-values, ensuring that the detected changes reflect genuine structural shifts in the target system. We address two major technical challenges to achieve this. First, we extend the existing SI framework to the frequency domain by appropriately utilizing the properties of the discrete Fourier transform (DFT). Second, we develop an SI method that provides valid $p$-values for CPs where changes occur across multiple frequencies. Experimental results demonstrate that the proposed method reliably identifies genuine CPs with strong statistical guarantees, enabling more accurate root-cause analysis in the frequency domain of complex systems.

[LG-85] An analysis of optimization problems involving ReLU neural networks

链接: https://arxiv.org/abs/2502.03016
作者: Christoph Plate,Mirko Hahn,Alexander Klimek,Caroline Ganzer,Kai Sundmacher,Sebastian Sager
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving mixed-integer optimization problems with embedded neural networks with ReLU activation functions is challenging. Big-M coefficients that arise in relaxing binary decisions related to these functions grow exponentially with the number of layers. We survey and propose different approaches to analyze and improve the run time behavior of mixed-integer programming solvers in this context. Among them are clipped variants and regularization techniques applied during training as well as optimization-based bound tightening and a novel scaling for given ReLU networks. We numerically compare these approaches for three benchmark problems from the literature. We use the number of linear regions, the percentage of stable neurons, and overall computational effort as indicators. As a major takeaway we observe and quantify a trade-off between the often desired redundancy of neural network models versus the computational costs for solving related optimization problems.
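
For readers unfamiliar with the big-M issue mentioned above, a small assumed PuLP example encodes one ReLU unit y = max(0, a) with a binary activity variable; looser M values weaken the relaxation, which is what the surveyed techniques try to tighten:

```python
import pulp

M = 100.0                                   # big-M bound on |a|
prob = pulp.LpProblem("relu_demo", pulp.LpMaximize)

a = pulp.LpVariable("preactivation", -M, M)
y = pulp.LpVariable("postactivation", 0, M)
z = pulp.LpVariable("active", cat="Binary")  # 1 if the unit is "on"

prob += y >= a                               # y is at least a
prob += y <= a + M * (1 - z)                 # y tracks a when active
prob += y <= M * z                           # y forced to 0 when inactive
prob += a == -3.0                            # fix the input for the demo
prob += y                                    # objective: maximize y

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(y))                         # 0.0 = ReLU(-3)
```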

[LG-86] Building Bridges between Regression, Clustering and Classification

链接: https://arxiv.org/abs/2502.02996
作者: Lawrence Stewart(DI-ENS, LIENS, Inria),Francis Bach(LIENS, SIERRA),Quentin Berthet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Regression, the task of predicting a continuous scalar target $y$ based on some features $x$, is one of the most fundamental tasks in machine learning and statistics. It has been observed and theoretically analyzed that the classical approach, mean-squared error minimization, can lead to suboptimal results when training neural networks. In this work, we propose a new method to improve the training of these models on regression tasks with continuous scalar targets. Our method is based on casting this task in a different fashion, using a target encoder and a prediction decoder, inspired by approaches in classification and clustering. We showcase the performance of our method on a wide range of real-world datasets.
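
A hedged sketch of the general encode/decode idea (bin the scalar target, classify, then decode via the expected bin center); the paper's actual encoder and decoder may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 8))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(2000)

K = 32
edges = np.quantile(y, np.linspace(0, 1, K + 1))
centers = 0.5 * (edges[:-1] + edges[1:])
labels = np.clip(np.digitize(y, edges[1:-1]), 0, K - 1)  # encode targets

clf = LogisticRegression(max_iter=2000).fit(X, labels)
proba = clf.predict_proba(X)                   # soft class assignment
y_hat = proba @ centers[clf.classes_]          # decode: expected bin center

print(np.sqrt(np.mean((y - y_hat) ** 2)))      # RMSE of decoded predictions
```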

[LG-87] Data denoising with self consistency, variance maximization and the Kantorovich dominance

链接: https://arxiv.org/abs/2502.02925
作者: Joshua Zoen-Git Hiew,Tongseok Lim,Brendan Pass,Marcelo Cruz de Souza
类目: Methodology (stat.ME); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We introduce a new framework for data denoising, partially inspired by martingale optimal transport. For a given noisy distribution (the data), our approach involves finding the closest distribution to it among all distributions which 1) have a particular prescribed structure (expressed by requiring they lie in a particular domain), and 2) are self-consistent with the data. We show that this amounts to maximizing the variance among measures in the domain which are dominated in convex order by the data. For particular choices of the domain, this problem and a relaxed version of it, in which the self-consistency condition is removed, are intimately related to various classical approaches to denoising. We prove that our general problem has certain desirable features: solutions exist under mild assumptions, have certain robustness properties, and, for very simple domains, coincide with solutions to the relaxed problem. We also introduce a novel relationship between distributions, termed Kantorovich dominance, which retains certain aspects of the convex order while being a weaker, more robust, and easier-to-verify condition. Building on this, we propose and analyze a new denoising problem by substituting the convex order in the previously described framework with Kantorovich dominance. We demonstrate that this revised problem shares some characteristics with the full convex order problem but offers enhanced stability, greater computational efficiency, and, in specific domains, more meaningful solutions. Finally, we present simple numerical examples illustrating solutions for both the full convex order problem and the Kantorovich dominance problem.

[LG-88] AI-driven materials design: a mini-review

链接: https://arxiv.org/abs/2502.02905
作者: Mouyang Cheng,Chu-Liang Fu,Ryotaro Okabe,Abhijatmedhi Chotrattanapituk,Artittaya Boonkird,Nguyen Tuan Hung,Mingda Li
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 18 pages, 7 figures, 1 table; Review article

点击查看摘要

Abstract:Materials design is an important component of modern science and technology, yet traditional approaches rely heavily on trial-and-error and can be inefficient. Computational techniques, enhanced by modern artificial intelligence (AI), have greatly accelerated the design of new materials. Among these approaches, inverse design has shown great promise in designing materials that meet specific property requirements. In this mini-review, we summarize key computational advancements for materials design over the past few decades. We follow the evolution of relevant materials design techniques, from high-throughput forward machine learning (ML) methods and evolutionary algorithms, to advanced AI strategies like reinforcement learning (RL) and deep generative models. We highlight the paradigm shift from conventional screening approaches to inverse generation driven by deep generative models. Finally, we discuss current challenges and future perspectives of materials inverse design. This review may serve as a brief guide to the approaches, progress, and outlook of designing future functional materials with technological relevance.

[LG-89] Uncertainty Quantification with the Empirical Neural Tangent Kernel

链接: https://arxiv.org/abs/2502.02870
作者: Joseph Wilson,Chris van der Heide,Liam Hodgkinson,Fred Roosta
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 5 figures, 9 tables

点击查看摘要

Abstract:While neural networks have demonstrated impressive performance across various tasks, accurately quantifying uncertainty in their predictions is essential to ensure their trustworthiness and enable widespread adoption in critical systems. Several Bayesian uncertainty quantification (UQ) methods exist that are either cheap or reliable, but not both. We propose a post-hoc, sampling-based UQ method for over-parameterized networks at the end of training. Our approach constructs efficient and meaningful deep ensembles by employing a (stochastic) gradient-descent sampling process on appropriately linearized networks. We demonstrate that our method effectively approximates the posterior of a Gaussian process using the empirical Neural Tangent Kernel. Through a series of numerical experiments, we show that our method not only outperforms competing approaches in computational efficiency (often reducing costs by multiple factors) but also maintains state-of-the-art performance across a variety of UQ metrics for both regression and classification tasks.
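
A small assumed sketch of computing an empirical NTK Gram matrix at the end of training, the core quantity the method builds on; per-sample output gradients are stacked and their inner products form the kernel:

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))

def grad_vector(x):
    # Gradient of the scalar network output w.r.t. all parameters.
    net.zero_grad()
    net(x).sum().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

xs = [torch.randn(1, 4) for _ in range(5)]
G = torch.stack([grad_vector(x) for x in xs])   # (n, n_params)
ntk = G @ G.T                                   # empirical NTK Gram matrix
print(ntk.shape)
```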

[LG-90] Algorithms with Calibrated Machine Learning Predictions

链接: https://arxiv.org/abs/2502.02861
作者: Judy Shen,Ellen Vitercik,Anders Wikum
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. While this theoretical framework often assumes uniform reliability across all predictions, modern machine learning models can now provide instance-level uncertainty estimates. In this paper, we propose calibration as a principled and practical tool to bridge this gap, demonstrating the benefits of calibrated advice through two case studies: the ski rental and online job scheduling problems. For ski rental, we design an algorithm that achieves optimal prediction-dependent performance and prove that, in high-variance settings, calibrated advice offers more effective guidance than alternative methods for uncertainty quantification. For job scheduling, we demonstrate that using a calibrated predictor leads to significant performance improvements over existing methods. Evaluations on real-world data validate our theoretical findings, highlighting the practical impact of calibration for algorithms with predictions.
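
For the ski rental case study, a toy sketch of prediction-aided renting; the trust-parameter rule below is one standard way to consume advice, while the paper's calibrated variant refines how much to trust each prediction:

```python
def ski_rental_cost(true_days: int, predicted_days: float,
                    buy_price: int, trust: float = 0.5) -> int:
    """Rent until a prediction-dependent threshold, then buy.

    trust in (0, 1]: smaller means following the prediction more
    aggressively; rental is 1 per day.
    """
    if predicted_days >= buy_price:
        threshold = max(1, int(trust * buy_price))   # prediction says: buy early
    else:
        threshold = int(buy_price / trust)           # prediction says: keep renting
    if true_days < threshold:
        return true_days                   # rented every day
    return (threshold - 1) + buy_price     # rented, then bought

print(ski_rental_cost(true_days=40, predicted_days=50, buy_price=10))
print(ski_rental_cost(true_days=40, predicted_days=3, buy_price=10))
```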

[LG-91] Gap-Dependent Bounds for Federated Q-learning

链接: https://arxiv.org/abs/2502.02859
作者: Haochen Zhang,Zhong Zheng,Lingzhou Xue
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to $\sqrt{T}$-type regret bounds and communication cost bounds with a $\log T$ term scaling with the number of agents $M$, states $S$, and actions $A$, where $T$ is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a $\log T$-type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap-dependent regret bound reveals a distinct multi-agent speedup pattern, and our gap-dependent communication cost bound removes the dependence on $MSA$ from the $\log T$ term. Notably, our gap-dependent communication cost bound also yields a better global switching cost when $M=1$, removing $SA$ from the $\log T$ term.

[LG-92] Achievable distributional robustness when the robust risk is only partially identified

链接: https://arxiv.org/abs/2502.02710
作者: Julia Kostin,Nicola Gnecco,Fanny Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In safety-critical applications, machine learning models should generalize well under worst-case distribution shifts, that is, have a small robust risk. Invariance-based algorithms can provably take advantage of structural assumptions on the shifts when the training distributions are heterogeneous enough to identify the robust risk. However, in practice, such identifiability conditions are rarely satisfied – a scenario so far underexplored in the theoretical literature. In this paper, we aim to fill the gap and propose to study the more general setting when the robust risk is only partially identifiable. In particular, we introduce the worst-case robust risk as a new measure of robustness that is always well-defined regardless of identifiability. Its minimum corresponds to an algorithm-independent (population) minimax quantity that measures the best achievable robustness under partial identifiability. While these concepts can be defined more broadly, in this paper we introduce and derive them explicitly for a linear model for concreteness of the presentation. First, we show that existing robustness methods are provably suboptimal in the partially identifiable case. We then evaluate these methods and the minimizer of the (empirical) worst-case robust risk on real-world gene expression data and find a similar trend: the test error of existing robustness methods grows increasingly suboptimal as the fraction of data from unseen environments increases, whereas accounting for partial identifiability allows for better generalization.

[LG-93] Three-dimensional signal processing: a new approach in dynamical sampling via tensor products

链接: https://arxiv.org/abs/2502.02684
作者: Yisen Wang,Hanqin Cai,Longxiu Huang
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The dynamical sampling problem is centered around reconstructing signals that evolve over time according to a dynamical process, from spatial-temporal samples that may be noisy. This topic has been thoroughly explored for one-dimensional signals. Multidimensional signal recovery has also been studied, but primarily in scenarios where the driving operator is a convolution operator. In this work, we shift our focus to the dynamical sampling problem in the context of three-dimensional signal recovery, where the evolution system can be characterized by tensor products. Specifically, we provide a necessary condition for the sampling set that ensures successful recovery of the three-dimensional signal. Furthermore, we reformulate the reconstruction problem as an optimization task, which can be solved efficiently. To demonstrate the effectiveness of our approach, we include some straightforward numerical simulations that showcase the reconstruction performance.

[LG-94] Networks with Finite VC Dimension: Pro and Contra

链接: https://arxiv.org/abs/2502.02679
作者: Vera Kurkova,Marcello Sanguineti
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Approximation and learning of classifiers of large data sets by neural networks in terms of high-dimensional geometry and statistical learning theory are investigated. The influence of the VC dimension of sets of input-output functions of networks on approximation capabilities is compared with its influence on consistency in learning from samples of data. It is shown that, whereas finite VC dimension is desirable for uniform convergence of empirical errors, it may not be desirable for approximation of functions drawn from a probability distribution modeling the likelihood that they occur in a given type of application. Based on the concentration-of-measure properties of high dimensional geometry, it is proven that both errors in approximation and empirical errors behave almost deterministically for networks implementing sets of input-output functions with finite VC dimensions in processing large data sets. Practical limitations of the universal approximation property, the trade-offs between the accuracy of approximation and consistency in learning from data, and the influence of depth of networks with ReLU units on their accuracy and consistency are discussed.

[LG-95] Regret-Optimized Portfolio Enhancement through Deep Reinforcement Learning and Future Looking Rewards

链接: https://arxiv.org/abs/2502.02619
作者: Daniil Karzanov,Rubén Garzón,Mikhail Terekhov,Caglar Gulcehre,Thomas Raffinot,Marcin Detyniecki
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:This paper introduces a novel agent-based approach for enhancing existing portfolio strategies using Proximal Policy Optimization (PPO). Rather than focusing solely on traditional portfolio construction, our approach aims to improve an already high-performing strategy through dynamic rebalancing driven by PPO and Oracle agents. Our target is to enhance the traditional 60/40 benchmark (60% stocks, 40% bonds) by employing the Regret-based Sharpe reward function. To address the impact of transaction fee frictions and prevent signal loss, we develop a transaction cost scheduler. We introduce a future-looking reward function and employ synthetic data training through a circular block bootstrap method to facilitate the learning of generalizable allocation strategies. We focus on two key evaluation measures: return and maximum drawdown. Given the high stochasticity of financial markets, we train 20 independent agents each period and evaluate their average performance against the benchmark. Our method not only enhances the performance of the existing portfolio strategy through strategic rebalancing but also demonstrates strong results compared to other baselines.

信息检索

[IR-0] Investigating Corporate Social Responsibility Initiatives: Examining the case of corporate Covid-19 response

链接: https://arxiv.org/abs/2502.03421
作者: Meheli Basu,Aniruddha Dutta,Purvi Shah
类目: Information Retrieval (cs.IR)
*备注: 7 Tables

点击查看摘要

Abstract:In today's age of freely available information, policymakers have to take into account a huge amount of information while making decisions affecting relevant stakeholders. While an increase in the number of information sources and documents raises the credibility of decisions based on the corpus of available text, it is challenging for policymakers to make sense of this information. This paper demonstrates how policymakers can apply some of the most popular topic recognition methods (Latent Dirichlet Allocation and the Deep Distributed Representation method) and text summarization approaches (the Word Based Sentence Ranking method and TextRank for sentence extraction) to summarize the content of large volumes of documents and grasp the gist of the information overload. We apply these NLP methods to corporate press releases from the early and advanced periods of the Covid-19 pandemic, which resulted in an unprecedented global health and socio-economic crisis, a time when policymaking and regulation became especially important for standardizing corporate practices for employee and social welfare in the face of similar unseen future crises. The steps undertaken in this study can be replicated to yield insights from relevant documents in any other social decision-making context.
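
A brief sketch of one of the listed techniques, Latent Dirichlet Allocation, on a toy corpus with scikit-learn; real press-release text would replace the sample documents:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "employee safety and remote work policy during the pandemic",
    "quarterly revenue growth and shareholder dividends announced",
    "community donations of masks and health support programs",
    "work from home guidance to protect employee wellbeing",
]
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]           # five highest-weight words
    print(f"topic {k}:", [vocab[i] for i in top])
```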

[IR-1] DenseReviewer: A Screening Prioritisation Tool for Systematic Review based on Dense Retrieval ECIR2025

链接: https://arxiv.org/abs/2502.03400
作者: Xinyu Mao,Teerapong Leelanupab,Harrisen Scells,Guido Zuccon
类目: Information Retrieval (cs.IR)
*备注: Accepted at ECIR 2025

点击查看摘要

Abstract:Screening is a time-consuming and labour-intensive yet required task for medical systematic reviews, as tens of thousands of studies often need to be screened. Prioritising relevant studies to be screened allows downstream systematic review creation tasks to start earlier and save time. In previous work, we developed a dense retrieval method to prioritise relevant studies with reviewer feedback during the title and abstract screening stage. Our method outperforms previous active learning methods in both effectiveness and efficiency. In this demo, we extend this prior work by creating (1) a web-based screening tool that enables end-users to screen studies exploiting state-of-the-art methods and (2) a Python library that integrates models and feedback mechanisms and allows researchers to develop and demonstrate new active learning methods. We describe the tool’s design and showcase how it can aid screening. The tool is available at this https URL. The source code is also open sourced at this https URL.

[IR-2] Interactive Visualization Recommendation with Hier-SUCB

链接: https://arxiv.org/abs/2502.03375
作者: Songwen Hu,Ryan A. Rossi,Tong Yu,Junda Wu,Handong Zhao,Sungchul Kim,Shuai Li
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Visualization recommendation aims to enable rapid visual analysis of massive datasets. In real-world scenarios, it is essential to quickly gather and comprehend user preferences to cover users from diverse backgrounds, including varying skill levels and analytical tasks. Previous approaches to personalized visualization recommendations are non-interactive and rely on initial user data for new users. As a result, these models cannot effectively explore options or adapt to real-time feedback. To address this limitation, we propose an interactive personalized visualization recommendation (PVisRec) system that learns on user feedback from previous interactions. For more interactive and accurate recommendations, we propose Hier-SUCB, a contextual combinatorial semi-bandit in the PVisRec setting. Theoretically, we show an improved overall regret bound with the same rank of time but an improved rank of action space. We further demonstrate the effectiveness of Hier-SUCB through extensive experiments where it is comparable to offline methods and outperforms other bandit algorithms in the setting of visualization recommendation.

[IR-3] Intent Representation Learning with Large Language Model for Recommendation

链接: https://arxiv.org/abs/2502.03307
作者: Yu Wang,Lei Sang,Yi Zhang,Yiwen Zhang
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Intent-based recommender systems have garnered significant attention for uncovering latent fine-grained preferences. Intents, as underlying factors of interactions, are crucial for improving recommendation interpretability. Most methods define intents as learnable parameters updated alongside interactions. However, existing frameworks often overlook textual information (e.g., user reviews, item descriptions), which is crucial for alleviating the sparsity of interaction intents. Exploring these multimodal intents, especially the inherent differences in representation spaces, poses two key challenges: i) How to align multimodal intents and effectively mitigate noise issues; ii) How to extract and match latent key intents across modalities. To tackle these challenges, we propose a model-agnostic framework, Intent Representation Learning with Large Language Model (IRLLRec), which leverages large language models (LLMs) to construct multimodal intents and enhance recommendations. Specifically, IRLLRec employs a dual-tower architecture to learn multimodal intent representations. Next, we propose pairwise and translation alignment to eliminate inter-modal differences and enhance robustness against noisy input features. Finally, to better match textual and interaction-based intents, we employ momentum distillation to perform teacher-student learning on fused intent representations. Empirical evaluations on three datasets show that our IRLLRec framework outperforms baselines. The implementation is available at this https URL.

[IR-4] Data Dams: A Novel Framework for Regulating and Managing Data Flow in Large-Scale Systems

链接: https://arxiv.org/abs/2502.03218
作者: Mohamed Aly Bouke,Azizol Abdullah,Korhan Cengiz,Nikola Ivković,Ivan Mihaljević,Mudathir Ahmed Mohamud,Ahmed Kowrina
类目: Information Retrieval (cs.IR); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:In the era of big data, managing dynamic data flows efficiently is crucial as traditional storage models struggle with real-time regulation and risk overflow. This paper introduces Data Dams, a novel framework designed to optimize data inflow, storage, and outflow by dynamically adjusting flow rates to prevent congestion while maximizing resource utilization. Inspired by physical dam mechanisms, the framework employs intelligent sluice controls and predictive analytics to regulate data flow based on system conditions such as bandwidth availability, processing capacity, and security constraints. Simulation results demonstrate that the Data Dam significantly reduces average storage levels (371.68 vs. 426.27 units) and increases total outflow (7999.99 vs. 7748.76 units) compared to static baseline models. By ensuring stable and adaptive outflow rates under fluctuating data loads, this approach enhances system efficiency, mitigates overflow risks, and outperforms existing static flow control strategies. The proposed framework presents a scalable solution for dynamic data management in large-scale distributed systems, paving the way for more resilient and efficient real-time processing architectures.
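
A toy simulation of the sluice-control idea (the parameters and the control rule are our assumptions, not the paper's): outflow ramps up as the buffer fills, bounding storage while sustaining throughput:

```python
import random

random.seed(0)
capacity, level = 1000.0, 0.0
max_outflow = 120.0

for step in range(50):
    inflow = random.uniform(50.0, 150.0)        # bursty data arrivals
    fill = level / capacity
    # Sluice rule: throttle outflow up as the dam fills.
    outflow = min(max_outflow * (0.5 + fill), level + inflow)
    level = max(0.0, min(capacity, level + inflow - outflow))
    if step % 10 == 0:
        print(f"step={step:2d} level={level:7.1f} outflow={outflow:6.1f}")
```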

[IR-5] Scientometric Analysis of the German IR Community within TREC & CLEF

链接: https://arxiv.org/abs/2502.03065
作者: A. K. Kruff,P. Schaer
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
*备注:

点击查看摘要

Abstract:Within this study, the influence of the German Information Retrieval community on the retrieval campaigns Text Retrieval Conference (TREC) and Conference and Labs of the Evaluation Forum (CLEF) between 2000 and 2022 was analyzed based on metadata provided by OpenAlex and further metadata extracted with the GROBID framework from the publication’s full texts. The analysis was conducted at the institutional and researcher levels. It was found that the German IR community, both on the author and institution level, mainly contributed to CLEF. Furthermore, it was shown that productivity follows the assumptions made by Lotka’s Law.

[IR-6] FuXi-alpha: Scaling Recommendation Model with Feature Interaction Enhanced Transformer WWW2025

链接: https://arxiv.org/abs/2502.03036
作者: Yufei Ye,Wei Guo,Jin Yao Chin,Hao Wang,Hong Zhu,Xi Lin,Yuyang Ye,Yong Liu,Ruiming Tang,Defu Lian,Enhong Chen
类目: Information Retrieval (cs.IR)
*备注: Accepted by WWW2025

点击查看摘要

Abstract:Inspired by scaling laws and large language models, research on large-scale recommendation models has gained significant attention. Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy. Current state-of-the-art sequential recommendation models primarily use self-attention mechanisms for explicit feature interactions among items, while implicit interactions are managed through Feed-Forward Networks (FFNs). However, these models often inadequately integrate temporal and positional information, either by adding them to attention weights or by blending them with latent representations, which limits their expressive power. A recent model, HSTU, further reduces the focus on implicit feature interactions, constraining its performance. We propose a new model called FuXi-$\alpha$ to address these issues. This model introduces an Adaptive Multi-channel Self-attention mechanism that distinctly models temporal, positional, and semantic features, along with a Multi-stage FFN to enhance implicit feature interactions. Our offline experiments demonstrate that our model outperforms existing models, with its performance continuously improving as the model size increases. Additionally, we conducted an online A/B test within the Huawei Music app, which showed a 4.76% increase in the average number of songs played per user and a 5.10% increase in the average listening duration per user. Our code has been released at this https URL.

[IR-7] Assessing Research Impact in Indian Conference Proceedings: Insights from Collaboration and Citations

链接: https://arxiv.org/abs/2502.02997
作者: Kiran Sharma,Parul Khurana
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Conferences serve as a crucial avenue for scientific communication. However, the increase in conferences and the subsequent publication of proceedings have prompted inquiries regarding the research quality being showcased at such events. This investigation delves into the conference publications indexed by Springer’s Lecture Notes in Networks and Systems Series. Among the 570 international conferences held worldwide in this series, 177 were exclusively hosted in India. These 177 conferences collectively published 11,066 papers as conference proceedings. All these publications, along with conference details, were sourced from the Scopus database. The study aims to evaluate the research impact of these conference proceedings and identify the primary contributors. The results reveal a downward trend in the average number of citations per year. The collective average citation for all publications is 1.01. Papers co-authored by Indian and international authors (5.6%) exhibit a higher average impact of 1.44, in contrast to those authored solely by Indian authors (84.9%), which have an average impact of 0.97. Notably, Indian-collaborated papers, among the largest contributors, predominantly originate from private colleges and universities. Only 19% of papers exhibit collaboration with institutes of different prestige, yet their impact is considerably higher as compared to collaboration with institutes of similar prestige. This study highlights the importance of improving research quality in academic forums.

[IR-8] Control Search Rankings Control the World: What is a Good Search Engine?

链接: https://arxiv.org/abs/2502.02957
作者: Simon Coghlan,Hui Xian Chia,Falk Scholer,Damiano Spina
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
*备注: Accepted to Springer’s AI and Ethics journal on February 4, 2025; 31 pages, 1 figure

点击查看摘要

Abstract:This paper examines the ethical question, ‘What is a good search engine?’ Since search engines are gatekeepers of global online information, it is vital they do their job ethically well. While the Internet is now several decades old, the topic remains under-explored from interdisciplinary perspectives. This paper presents a novel role-based approach involving four ethical models of types of search engine behavior: Customer Servant, Librarian, Journalist, and Teacher. It explores these ethical models with reference to the research field of information retrieval, and by means of a case study involving the COVID-19 global pandemic. It also reflects on the four ethical models in terms of the history of search engine development, from earlier crude efforts in the 1990s, to the very recent prospect of Large Language Model-based conversational information seeking systems taking on the roles of established web search engines like Google. Finally, the paper outlines considerations that inform present and future regulation and accountability for search engines as they continue to evolve. The paper should interest information retrieval researchers and others interested in the ethics of search engines.

附件下载

点击下载今日全部论文列表