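Today's arXiv Papers | 2025-02-24

This post lists the latest papers retrieved from arXiv.org on 2025-02-24. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically every morning at around 12:00.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.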

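[Quick read]: This paper examines how the memorization of personally identifiable information (PII) behaves and evolves during large language model (LLM) training. It finds that the amount and ease of PII memorization is a dynamic property that changes across the training pipeline and depends on commonly altered design choices. The core contribution is three newly characterized phenomena: (1) similar-appearing PII seen later in training can elicit memorization of PII seen earlier ("assisted memorization"), which accounts for up to 1/3 in the authors' settings; (2) adding PII can significantly increase memorization of other PII, by as much as about 7.5×; (3) removing PII can lead to other PII being memorized. The findings suggest that model creators should consider first- and second-order privacy risks when training models, to avoid regurgitating new PII.

Link: https://arxiv.org/abs/2502.15680
Authors: Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, Christopher A. Choquette-Choo
Affiliations: Northeastern University; Google DeepMind; University of Washington
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 23 pages, 26 figures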

Abstract: Due to the sensitive nature of personally identifiable information (PII), its owners may have the authority to control its inclusion or request its removal from large-language model (LLM) training. Beyond this, PII may be added or removed from training datasets due to evolving dataset curation techniques, because they were newly scraped for retraining, or because they were included in a new downstream fine-tuning stage. We find that the amount and ease of PII memorization is a dynamic property of a model that evolves throughout training pipelines and depends on commonly altered design choices. We characterize three such novel phenomena: (1) similar-appearing PII seen later in training can elicit memorization of earlier-seen sequences in what we call assisted memorization, and this is a significant factor (in our settings, up to 1/3); (2) adding PII can increase memorization of other PII significantly (in our settings, as much as approximately 7.5×); and (3) removing PII can lead to other PII being memorized. Model creators should consider these first- and second-order privacy risks when training models to avoid the risk of new PII regurgitation.

[NLP-1] FLEKE: Federated Locate-then-Edit Knowledge Editing

[Quick read]: This paper addresses the inefficiency of existing Locate-then-Edit Knowledge Editing (LEKE) methods in multi-client settings, caused by redundant mediator knowledge vector (MKV) computation and by privacy concerns. The key to the solution is a new task, Federated Locate-then-Edit Knowledge Editing (FLEKE), together with FedEdit, a two-stage framework that optimizes MKV selection and reuse. FedEdit lets clients retrieve relevant MKVs by cosine similarity, which enables knowledge re-editing and reduces redundant computation while preserving privacy.

Link: https://arxiv.org/abs/2502.15677
Authors: Zongkai Zhao, Guozeng Xu, Xiuhua Li, Kaiwen Wei, Jiang Zhong
Affiliations: School of Big Data & Software Engineering, Chongqing University, China; College of Computer Science, Chongqing University, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Locate-then-Edit Knowledge Editing (LEKE) is a key technique for updating large language models (LLMs) without full retraining. However, existing methods assume a single-user setting and become inefficient in real-world multi-client scenarios, where decentralized organizations (e.g., hospitals, financial institutions) independently update overlapping knowledge, leading to redundant mediator knowledge vector (MKV) computations and privacy concerns. To address these challenges, we introduce Federated Locate-then-Edit Knowledge Editing (FLEKE), a novel task that enables multiple clients to collaboratively perform LEKE while preserving privacy and reducing computational overhead. To achieve this, we propose FedEdit, a two-stage framework that optimizes MKV selection and reuse. In the first stage, clients locally apply LEKE and upload the computed MKVs. In the second stage, rather than relying solely on server-based MKV sharing, FLEKE allows clients to retrieve relevant MKVs based on cosine similarity, enabling knowledge re-edit and minimizing redundant computations. Experimental results on two benchmark datasets demonstrate that FedEdit retains over 96% of the performance of non-federated LEKE while significantly outperforming a FedAvg-based baseline by approximately twofold. Besides, we find that MEMIT performs more consistently than PMET in the FLEKE task with our FedEdit framework. Our code is available at this https URL.
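
The second-stage retrieval described in the abstract can be pictured with a small sketch. This is not the FedEdit implementation: the function names, the edit identifiers, the toy three-dimensional vectors, and the 0.9 threshold are all assumptions made for illustration. It only shows the general idea of picking a shared vector by cosine similarity and falling back to local computation when nothing is close enough.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_mkv(query_vec, shared_mkvs, threshold=0.9):
    """Return (key, similarity) of the closest shared vector, or None.

    shared_mkvs maps a hypothetical edit id to an uploaded vector.
    threshold is a hypothetical cut-off, not a value from the paper.
    """
    best_key, best_sim = None, -1.0
    for key, vec in shared_mkvs.items():
        sim = cosine_similarity(query_vec, vec)
        if sim > best_sim:
            best_key, best_sim = key, sim
    if best_key is None or best_sim < threshold:
        return None  # nothing reusable: the client would compute its own vector
    return best_key, best_sim

# Toy data (invented): two vectors uploaded by other clients.
shared = {
    "hospital_a/edit_1": [1.0, 0.0, 0.0],
    "bank_b/edit_7": [0.0, 1.0, 0.0],
}
hit = retrieve_mkv([0.9, 0.1, 0.0], shared)
miss = retrieve_mkv([0.5, 0.5, 0.7], shared)
print(hit, miss)
```

In this toy run the first query is close to one uploaded vector and reuses it, while the second is not close to either and returns None.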

[NLP-2] AutoToM: Automated Bayesian Inverse Planning and Model Discovery for Open-ended Theory of Mind

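[Quick read]: This paper addresses the limitations of existing methods for open-ended machine Theory of Mind (ToM) reasoning. Current approaches either rely on large language models (LLMs), which are prone to systematic errors, or use handcrafted Bayesian Theory of Mind (BToM) models, which are more robust but do not generalize across domains. The key solution is AutoToM, an automated Bayesian Theory of Mind method that can operate in any domain, infer any mental variable, and conduct robust ToM reasoning of any order. AutoToM proposes an initial BToM model, performs automated Bayesian inverse planning on the proposed model with an LLM as the backend, and iteratively refines the model to reduce the uncertainty of the inference. Evaluations on multiple ToM benchmarks show that AutoToM consistently achieves state-of-the-art performance, offering a scalable, robust, and interpretable approach to machine Theory of Mind.

Link: https://arxiv.org/abs/2502.15676
Authors: Zhining Zhang, Chuanyang Jin, Mung Yao Jia, Tianmin Shu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages, 6 figures, 11 tables. Website at this https URL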

Abstract:Theory of Mind (ToM), the ability to understand people’s mental variables based on their behavior, is key to developing socially intelligent agents. Current approaches to Theory of Mind reasoning either rely on prompting Large Language Models (LLMs), which are prone to systematic errors, or use rigid, handcrafted Bayesian Theory of Mind (BToM) models, which are more robust but cannot generalize across different domains. In this work, we introduce AutoToM, an automated Bayesian Theory of Mind method for achieving open-ended machine Theory of Mind. AutoToM can operate in any domain, infer any mental variable, and conduct robust Theory of Mind reasoning of any order. Given a Theory of Mind inference problem, AutoToM first proposes an initial BToM model. It then conducts automated Bayesian inverse planning based on the proposed model, leveraging an LLM as the backend. Based on the uncertainty of the inference, it iteratively refines the model, by introducing additional mental variables and/or incorporating more timesteps in the context. Empirical evaluations across multiple Theory of Mind benchmarks demonstrate that AutoToM consistently achieves state-of-the-art performance, offering a scalable, robust, and interpretable approach to machine Theory of Mind.

[NLP-3] Almost AI Almost Human: The Challenge of Detecting AI-Polished Writing

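[Quick read]: This paper addresses the difficulty of detecting AI-polished text, that is, human-written text lightly refined with AI tools that detectors mistake for fully AI-generated text. The key is the AI-Polished-Text Evaluation (APT-Eval) dataset of 11.7K samples covering varying levels of AI involvement, which is used to systematically evaluate eleven state-of-the-art AI-text detectors. The study finds that detectors frequently misclassify even minimally polished text as AI-generated, struggle to distinguish between degrees of AI involvement, and are biased against older and smaller models. These limitations highlight the need for more nuanced detection methods.

Link: https://arxiv.org/abs/2502.15666
Authors: Shoumik Saha, Soheil Feizi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 17 pages, 17 figures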

Abstract:The growing use of large language models (LLMs) for text generation has led to widespread concerns about AI-generated content detection. However, an overlooked challenge is AI-polished text, where human-written content undergoes subtle refinements using AI tools. This raises a critical question: should minimally polished text be classified as AI-generated? Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. In this study, we systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation (APT-Eval) dataset, which contains 11.7K samples refined at varying AI-involvement levels. Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models. These limitations highlight the urgent need for more nuanced detection methodologies.

[NLP-4] Machine-generated text detection prevents language model collapse

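[Quick read]: This paper addresses model collapse in large language models (LLMs) trained on data containing an unknown proportion of synthetic text. Model collapse is a degenerative process in which a model reinforces its own errors and its performance drops. The key solution is a method that resamples the data distribution using importance weights, with the weights obtained from a machine-generated text detector. The method is shown to prevent model collapse and, when the training set contains enough human-authored data, to improve model performance.

Link: https://arxiv.org/abs/2502.15654
Authors: George Drayson, Vasileios Lampos
Affiliations: Centre for Artificial Intelligence; Department of Computer Science; University College London
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: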

Abstract:As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since web data is the primary resource for LLM pretraining, future models will be trained on an unknown portion of synthetic data. This will lead to model collapse, a degenerative process which causes models to reinforce their own errors and experience a drop in model performance. In this study, we investigate the impact of decoding strategy on model collapse, where we analyse the characteristics of the generated data during recursive training, its similarity to human references and the resulting model performance. Using the decoding strategies that lead to the most significant model degradation, we tackle the question: how to avoid model collapse when the origin (human or synthetic) of the training data is unknown. We design a novel methodology based on resampling the data distribution using importance weights from our machine-generated text detector. Our method is validated on two LLM variants (GPT-2 and SmolLM2) on the open-ended text generation task, demonstrating that we can successfully prevent model collapse and when there is enough human-authored data in the training dataset, our method improves model performance.
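
To make the resampling idea concrete, here is a minimal sketch under stated assumptions. The abstract does not give the weighting formula, so the weight `1 - p_machine` used below is an illustrative choice, and the function name, corpus, and detector scores are invented. The sketch only shows how detector outputs can bias a resampled training set toward text judged to be human-authored.

```python
import random

def resample_by_detector(samples, p_machine, k, seed=0):
    """Draw k samples with replacement, weighted by estimated human-ness.

    p_machine[i] is a detector's probability that samples[i] is
    machine-generated. The weight 1 - p_machine is an illustrative
    assumption, not the paper's formula.
    """
    if len(samples) != len(p_machine):
        raise ValueError("samples and p_machine must have the same length")
    weights = [1.0 - p for p in p_machine]
    if sum(weights) <= 0.0:
        raise ValueError("all samples have zero weight")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.choices(samples, weights=weights, k=k)

# Toy corpus and detector scores (invented for illustration).
corpus = ["human essay", "forum post", "synthetic story", "synthetic summary"]
detector_scores = [0.05, 0.10, 0.95, 1.00]
batch = resample_by_detector(corpus, detector_scores, k=1000)
print({text: batch.count(text) for text in corpus})
```

Sampling with replacement keeps the batch size fixed while shrinking the share of text the detector flags; a sample scored 1.00 gets zero weight and is never drawn.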

[NLP-5] Empowering LLMs with Logical Reasoning: A Comprehensive Survey

[Quick read]: This paper addresses the significant challenges that large language models (LLMs) still face in logical reasoning: failures on complex reasoning in logical question answering, and logical inconsistency across different questions. Its main contribution is to categorize existing methods and propose detailed taxonomies. For answering complex logic questions accurately, methods are grouped by their reliance on external solvers, prompts, pretraining, and fine-tuning. For avoiding logical contradictions, the paper discusses the concepts and solutions for various logical consistencies, including implication, negation, transitivity, factuality consistency, and their composites. It also reviews commonly used benchmark datasets and evaluation metrics, and discusses promising research directions such as extensions to modal logic to account for uncertainty and efficient algorithms that satisfy multiple logical consistencies simultaneously.

Link: https://arxiv.org/abs/2502.15652
Authors: Fengxiang Cheng, Haoxuan Li, Fenrong Liu, Robert van Rooij, Kun Zhang, Zhouchen Lin
Affiliations: Institute for Logic, Language and Computation, University of Amsterdam; Center for Data Science, Peking University; Machine Learning Department, MBZUAI; Department of Philosophy, Tsinghua University; Department of Philosophy, CMU; Institute for Artificial Intelligence, Peking University; Peng Cheng Laboratory; National Key Lab of General AI, School of Intelligence Science and Technology, Peking University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Large language models (LLMs) have achieved remarkable successes on various natural language tasks. However, recent studies have found that there are still significant challenges to the logical reasoning abilities of LLMs. This paper summarizes and categorizes the main challenges into two aspects: (1) Logical question answering: LLMs often fail to generate the correct answer within complex logical problems, which require sophisticated deductive, inductive, or abductive reasoning given a collection of premises and constraints. (2) Logical consistency: LLMs are prone to producing responses contradicting themselves across different questions. For example, a state-of-the-art Macaw question-answering LLM answers "Yes" to both questions "Is a magpie a bird?" and "Does a bird have wings?" but answers "No" to "Does a magpie have wings?". To facilitate this research direction, we comprehensively investigate the most cutting-edge methods and propose detailed taxonomies of these methods. Specifically, to accurately answer complex logic questions, previous methods can be categorized based on reliance on external solvers, prompts, pretraining, and fine-tuning. To avoid logical contradictions, we discuss concepts and solutions of various logical consistencies, including implication, negation, transitivity, factuality consistency, and their composites. In addition, we review commonly used benchmark datasets and evaluation metrics, and discuss promising research directions, such as extensions to modal logic to account for uncertainty, and efficient algorithms satisfying multiple logical consistencies simultaneously.
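
The magpie example in the abstract is a transitivity violation, and a check for it is easy to sketch. The function name and data layout below are assumptions made for illustration; the survey itself does not prescribe this procedure. Given yes/no answers and a list of (premise, premise, conclusion) question triples, the sketch flags every triple where both premises are answered yes but the conclusion is answered no.

```python
def find_transitivity_violations(answers, chains):
    """Return chains whose two premises are True but whose conclusion is False.

    answers: dict mapping a question to the model's yes/no answer as a bool.
    chains: list of (premise_1, premise_2, conclusion) question triples.
    """
    violations = []
    for p1, p2, conclusion in chains:
        if answers[p1] and answers[p2] and not answers[conclusion]:
            violations.append((p1, p2, conclusion))
    return violations

# The answers reported for the Macaw example in the abstract.
answers = {
    "Is a magpie a bird?": True,
    "Does a bird have wings?": True,
    "Does a magpie have wings?": False,
}
chains = [
    ("Is a magpie a bird?", "Does a bird have wings?", "Does a magpie have wings?"),
]
violations = find_transitivity_violations(answers, chains)
print(violations)
```

A consistent model would produce an empty list here; the reported answers produce one violation.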

[NLP-6] Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models

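[Quick read]: This paper addresses the alignment of cross-lingual representations in multilingual large language models (mLLMs), particularly when sizable language data is unavailable and fine-tuning is computationally expensive. The key approach is model interventions, which manipulate model activations to steer generation: the authors adjust language-specific neurons to enhance cross-lingual alignment, which in turn improves downstream retrieval, with up to 2x improvement in top-1 accuracy.

Link: https://arxiv.org/abs/2502.15639
Authors: Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina
Affiliations: Apple
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages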

Abstract:Aligned representations across languages is a desired property in multilingual large language models (mLLMs), as alignment can improve performance in cross-lingual tasks. Typically alignment requires fine-tuning a model, which is computationally expensive, and sizable language data, which often may not be available. A data-efficient alternative to fine-tuning is model interventions – a method for manipulating model activations to steer generation into the desired direction. We analyze the effect of a popular intervention (finding experts) on the alignment of cross-lingual representations in mLLMs. We identify the neurons to manipulate for a given language and introspect the embedding space of mLLMs pre- and post-manipulation. We show that modifying the mLLM’s activations changes its embedding space such that cross-lingual alignment is enhanced. Further, we show that the changes to the embedding space translate into improved downstream performance on retrieval tasks, with up to 2x improvements in top-1 accuracy on cross-lingual retrieval.

[NLP-7] Extraction multi-étiquettes de relations en utilisant des couches de Transformer

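[Quick read]: This paper addresses multi-label relation extraction in French texts. The key is the BTransformer18 model, which combines the contextual representations of pre-trained language models from the BERT family (such as CamemBERT and FlauBERT) with the ability of Transformer encoders to capture long-term dependencies between tokens. Experiments show that BTransformer18 with CamemBERT-Large reaches a macro F1 score of 0.654 on the TextMine'25 challenge dataset, outperforming FlauBERT-Large, which supports the effectiveness of the approach.

Link: https://arxiv.org/abs/2502.15619
Authors: Ngoc Luyen Le, Gildas Tagny Ngompé
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: in French language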

Abstract:In this article, we present the BTransformer18 model, a deep learning architecture designed for multi-label relation extraction in French texts. Our approach combines the contextual representation capabilities of pre-trained language models from the BERT family - such as BERT, RoBERTa, and their French counterparts CamemBERT and FlauBERT - with the power of Transformer encoders to capture long-term dependencies between tokens. Experiments conducted on the dataset from the TextMine’25 challenge show that our model achieves superior performance, particularly when using CamemBERT-Large, with a macro F1 score of 0.654, surpassing the results obtained with FlauBERT-Large. These results demonstrate the effectiveness of our approach for the automatic extraction of complex relations in intelligence reports.
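
For readers unfamiliar with the reported metric, the following sketch shows one common way to compute macro F1 for multi-label predictions. It is not the challenge's official scorer: the function name, the relation labels, and the convention that a label with no gold and no predicted instances scores 0 are assumptions made for illustration.

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over labels for multi-label predictions.

    gold, pred: lists of label sets, one set per example.
    A label with no gold and no predicted instances scores 0 here
    (one possible convention, assumed for this sketch).
    """
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical relation labels and predictions for three examples.
gold = [{"located_in"}, {"member_of", "located_in"}, {"member_of"}]
pred = [{"located_in"}, {"located_in"}, {"member_of", "located_in"}]
score = macro_f1(gold, pred, ["located_in", "member_of"])
print(round(score, 3))
```

Because the per-label F1 scores are averaged without weighting, rare relation types count as much as frequent ones, which is why macro F1 is a demanding metric for imbalanced label sets.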

[NLP-8] Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing (ICLR 2025)

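[Quick read]: This paper addresses online, dynamic, structured pruning of Large Language Models (LLMs). The key is the Probe Pruning (PP) framework, which identifies crucial weights by probing a small portion of the samples and tokens in each batch and then prunes based on the probing states combined with historical states. The method requires no additional neural network modules or fine-tuning. Specifically, probing with just 1.5% of FLOPs can substantially improve the efficiency of structured pruning of LLMs.

Link: https://arxiv.org/abs/2502.15618
Authors: Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar
Affiliations: University of Minnesota; University of North Carolina at Charlotte
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICLR 2025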

Abstract: We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing, using just 1.5% of FLOPs, can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at this https URL.

[NLP-9] Pastiche Novel Generation Creating: Fan Fiction You Love in Your Favorite Author's Style

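[Quick read]: This paper addresses the problem that current novel generation methods rely on brief, simplistic story outlines and generate details in plain, generic language. The key solution is the Pastiche Novel Generation task, which requires generated novels to imitate the distinctive features of the original work, including understanding character profiles, predicting plausible plot developments, and writing concrete details in vivid, expressive language. To this end, the paper introduces WriterAgent, a system trained with a curriculum learning paradigm that progresses from low-level mastery of language style to high-level narrative coherence, ensuring comprehensive narrative control. WriterAgent builds on the WriterLoRA framework, an extension of LoRA with hierarchical and cumulative task-specific modules, each specializing in a different narrative aspect.

Link: https://arxiv.org/abs/2502.15616
Authors: Xueran Han, Yuhan Liu, Mingzhe Li, Wei Liu, Sen Hu, Rui Yan, Zhiqiang Xu, Xiuying Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: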

Abstract:Great novels create immersive worlds with rich character arcs, well-structured plots, and nuanced writing styles. However, current novel generation methods often rely on brief, simplistic story outlines and generate details using plain, generic language. To bridge this gap, we introduce the task of Pastiche Novel Generation, which requires the generated novels to imitate the distinctive features of the original work, including understanding character profiles, predicting plausible plot developments, and writing concrete details using vivid, expressive language. To achieve this, we propose WriterAgent, a novel generation system designed to master the core aspects of literary pastiche. WriterAgent is trained through a curriculum learning paradigm, progressing from low-level stylistic mastery to high-level narrative coherence. Its key tasks include language style learning, character modeling, plot planning, and stylish writing, ensuring comprehensive narrative control. To support this, WriterAgent leverages the WriterLoRA framework, an extension of LoRA with hierarchical and cumulative task-specific modules, each specializing in a different narrative aspect. We evaluate WriterAgent on multilingual classics like Harry Potter and Dream of the Red Chamber, demonstrating its superiority over baselines in capturing the target author’s settings, character dynamics, and writing style to produce coherent, faithful narratives.

[NLP-10] LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models

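[Quick read]: This paper addresses the gap between the efficiency of state space models (SSMs) such as Mamba for long-context sequence modeling and their lack of interpretability tools. The key is LaTIM, a novel token-level decomposition method that enables fine-grained interpretability for both Mamba-1 and Mamba-2 and reveals the model's token-to-token interaction patterns.

Link: https://arxiv.org/abs/2502.15612
Authors: Hugo Pitorro, Marcos Treviso
Affiliations: Instituto de Telecomunicações, Lisbon
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 10 figures in the main paper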

Abstract:State space models (SSMs), such as Mamba, have emerged as an efficient alternative to transformers for long-context sequence modeling. However, despite their growing adoption, SSMs lack the interpretability tools that have been crucial for understanding and improving attention-based architectures. While recent efforts provide insights into Mamba’s internal mechanisms, they do not explicitly decompose token-wise contributions, leaving gaps in understanding how Mamba selectively processes sequences across layers. In this work, we introduce LaTIM, a novel token-level decomposition method for both Mamba-1 and Mamba-2 that enables fine-grained interpretability. We extensively evaluate our method across diverse tasks, including machine translation, copying, and retrieval-based generation, demonstrating its effectiveness in revealing Mamba’s token-to-token interaction patterns.

[NLP-11] On the Robustness of Transformers against Context Hijacking for Linear Classification

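[Quick read]: This paper addresses context hijacking, a robustness issue in which the predictions of Transformer-based Large Language Models (LLMs) are disrupted by factually correct context. Its key result, obtained through theoretical analysis and experimental validation, is that deeper Transformers can achieve higher robustness. Specifically, deeper layers enable more fine-grained optimization steps, which effectively mitigates interference from context hijacking. This conclusion aligns with empirical observations and is supported by numerical experiments.

Link: https://arxiv.org/abs/2502.15609
Authors: Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: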

Abstract:Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.

[NLP-12] Do Multilingual LLMs Think In English?

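[Quick read]: This paper investigates how large language models (LLMs) make decisions in multilingual tasks and finds that key decisions are made in a representation space closest to English, regardless of the input and output languages. The key method is logit lens analysis, which shows that when processing sentences in different languages the LLM first produces representations close to English and only then translates them into the target language. The paper also shows that activation steering is more effective when the steering vectors are computed in English rather than in the input and output language. This suggests that multilingual LLMs carry out key reasoning steps in a representation heavily shaped by English, in a way that is not transparent to system users.

Link: https://arxiv.org/abs/2502.15603
Authors: Lisa Schut, Yarin Gal, Sebastian Farquhar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Main paper 9 pages; including appendix 48 pages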

Abstract:Large language models (LLMs) have multilingual capabilities and can solve tasks across various languages. However, we show that current LLMs make key decisions in a representation space closest to English, regardless of their input and output languages. Exploring the internal representations with a logit lens for sentences in French, German, Dutch, and Mandarin, we show that the LLM first emits representations close to English for semantically-loaded words before translating them into the target language. We further show that activation steering in these LLMs is more effective when the steering vectors are computed in English rather than in the language of the inputs and outputs. This suggests that multilingual LLMs perform key reasoning steps in a representation that is heavily shaped by English in a way that is not transparent to system users.
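
The logit lens mentioned in the abstract decodes intermediate hidden states with the model's output (unembedding) matrix to see which token each layer currently favors. The toy sketch below illustrates only that projection step. Everything in it is invented for illustration: the three-word vocabulary, the two-dimensional vectors, and the function name. A real setup would read hidden states from an actual model and would usually apply the final layer normalization first, which is omitted here.

```python
def logit_lens(hidden_states, unembedding, vocab):
    """Decode each layer's hidden state to its top-scoring vocabulary token.

    hidden_states: list of per-layer vectors for a single position.
    unembedding: one weight row per vocabulary token, same width as a
    hidden state. All numbers used below are toy values.
    """
    decoded = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembedding]
        best = max(range(len(vocab)), key=lambda i: logits[i])
        decoded.append(vocab[best])
    return decoded

vocab = ["flower", "fleur", "Blume"]
unembedding = [
    [1.0, 0.0],    # row for "flower"
    [0.0, 1.0],    # row for "fleur"
    [-1.0, -1.0],  # row for "Blume"
]
# Invented hidden states for three successive layers at one position.
layers = [[0.9, 0.2], [0.6, 0.5], [0.1, 0.8]]
decoded = logit_lens(layers, unembedding, vocab)
print(decoded)  # ['flower', 'flower', 'fleur']
```

The toy values are arranged so that the English token leads in earlier layers and the French token only in the last one, mirroring the kind of pattern the abstract describes; they are not measurements from any model.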

[NLP-13] Robust Bias Detection in MLMs and its Application to Human Trait Ratings (NAACL 2025)

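[Quick read]: This paper addresses limitations of prior work on assessing bias in masked language models (MLMs): overlooking the random variability of templates and target concepts, assuming equality among templates, and overlooking bias quantification. The key solution is a systematic statistical approach that uses mixed models to account for random effects, pseudo-perplexity weights for sentences derived from templates, and statistical effect sizes to quantify bias. The approach replicates the results of prior studies and then explores the new problem of gender bias in personality and character traits across seven MLMs (base and large).

Link: https://arxiv.org/abs/2502.15600
Authors: Ingroj Shrestha, Louis Tay, Padmini Srinivasan
Affiliations: University of Iowa; Purdue University
Subjects: Computation and Language (cs.CL)
Comments: To appear at Findings of NAACL 2025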

Abstract: There has been significant prior work using templates to study bias against demographic attributes in MLMs. However, these have limitations: they overlook random variability of templates and target concepts analyzed, assume equality amongst templates, and overlook bias quantification. Addressing these, we propose a systematic statistical approach to assess bias in MLMs, using mixed models to account for random effects, pseudo-perplexity weights for sentences derived from templates and quantify bias using statistical effect sizes. Replicating prior studies, we match on bias scores in magnitude and direction with small to medium effect sizes. Next, we explore the novel problem of gender bias in the context of personality and character traits, across seven MLMs (base and large). We find that MLMs vary; ALBERT is unbiased for binary gender but the most biased for non-binary neo, while RoBERTa-large is the most biased for binary gender but shows small to no bias for neo. There is some alignment of MLM bias and findings in psychology (human perspective) - in agreeableness with RoBERTa-large and emotional stability with BERT-large. There is general agreement for the remaining 3 personality dimensions: both sides observe at most small differences across gender. For character traits, human studies on gender bias are limited thus comparisons are not feasible.
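
Pseudo-perplexity is usually defined for a masked language model as the exponential of the negative mean of the per-token log-probabilities obtained by masking one token at a time. The sketch below computes that quantity from log-probabilities supplied by the caller. The function name and the numbers are assumptions made for illustration, and how the authors turn pseudo-perplexity into sentence weights is not stated in the abstract, so that step is not shown.

```python
import math

def pseudo_perplexity(token_log_probs):
    """exp of the negative mean pseudo-log-likelihood of one sentence.

    token_log_probs[i] is log P(token_i | all other tokens), i.e. the
    score a masked language model gives token_i when it alone is masked.
    The values are supplied by the caller in this sketch.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Invented per-token probabilities for a four-token template sentence.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = pseudo_perplexity(log_probs)
print(round(ppl, 3))
```

Lower values mean the model finds the filled-in template more natural, which is the kind of signal that can be used to avoid treating every template sentence as equally informative.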

[NLP-14] SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention

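[Quick read]: This paper addresses unsafe behavior induced in deployed large language models (LLMs) by jailbreak attacks. Such attacks exploit model vulnerabilities to elicit harmful behavior, and existing defenses often fail to be both effective and efficient. The key solution is SafeIntervention (SafeInt), a new defense that shields LLMs from jailbreak attacks through safety-aware representation intervention. SafeInt adjusts the representation distributions of jailbreak samples so that they align with the representations of unsafe samples, while minimizing perturbations to jailbreak-irrelevant representations, which keeps the defense both effective and efficient.

Link: https://arxiv.org/abs/2502.15594
Authors: Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan
Affiliations: Nankai University; Tianjin University of Technology
Subjects: Computation and Language (cs.CL)
Comments: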

Abstract:With the widespread real-world deployment of large language models (LLMs), ensuring their behavior complies with safety standards has become crucial. Jailbreak attacks exploit vulnerabilities in LLMs to induce undesirable behavior, posing a significant threat to LLM safety. Previous defenses often fail to achieve both effectiveness and efficiency simultaneously. Defenses from a representation perspective offer new insights, but existing interventions cannot dynamically adjust representations based on the harmfulness of the queries. To address this limitation while ensuring both effectiveness and efficiency, we propose SafeIntervention (SafeInt), a novel defense method that shields LLMs from jailbreak attacks through safety-aware representation intervention. SafeInt is built on our analysis of the representations of jailbreak samples. It adjusts representation distributions of jailbreak samples through intervention to align them with the representations of unsafe samples while minimizing unnecessary perturbations to jailbreak-irrelevant representations. We conduct comprehensive experiments covering six jailbreak attacks, two jailbreak datasets, and two utility benchmarks. Experimental results demonstrate that SafeInt outperforms all baselines in defending LLMs against jailbreak attacks while largely maintaining utility. Additionally, we evaluate SafeInt against adaptive attacks and verify its effectiveness in mitigating real-time attacks.

[NLP-15] Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

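[Quick read]: This paper addresses how to design instruction data for the post-training phase of a long-context pre-trained model, in particular how much and what type of context is needed for efficient and effective post-training. The key finding is that models instruction-tuned on short contexts can generalize effectively to longer ones; the study also identifies other critical factors such as instruction difficulty and context composition. Based on these findings, the paper proposes context synthesis, a novel data synthesis framework that uses off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiments show that the approach outperforms previous instruction synthesis methods and comes close to the performance of human-annotated long-context instruction data.

Link: https://arxiv.org/abs/2502.15592
Authors: Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University; School of Informatics, University of Edinburgh; Shanghai Artificial Intelligence Laboratory; University of Zurich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: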

Abstract:Long-context modelling for large language models (LLMs) has been a key area of recent research because many real world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: this https URL.

[NLP-16] LightThinker: Thinking Step-by-Step Compression

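[Quick read]: This paper addresses the high memory and computational cost of large language models (LLMs) on complex reasoning tasks, especially when generating long sequences. The key solution is LightThinker, a method that dynamically compresses intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, which significantly reduces the number of tokens stored in the context window. The model is trained on when and how to compress through data construction, mapping hidden states to condensed gist tokens, and specialized attention masks. The paper also introduces the Dependency (Dep) metric to quantify the degree of compression. Experiments show that LightThinker reduces peak memory usage and inference time while maintaining competitive accuracy.

Link: https://arxiv.org/abs/2502.15589
Authors: Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
Affiliations: Zhejiang University; Ant Group; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: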

Abstract:Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code will be released at this https URL.

[NLP-17] Chats-Grid: An Iterative Retrieval QA Optimization Scheme Leveraging Large Model and Retrieval Enhancement Generation in smart grid

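[Quick Read]: This paper addresses the challenges that conventional retrieval-augmented generation (RAG) question-answering systems face in smart grid environments: inadequate retrieval quality, irrelevant responses, and inefficiency in handling large-scale, real-time data streams. The key to the solution is Chats-Grid, an optimized iterative retrieval-based QA framework. In the pre-retrieval phase, advanced query expansion ensures comprehensive coverage of diverse data sources. During retrieval, BM25 sparse retrieval and BGE dense retrieval are combined to handle large, heterogeneous datasets. In the post-retrieval phase, a fine-tuned large language model uses prompt engineering to assess relevance, filter out irrelevant results, and reorder documents, and then generates precise, context-aware answers, with a self-checking mechanism to improve reliability.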

Link: https://arxiv.org/abs/2502.15583
Authors: Yunfeng Li, Jiqun Zhang, Guofu Liao, Xue Shi, Junhong Liu
Affiliations: Department of Electronics and Information Engineering, Shenzhen University, China; Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR, China
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 10 figures

Abstract: With rapid advancements in artificial intelligence, question-answering (QA) systems have become essential in intelligent search engines, virtual assistants, and customer service platforms. However, in dynamic domains like smart grids, conventional retrieval-augmented generation (RAG) QA systems face challenges such as inadequate retrieval quality, irrelevant responses, and inefficiencies in handling large-scale, real-time data streams. This paper proposes an optimized iterative retrieval-based QA framework called Chats-Grid, tailored for smart grid environments. In the pre-retrieval phase, Chats-Grid’s advanced query expansion ensures comprehensive coverage of diverse data sources, including sensor readings, meter records, and control system parameters. During retrieval, Best Matching 25 (BM25) sparse retrieval and BAAI General Embedding (BGE) dense retrieval are combined to process vast, heterogeneous datasets effectively. Post-retrieval, a fine-tuned large language model uses prompt engineering to assess relevance, filter irrelevant results, and reorder documents based on contextual accuracy. The model further generates precise, context-aware answers, adhering to quality criteria and employing a self-checking mechanism for enhanced reliability. Experimental results demonstrate Chats-Grid’s superiority over state-of-the-art methods in fidelity, contextual recall, relevance, and accuracy by 2.37%, 2.19%, and 3.58% respectively. This framework advances smart grid management by improving decision-making and user interactions, fostering resilient and adaptive smart grid infrastructures.
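
The combination of sparse and dense retrieval described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the Chats-Grid implementation: the abstract only says the two retrievers are combined, so the min-max normalisation, the weighted sum, the `alpha` weight, and the toy two-dimensional vectors standing in for BGE embeddings are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def minmax(xs):
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0] * len(xs) if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def hybrid_rank(query_tokens, query_vec, docs, doc_vecs, alpha=0.5):
    """Rank document indices by a weighted sum of sparse and dense scores.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    sparse = minmax(bm25_scores(query_tokens, docs))
    dense = minmax([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)

# Toy corpus; the 2-d vectors are placeholders for real embeddings.
docs = [["transformer", "load", "alarm"],
        ["meter", "reading", "voltage"],
        ["voltage", "sensor", "reading", "reading"]]
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
ranking = hybrid_rank(["voltage", "reading"], [0.0, 1.0], docs, doc_vecs)
```

A weighted sum is only one possible fusion rule; reciprocal rank fusion would be an equally plausible choice when the two score scales are hard to compare.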

[NLP-18] Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

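[Quick Read]: This paper addresses the problem that large language models (LLMs) occasionally produce flawed or unexpected responses when handling human queries. The key is to improve how sparse autoencoder (SAE) features are explained, so that explanations capture semantic concepts rather than merely emphasizing linguistic patterns. To this end, the authors propose using a fixed vocabulary set for feature interpretation and design a mutual information-based objective to better capture the semantic meaning behind the features. They also propose two runtime steering strategies that adjust the learned feature activations according to the corresponding explanations. Together, these improve the interpretability of LLM behavior and effectively defend against jailbreak attacks.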

Link: https://arxiv.org/abs/2502.15576
Authors: Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu
Affiliations: University of Georgia; Rice University; Amazon
Subjects: Computation and Language (cs.CL)
Comments: Pre-print. 20 pages, 5 figures

Abstract:Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explored how to better explain SAE features, i.e., understanding the semantic meaning of features learned by SAE. Our theoretical analysis reveals that existing explanation methods suffer from the frequency bias issue, where they emphasize linguistic patterns over semantic concepts, while the latter is more critical to steer LLM behaviors. To address this, we propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective, aiming to better capture the semantic meaning behind these features. We further propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations. Empirical results show that, compared to baselines, our method provides more discourse-level explanations and effectively steers LLM behaviors to defend against jailbreak attacks. These findings highlight the value of explanations for steering LLM behaviors in downstream applications. We will release our code and data once accepted.

[NLP-19] A Survey of QUD Models for Discourse Processing NAACL2025

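[Quick Read]: This paper surveys how the Question Under Discussion (QUD) framework is applied in natural language processing, with a focus on its implementation for written texts. The key is to summarize and analyze existing models and to examine the relationship between QUD and mainstream discourse frameworks such as Rhetorical Structure Theory (RST), the Penn Discourse Treebank (PDTB), and Segmented Discourse Representation Theory (SDRT).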

Link: https://arxiv.org/abs/2502.15573
Authors: Yingxue Fu
Affiliations: University of St Andrews
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the main conference of NAACL 2025

Abstract: Question Under Discussion (QUD), originally a linguistic analytic framework, has gained increasing attention in the natural language processing community over the years. Various models have been proposed for implementing QUD for discourse processing. This survey summarizes these models, with a focus on application to written texts, and examines studies that explore the relationship between QUD and mainstream discourse frameworks, including RST, PDTB and SDRT. Some questions that may require further study are suggested.

[NLP-20] DReSD: Dense Retrieval for Speculative Decoding

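[Quick Read]: This paper addresses inefficiencies in existing speculative decoding (SD) methods for accelerating large language model (LLM) generation. Sparse retrieval (REST) is currently widely adopted for its simplicity and scalability, but its effectiveness is limited by short contexts and exact string matching. The key solution is Dense Retrieval for Speculative Decoding (DReSD), a new framework that uses approximate nearest neighbour search over contextualised token embeddings to retrieve the most semantically relevant token sequences. Experiments show that, compared with sparse retrieval (REST), DReSD achieves on average 87% higher acceptance rates, 65% longer accepted tokens, and 19% faster generation.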

Link: https://arxiv.org/abs/2502.15572
Authors: Milan Gritta, Huiyin Xue, Gerasimos Lampouras
Affiliations: Huawei Noah’s Ark Lab, London, UK; University of Sheffield, UK
Subjects: Computation and Language (cs.CL)
Comments: Under Review

Abstract:Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its outputs. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Sparse retrieval (REST), which operates on the surface form of strings, is currently the dominant paradigm due to its simplicity and scalability. However, its effectiveness is limited due to the usage of short contexts and exact string matching. Instead, we introduce Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses approximate nearest neighbour search with contextualised token embeddings to retrieve the most semantically relevant token sequences for SD. Extensive experiments show that DReSD achieves (on average) 87% higher acceptance rates, 65% longer accepted tokens and 19% faster generation speeds compared to sparse retrieval (REST).
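
The retrieve-then-verify loop behind retrieval-based speculative decoding can be illustrated with a small sketch. It is not the DReSD code: exact cosine search stands in for approximate nearest neighbour search, the datastore layout (embedding, continuation) is hypothetical, and the target model's output is passed in as a plain list instead of being computed in a forward call.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_draft(query_emb, datastore):
    """Return the stored continuation whose key embedding is closest to the query.

    datastore is a list of (embedding, continuation_tokens) pairs; a real system
    would use an approximate nearest neighbour index instead of a linear scan.
    """
    best_key, best_tokens = max(datastore, key=lambda item: cosine(query_emb, item[0]))
    return best_tokens

def accepted_prefix(draft, target):
    """Number of leading draft tokens that match the target model's own tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n

# Toy datastore: 2-d placeholder embeddings mapped to continuations.
datastore = [([1.0, 0.0], ["the", "cat", "sat"]),
             ([0.0, 1.0], ["a", "dog", "ran"])]
draft = retrieve_draft([0.9, 0.1], datastore)
n_accepted = accepted_prefix(draft, ["the", "cat", "slept"])
```

The acceptance rate reported in the abstract corresponds to how often drafted tokens survive this verification step, which is why retrieving semantically closer continuations can translate into faster generation.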

[NLP-21] Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation

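[Quick Read]: This paper addresses the limitations of existing vision language model (VLM) benchmarks for cross-domain performance comparison and domain-specific evaluation. The key solution is a resource-efficient method for creating domain-specific VLM benchmarks, which uses task augmentation to generate multiple diverse tasks from a single existing task. The paper releases new VLM benchmarks for seven domains, including 162,946 thoroughly human-validated answers, and benchmarks 22 state-of-the-art VLMs on a total of 37,171 tasks, revealing performance differences across domains and tasks and supporting the need for tailored VLM benchmarks.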

Link: https://arxiv.org/abs/2502.15563
Authors: Tim Rädsch, Leon Mayer, Simon Pavicic, A. Emre Kavur, Marcel Knopp, Barış Öztürk, Klaus Maier-Hein, Paul F. Jaeger, Fabian Isensee, Annika Reinke, Lena Maier-Hein
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Reliable evaluation of AI models is critical for scientific progress and practical application. While existing VLM benchmarks provide general insights into model capabilities, their heterogeneous designs and limited focus on a few imaging domains pose significant challenges for both cross-domain performance comparison and targeted domain-specific evaluation. To address this, we propose three key contributions: (1) a framework for the resource-efficient creation of domain-specific VLM benchmarks enabled by task augmentation for creating multiple diverse tasks from a single existing task, (2) the release of new VLM benchmarks for seven domains, created according to the same homogeneous protocol and including 162,946 thoroughly human-validated answers, and (3) an extensive benchmarking of 22 state-of-the-art VLMs on a total of 37,171 tasks, revealing performance variances across domains and tasks, thereby supporting the need for tailored VLM benchmarks. Adoption of our methodology will pave the way for the resource-efficient domain-specific selection of models and guide future research efforts toward addressing core open questions.

[NLP-22] PIP-KAG: Mitigating Knowledge Conflicts in Knowledge-Augmented Generation via Parametric Pruning

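[Quick Read]: This paper addresses conflicts between internal knowledge and external information in knowledge-augmented generation (KAG). Current methods mainly focus on improving the use of external knowledge, but their effectiveness is limited because internal knowledge still influences the generation process of large language models (LLMs). The key solution is PIP-KAG, a parametric pruning-based KAG approach that prunes the internal knowledge of LLMs and adds a plug-and-play adaptation module to help them make better use of external sources. The paper also builds the CoConflictQA benchmark to evaluate contextual faithfulness in question answering. Experiments show that PIP-KAG significantly reduces knowledge conflicts and improves context fidelity while reducing the LLM's parameters by 13%, improving parameter efficiency within the KAG framework.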

Link: https://arxiv.org/abs/2502.15543
Authors: Pengcheng Huang, Zhenghao Liu, Yukun Yan, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong
Affiliations: Department of Computer Science and Technology, Northeastern University, China; Department of Computer Science and Technology, Institute for AI, Tsinghua University, China; Microsoft Research Asia, Beijing, China; Language Technologies Institute, Carnegie Mellon University, United States
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures, 7 tables

Abstract: Knowledge-Augmented Generation (KAG) has shown great promise in updating the internal memory of Large Language Models (LLMs) by integrating external knowledge. However, KAG inevitably faces knowledge conflicts when the internal memory contradicts external information. Current approaches to mitigating these conflicts mainly focus on improving external knowledge utilization. However, these methods have shown only limited effectiveness in mitigating the knowledge conflict problem, as internal knowledge continues to influence the generation process of LLMs. In this paper, we propose a ParametrIc Pruning-based Knowledge-Augmented Generation (PIP-KAG) approach, which prunes internal knowledge of LLMs and incorporates a plug-and-play adaptation module to help LLMs better leverage external sources. Additionally, we construct the CoConflictQA benchmark based on the hallucination of LLMs to better evaluate contextual faithfulness during question answering. Experimental results on CoConflictQA demonstrate that PIP-KAG significantly reduces knowledge conflicts and improves context fidelity. Notably, PIP-KAG reduces the LLM’s parameters by 13%, enhancing parameter efficiency in LLMs within the KAG framework. All code is available at this https URL.

[NLP-23] SOTOPIA-Ω: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents

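[Quick Read]: This paper addresses the lack of research on transferring and integrating human social strategies into social agents. The key solution is the SOTOPIA-Ω framework, which dynamically injects multi-step reasoning strategies inspired by negotiation theory, together with two simple direct strategies, into expert agents. This automates the construction of a high-quality social dialogue training corpus and significantly improves the social capabilities of language agents. The paper also introduces the concept of Social Instruction Following (S-IF) along with evaluation metrics for it, and its analysis validates the advantages of dynamic construction, particularly for breaking an agent's prolonged deadlock.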

Link: https://arxiv.org/abs/2502.15538
Authors: Wenyuan Zhang, Tianyun Liu, Mengxiao Song, Xiaodong Li, Tingwen Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 26 pages, 5 figures, 23 tables

Abstract: Despite the abundance of prior social strategies possessed by humans, there remains a paucity of research dedicated to their transfer and integration into social agents. Our proposed SOTOPIA-Ω framework aims to address and bridge this gap, with a particular focus on enhancing the social capabilities of language agents. This framework dynamically injects multi-step reasoning strategies inspired by negotiation theory, along with two simple direct strategies, into expert agents, thereby automating the construction of a high-quality social dialogue training corpus. Additionally, we introduce the concept of Social Instruction Following (S-IF) and propose two new S-IF evaluation metrics that are complementary to social capability. We demonstrate that several 7B models trained on the high-quality corpus not only significantly surpass the expert agent (GPT-4) in achieving social goals but also enhance S-IF performance. Analysis and variant experiments validate the advantages of dynamic construction, which can especially break the agent’s prolonged deadlock.

[NLP-24] Activation Steering in Neural Theorem Provers

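[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in predicting the next proof step when proving formal theorems with proof assistants such as Lean, in particular in ranking the correct tactic among candidate tactics. The key solution is activation steering, which guides the LLM's responses and improves generations at inference time. The results suggest that activation steering is a lightweight alternative to specialized fine-tuning for enhancing theorem-proving capabilities, and is especially valuable in resource-constrained environments.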

Link: https://arxiv.org/abs/2502.15507
Authors: Shashank Kirtania
Affiliations: Microsoft, India
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Large Language Models (LLMs) have shown promise in proving formal theorems using proof assistants like Lean. However, current state-of-the-art language models struggle to predict the next step in proofs, leading practitioners to use different sampling techniques to improve LLMs' capabilities. We observe that the LLM is capable of predicting the correct tactic; however, it faces challenges in ranking it appropriately within the set of candidate tactics, affecting the overall selection process. To overcome this hurdle, we use activation steering to guide LLMs' responses and improve the generations at inference time. Our results suggest that activation steering offers a promising lightweight alternative to specialized fine-tuning for enhancing theorem proving capabilities in LLMs, particularly valuable in resource-constrained environments.
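
Activation steering in its most common form can be sketched as below. The abstract does not say how the steering vector is built, so the difference-of-means construction, the function names, and the `alpha` scale are assumptions used only to illustrate the general idea; plain Python lists stand in for a model's hidden states.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations: desired behaviour minus undesired behaviour."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations collected from prompts with and without the desired behaviour.
direction = steering_vector([[1.0, 2.0], [3.0, 2.0]], [[0.0, 1.0], [0.0, 1.0]])
steered = steer([0.5, 0.5], direction, alpha=0.5)
```

No weights are updated in this procedure, which is what makes it a lightweight alternative to fine-tuning.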

[NLP-25] Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models

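[Quick Read]: This paper addresses training instability in the pre-training of large language models (LLMs), particularly the gradient explosion and dissipation common in architectures such as Post-Norm Transformers. The key solution is Scale-Distribution Decoupling (SDD), which explicitly decouples the scale and distribution of the weight matrix in fully-connected layers: a normalization mechanism regulates activations, and a learnable scaling vector keeps gradients well-conditioned, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, especially in deep networks, by ensuring stable gradient propagation.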

Link: https://arxiv.org/abs/2502.15499
Authors: Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at this https URL.
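
One way to read "decoupling scale and distribution" is sketched below: the output of a fully-connected layer is normalized, so the magnitude of the weight matrix no longer matters, and a separate learnable vector sets the output scale. The RMS normalization, the name `sdd_linear`, and the exact placement of the scaling vector are assumptions; the paper's formulation may differ.

```python
import math

def matvec(W, x):
    """Matrix-vector product for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sdd_linear(W, x, alpha, eps=1e-8):
    """Fully-connected layer whose output scale is set only by alpha.

    z = W x carries the distribution; dividing by its RMS removes the scale;
    alpha (one entry per output unit) is the learnable scaling vector.
    """
    z = matvec(W, x)
    rms = math.sqrt(sum(v * v for v in z) / len(z) + eps)
    return [a * v / rms for a, v in zip(alpha, z)]

W = [[1.0, 2.0], [3.0, -1.0]]
x = [0.5, 1.0]
alpha = [2.0, 0.5]

y = sdd_linear(W, x, alpha)
# Blowing up the weights by 1000x leaves the output essentially unchanged.
W_big = [[1000.0 * w for w in row] for row in W]
y_big = sdd_linear(W_big, x, alpha)
```

Because the weight magnitude cancels in the forward pass, growth in the norm of `W` during training cannot directly inflate the activations, which is the intuition behind the stability claim.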

[NLP-26] ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models ACL2025

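[Quick Read]: This paper addresses the limited accuracy of large language models (LLMs) on explicit causal reasoning. To this end, the authors introduce ExpliCa, a new dataset that integrates both causal and temporal relations, presented in different linguistic orders and explicitly expressed by linguistic connectives. Testing several LLMs with prompting and perplexity-based metrics shows that even top models struggle to reach 0.80 accuracy. Using ExpliCa, the paper systematically analyzes how models confuse temporal with causal relations and how their performance depends on the linguistic order of events.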

Link: https://arxiv.org/abs/2502.15487
Authors: Martina Miliani, Serenna Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci
Affiliations: CoLing Lab, Department of Philology, Literature, and Linguistics, University of Pisa; Department of Informatics, University of Pisa; Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to ACL 2025

Abstract:Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.

[NLP-27] Enhancing RWKV-based Language Models for Long-Sequence Text Generation

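[Quick Read]: This paper addresses challenges in long-sequence text processing, in particular capturing long-range dependencies in generation tasks. The key solution is an adaptive token shift and gating mechanism that better captures long-range dependencies in text generation.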

Link: https://arxiv.org/abs/2502.15485
Authors: Xinghan Pan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 tables, 3 figures

Abstract:This paper presents an enhanced RWKV-based language generation model designed to improve long-sequence text processing. We propose an adaptive token shift and gating mechanism to better capture long-range dependencies in text generation. Through a series of experiments, we compare the baseline RWKV model with the enhanced model, evaluating performance in terms of forward propagation time, text generation quality, and automatic evaluation metrics such as perplexity, BLEU, and ROUGE. Experimental results show that the enhanced model significantly improves generation quality, especially in BLEU and ROUGE scores, and demonstrates stronger context-capturing ability in long-text generation tasks.

[NLP-28] A fast convergence algorithm based on binary integer programming for expert load balancing in MoE LLMs

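[Quick Read]: This paper addresses the expert load imbalance that arises when pre-training large language models with Mixture-of-Experts (MoE) architectures, which causes routing collapse or increased computational overhead. The key solution is BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q and solves a binary integer program at very small time cost to change the top-K order of s, balancing expert load effectively while preserving pre-training efficiency.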

Link: https://arxiv.org/abs/2502.15451
Authors: Yuan Sun
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract: MoE (Mixture-of-Experts) architectures appear frequently in large language models, and the number of experts in recent models can exceed one hundred. However, the expert load imbalance problem consistently arises in MoE model pre-training, causing routing collapse or increased computational overhead. To balance the load across experts, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q that helps change the top-K order of s by solving a binary integer program at very small time cost. In simulation experiments, we observe that BIP-Based Balancing makes the imbalance disappear very quickly, while the final sum of routing scores decreases very little. Our algorithm achieves a nearly perfect trade-off between expert load balance and pre-training efficiency in the simulation setting.
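
To make the binary-integer-programming view of routing concrete, the sketch below solves a tiny capacity-constrained top-K assignment by brute force. It is an illustration under assumptions, not the paper's algorithm: the objective (maximize the sum of selected routing scores), the per-expert capacity constraint, and the function name are hypothetical, and the auxiliary vector q from the abstract is not modeled.

```python
from itertools import combinations, product

def balanced_topk(scores, k, capacity):
    """Pick k experts per token, at most `capacity` tokens per expert,
    maximizing the total routing score. Exhaustive search, tiny inputs only.

    scores[t][e] is the routing score of token t for expert e.
    Returns a tuple with one tuple of expert indices per token, or None
    if no assignment satisfies the capacity constraint.
    """
    n_experts = len(scores[0])
    choices = list(combinations(range(n_experts), k))
    best, best_val = None, float("-inf")
    for assign in product(choices, repeat=len(scores)):
        load = [0] * n_experts
        for experts in assign:
            for e in experts:
                load[e] += 1
        if max(load) > capacity:
            continue  # violates the load-balance constraint
        val = sum(scores[t][e] for t, experts in enumerate(assign) for e in experts)
        if val > best_val:
            best, best_val = assign, val
    return best

# Every token prefers expert 0, so plain top-1 routing would overload it.
scores = [[0.9, 0.05, 0.05],
          [0.8, 0.1, 0.1],
          [0.7, 0.2, 0.1],
          [0.6, 0.1, 0.3]]
assignment = balanced_topk(scores, k=1, capacity=2)
```

Exhaustive search grows exponentially with the number of tokens, so it only serves to show the formulation; the abstract's point is that the paper's method reaches a balanced assignment at very small time cost.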

[NLP-29] When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

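[Quick Read]: This paper addresses the challenge of deploying large language models (LLMs) on memory-limited devices, where even quantized LLMs still require a large amount of memory. The key solution is a compression framework: compression-aware quantization re-scales model parameters before quantization to make the weights more compressible, and a pruning method improves compressibility further. To overcome the decompression bottleneck, the paper also proposes a speed-adaptive method. Experiments show that inference with the compressed model reduces memory size by 40% with negligible loss in accuracy and inference speed.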

Link: https://arxiv.org/abs/2502.15443
Authors: Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue
Affiliations: City University of Hong Kong; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework to further compress LLMs after quantization, achieving a compression ratio of about 2.2x. A compression-aware quantization is first proposed to enhance model weight compressibility by re-scaling the model parameters before quantization, followed by a pruning method for further improvement. Building on this, we notice that decompression can be a bottleneck in practical scenarios. We then give a detailed analysis of the trade-off between memory usage and latency brought by the proposed method. A speed-adaptive method is proposed to overcome it. The experimental results show that inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.

[NLP-30] Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

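[Quick Read]: This paper addresses the challenges of federated fine-tuning of large language models (LLMs) with Low-Rank Adaptation (LoRA). Existing methods either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from degraded performance. The key solution is Federated Silver Bullet (Fed-SB), which uses LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB learns a small square matrix R between adapters B and A to optimally align the optimization trajectory with the ideal low-rank projection of full fine-tuning, so that directly averaging R guarantees exact updates. This substantially reduces communication cost, keeps it independent of the number of clients, and improves scalability. Fed-SB achieves state-of-the-art performance across multiple tasks while reducing communication cost by up to 230x.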

Link: https://arxiv.org/abs/2502.15436
Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Duke University; University of Illinois Urbana-Champaign; Massachusetts Institute of Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Raghav Singhal and Kaustubh Ponkshe contributed equally to this work

Abstract: Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB establishes a new Pareto frontier in the tradeoff between communication and performance, offering an efficient and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at this https URL.
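
The claim that averaging R gives exact updates follows from linearity: with B and A fixed and shared, the mean of B·R_i·A over clients equals B·mean(R_i)·A. The sketch below checks this numerically and contrasts it with naive averaging of per-client LoRA factors, where the product of the averages generally differs from the average of the products. All matrices are toy values chosen for illustration.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def mean_matrices(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*mats)]

def max_abs_diff(X, Y):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(a - b) for rx, ry in zip(X, Y) for a, b in zip(rx, ry))

# Shared, frozen adapters B (3x2) and A (2x3); each client trains only R (2x2).
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
R_clients = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 2.0], [0.0, -1.0]]]

ideal = mean_matrices([matmul(matmul(B, R), A) for R in R_clients])
fed_sb = matmul(matmul(B, mean_matrices(R_clients)), A)
gap_r = max_abs_diff(ideal, fed_sb)  # zero up to rounding

# Plain LoRA: each client has its own B_i and A_i, averaged separately.
B_clients = [[[1.0], [0.0]], [[0.0], [1.0]]]
A_clients = [[[1.0, 0.0]], [[0.0, 1.0]]]
ideal_lora = mean_matrices([matmul(Bi, Ai) for Bi, Ai in zip(B_clients, A_clients)])
naive_lora = matmul(mean_matrices(B_clients), mean_matrices(A_clients))
gap_lora = max_abs_diff(ideal_lora, naive_lora)  # clearly nonzero
```

Only the small square matrix travels between clients and server in this setup, which is consistent with the abstract's statement that communication cost stays independent of the number of clients.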

[NLP-31] Single-pass Detection of Jailbreaking Input in Large Language Models

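[Quick Read]: This paper addresses the defense of large language models (LLMs) against jailbreaking attacks. Existing methods usually require multiple requests or queries to auxiliary LLMs, which is computationally expensive. The key solution is Single Pass Detection (SPD), which uses the information carried by the logits to predict whether the output sentence will be harmful, so detection and defense take a single forward pass. SPD effectively detects attacks on open-source models, minimizes misclassification of harmless inputs, and remains effective on GPT-3.5 and GPT-4 even without complete logit access.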

Link: https://arxiv.org/abs/2502.15435
Authors: Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher
Affiliations: LIONS - École Polytechnique Fédérale de Lausanne; University of Wisconsin-Madison
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted in TMLR 2025

Abstract: Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD can not only detect attacks effectively on open-source models, but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.

[NLP-32] Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation

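[Quick Read]: This paper addresses the limitation that existing model merging methods are constrained by fixed parameter merging ratios. The key solution is Mixup Model Merge (M^3), which merges the parameters of two large language models (LLMs) using randomly generated linear interpolation ratios, allowing a more flexible and comprehensive exploration of the parameter space.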

Link: https://arxiv.org/abs/2502.15434
Authors: Yue Zhou, Yi Chang, Yuan Wu
Affiliations: School of Artificial Intelligence, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China; International Center of Future Science, Jilin University
Subjects: Computation and Language (cs.CL)
Comments: 15 pages

Abstract: Model merging integrates the parameters of multiple models into a unified model, combining their diverse capabilities. Existing model merging methods are often constrained by fixed parameter merging ratios. In this study, we propose Mixup Model Merge (M^3), an innovative approach inspired by the Mixup data augmentation technique. This method merges the parameters of two large language models (LLMs) by randomly generating linear interpolation ratios, allowing for a more flexible and comprehensive exploration of the parameter space. Extensive experiments demonstrate the superiority of our proposed M^3 method in merging fine-tuned LLMs: (1) it significantly improves performance across multiple tasks, (2) it enhances LLMs’ out-of-distribution (OOD) robustness and adversarial robustness, (3) it achieves superior results when combined with sparsification techniques such as DARE, and (4) it offers a simple yet efficient solution that does not require additional computational resources. In conclusion, M^3 is a simple yet effective model merging method that significantly enhances the performance of the merged model by randomly generating contribution ratios for two fine-tuned LLMs. The code is available at this https URL.
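
Randomized linear interpolation of two parameter sets can be sketched in a few lines. The abstract does not say which distribution the ratio is drawn from; sampling it from a Beta(alpha, alpha) distribution is an assumption borrowed from the original Mixup technique, and the dictionary-of-lists parameter format and the function name are hypothetical.

```python
import random

def mixup_merge(theta_a, theta_b, alpha=2.0, rng=None):
    """Merge two parameter dictionaries with a randomly drawn interpolation ratio.

    lam ~ Beta(alpha, alpha) is an assumed choice; the merged parameters are
    lam * theta_a + (1 - lam) * theta_b, applied to every tensor alike.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    merged = {
        name: [lam * a + (1 - lam) * b for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }
    return merged, lam

# Two toy "fine-tuned models" sharing the same parameter names and shapes.
model_a = {"layer.weight": [0.0, 2.0]}
model_b = {"layer.weight": [1.0, 4.0]}
merged, lam = mixup_merge(model_a, model_b, rng=random.Random(0))
```

Passing a seeded `random.Random` makes a particular merge reproducible, which helps when several sampled ratios are compared on a validation set.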

[NLP-33] Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations

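[Quick Read]: This paper addresses the growing problem of fraudulent practices in biomedical scientific articles, which threatens the credibility and safety of research. The key solution is Pub-Guard-LLM, a system based on a large language model (LLM) and tailored to fraud detection in biomedical scientific articles, with three application modes: vanilla reasoning, retrieval-augmented generation, and multi-agent debate. Each mode produces textual explanations of its predictions, improving both detection performance and explainability.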

Link: https://arxiv.org/abs/2502.15429
Authors: Lihu Chen, Shuojie Fu, Gabriel Freedman, Cemre Zor, Guy Martin, James Kinross, Uddhav Vaghela, Ovidiu Serban, Francesca Toni
Affiliations: Imperial College London; Amazon Web Services; National Health Service
Subjects: Computation and Language (cs.CL)
Comments: long paper under review

Abstract:A significant and growing number of published scientific articles is found to involve fraudulent practices, posing a serious threat to the credibility and safety of research in fields such as medicine. We propose Pub-Guard-LLM, the first large language model-based system tailored to fraud detection of biomedical scientific articles. We provide three application modes for deploying Pub-Guard-LLM: vanilla reasoning, retrieval-augmented generation, and multi-agent debate. Each mode allows for textual explanations of predictions. To assess the performance of our system, we introduce an open-source benchmark, PubMed Retraction, comprising over 11K real-world biomedical articles, including metadata and retraction labels. We show that, across all modes, Pub-Guard-LLM consistently surpasses the performance of various baselines and provides more reliable explanations, namely explanations which are deemed more relevant and coherent than those generated by the baselines when evaluated by multiple assessment methods. By enhancing both detection performance and explainability in scientific fraud detection, Pub-Guard-LLM contributes to safeguarding research integrity with a novel, effective, open-source tool.

[NLP-34] Evaluating Multimodal Generative AI with Korean Educational Standards NAACL2025

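[Quick Read]: This paper addresses the evaluation of multimodal generative AI systems, particularly across the complexity and diversity of different educational levels. The key solution is the Korean National Educational Test Benchmark (KoNET), which comprises four exams based on Korean national educational tests: elementary (KoEGED), middle (KoMGED), high school (KoHGED), and the College Scholastic Ability Test (KoCSAT). These exams are known for their rigorous standards and diverse questions, enabling a comprehensive analysis of AI performance across educational levels. By focusing on Korean, KoNET also provides insights into model performance in less-explored languages.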

Link: https://arxiv.org/abs/2502.15422
Authors: Sanghee Park, Geewook Kim
Affiliations: NAVER Cloud AI; KAIST AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages; To appear at NAACL 2025 Main Conference (Project page: this https URL)

Abstract:This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models - open-source, open-access, and closed APIs - by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be made fully open-sourced at this https URL.

[NLP-35] Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking

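[Quick Read]: This paper addresses the robustness of automatic fact-checking in multilingual settings, where most existing research focuses on English. The key solution is MultiSynFact, a large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs designed to support Spanish, German, English, and other low-resource languages. Its generation pipeline leverages large language models (LLMs), integrates external knowledge from Wikipedia, and applies rigorous claim validation steps to ensure data quality.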

Link: https://arxiv.org/abs/2502.15419
Authors: Yi-Ling Chung, Aurora Cobo, Pablo Serna
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 15 pages, 1 figure, 18 tables

Abstract:Robust automatic fact-checking systems have the potential to combat online misinformation at scale. However, most existing research primarily focuses on English. In this paper, we introduce MultiSynFact, the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs designed to support Spanish, German, English, and other low-resource languages. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia and incorporating rigorous claim validation steps to ensure data quality. We evaluate the effectiveness of MultiSynFact across multiple models and experimental settings. Additionally, we open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.

[NLP-36] MHQA: A Diverse Knowledge Intensive Mental Health Question Answering Challenge for Language Models

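[Quick Read]: This paper addresses the lack of standard benchmark datasets for question answering (QA) in mental health. The key to the solution is MHQA (Mental Health Question Answering), a novel multiple-choice dataset focused on four key domains (anxiety, depression, trauma, and obsessive/compulsive issues) and covering factoid, diagnostic, prognostic, and preventive question types. A rigorous pipeline extracts information from PubMed abstracts according to various selection criteria and converts it into QA pairs, and post-hoc validation criteria then filter for valid pairs. The final dataset contains 2,475 expert-verified gold-standard instances (MHQA-gold) and about 56.1k pairs pseudo-labeled using external medical references.

Link: https://arxiv.org/abs/2502.15418
Authors: Suraj Racha, Prashant Joshi, Anshika Raman, Nikita Jangid, Mridul Sharma, Ganesh Ramakrishnan, Nirmal Punjabi
Affiliations: Indian Institute of Technology Bombay
Categories: Computation and Language (cs.CL)
Comments: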

Abstract:Mental health remains a challenging problem all over the world, with issues like depression and anxiety becoming increasingly common. Large Language Models (LLMs) have seen wide application in healthcare, specifically in answering medical questions. However, there is a lack of standard benchmarking datasets for question answering (QA) in mental health. Our work presents a novel multiple-choice dataset, MHQA (Mental Health Question Answering), for benchmarking language models (LMs). Previous mental health datasets have focused primarily on text classification into specific labels or disorders. MHQA, on the other hand, presents question answering for mental health focused on four key domains: anxiety, depression, trauma, and obsessive/compulsive issues, with diverse question types, namely factoid, diagnostic, prognostic, and preventive. We use PubMed abstracts as the primary source for QA. We develop a rigorous pipeline for LLM-based identification of information from abstracts based on various selection criteria and for converting it into QA pairs. Further, valid QA pairs are extracted based on post-hoc validation criteria. Overall, our MHQA dataset consists of 2,475 expert-verified gold-standard instances called MHQA-gold and ~56.1k pairs pseudo-labeled using external medical references. We report F1 scores on different LLMs along with few-shot and supervised fine-tuning experiments, and discuss insights from the scores.

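[NLP-37] Textual-to-Visual Iterative Self-Verification for Slide Generation

[Quick Read]: This paper addresses the need to automate presentation slide generation, a time-consuming task. Existing autonomous agents based on large language models (LLMs) are constrained in real-world use by limited flexibility and the lack of automated refinement mechanisms. The key to the solution is to decompose the generation of missing slides into content generation and layout generation. For content, the method improves coherence and relevance by incorporating context from surrounding slides and using section retrieval strategies. For layout, it proposes a textual-to-visual self-verification process with an LLM-based Reviewer + Refiner workflow, which converts complex textual layouts into intuitive visual formats and thereby simplifies the task, enabling accurate, human-like review and refinement. Experiments show that the approach significantly outperforms baselines in alignment, logical flow, visual appeal, and readability.

Link: https://arxiv.org/abs/2502.15412
Authors: Yunqing Xu, Xinbei Ma, Jiyang Qiu, Hai Zhao
Affiliations: Xi’an Jiao Tong University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: Work in progress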

Abstract:Generating presentation slides is a time-consuming task that urgently requires automation. Due to their limited flexibility and lack of automated refinement mechanisms, existing autonomous LLM-based agents face constraints in real-world applicability. We decompose the task of generating missing presentation slides into two key components: content generation and layout generation, aligning with the typical process of creating academic slides. First, we introduce a content generation approach that enhances coherence and relevance by incorporating context from surrounding slides and leveraging section retrieval strategies. For layout generation, we propose a textual-to-visual self-verification process using a LLM-based Reviewer + Refiner workflow, transforming complex textual layouts into intuitive visual formats. This modality transformation simplifies the task, enabling accurate and human-like review and refinement. Experiments show that our approach significantly outperforms baseline methods in terms of alignment, logical flow, visual appeal, and readability.

[NLP-38] HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

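[Quick Read]: This paper addresses the limited cross-domain label transferability caused by the highly complex and granular taxonomy defined by iXBRL. The key to the solution is the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, which organizes a hierarchy of 218,126 labels with a taxonomy-based grouping method and investigates which taxonomy layer provides the most meaningful structure. HiFi-KPI comprises about 1.8M paragraphs and about 5M entities, each linked to a label in the iXBRL-specific calculation and presentation taxonomies. The paper also provides baselines using encoder-based approaches and structured extraction with large language models (LLMs), and releases HiFi-KPI Lite, a subset that simplifies LLM inference and evaluation.

Link: https://arxiv.org/abs/2502.15411
Authors: Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva
Affiliations: Department of Computer Science, Aalborg University, Denmark; ALIPES ApS, Denmark
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: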

Abstract:The U.S. Securities and Exchange Commission (SEC) requires that public companies file financial reports tagging numbers with the machine readable inline eXtensible Business Reporting Language (iXBRL) standard. However, the highly complex and highly granular taxonomy defined by iXBRL limits label transferability across domains. In this paper, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, designed to facilitate numerical KPI extraction at specified levels of granularity from unstructured financial text. Our approach organizes a 218,126-label hierarchy using a taxonomy based grouping method, investigating which taxonomy layer provides the most meaningful structure. HiFi-KPI comprises ~1.8M paragraphs and ~5M entities, each linked to a label in the iXBRL-specific calculation and presentation taxonomies. We provide baselines using encoder-based approaches and structured extraction using Large Language Models (LLMs). To simplify LLM inference and evaluation, we additionally release HiFi-KPI Lite, a manually curated subset with four expert-mapped labels. We publicly release all artifacts

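[NLP-39] Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning

[Quick Read]: This paper addresses the selection and ordering of demonstration examples in in-context learning (ICL), with the goal of improving the complex reasoning capabilities of large language models (LLMs). The key is a curriculum ICL strategy guided by problem-solving logic: demonstration examples are selected by analyzing their problem-solving logic and then ordered from easy to hard following the principles of curriculum learning. The method improves both the complex reasoning performance and the efficiency of LLMs.

Link: https://arxiv.org/abs/2502.15401
Authors: Xuetao Ma, Wenbin Jiang, Hua Huang
Affiliations: School of Artificial Intelligence, Beijing Normal University
Categories: Computation and Language (cs.CL)
Comments: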

Abstract:In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be publicly available subsequently.
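
The easy-to-hard ordering described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Demo` class, its `steps` field, and the prompt template are hypothetical, and the step lists are assumed to come from some upstream analyzer (the paper fine-tunes a language model for that).

```python
from dataclasses import dataclass


@dataclass
class Demo:
    question: str
    answer: str
    steps: list[str]  # hypothetical: problem-solving steps from an upstream analyzer


def order_easy_to_hard(demos: list[Demo]) -> list[Demo]:
    """Sort demonstrations by step count, fewest steps first (stable sort)."""
    return sorted(demos, key=lambda d: len(d.steps))


def build_prompt(demos: list[Demo], query: str) -> str:
    """Concatenate the ordered demonstrations, then append the target question."""
    parts = [f"Q: {d.question}\nA: {d.answer}" for d in order_easy_to_hard(demos)]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


demos = [
    Demo("hard?", "h", ["select", "filter", "aggregate"]),
    Demo("easy?", "e", ["select"]),
    Demo("medium?", "m", ["select", "filter"]),
]
ordered = [d.question for d in order_easy_to_hard(demos)]
prompt = build_prompt(demos, "new?")
```

Using the number of steps as the difficulty measure follows the abstract; how the demonstrations are selected in the first place is left out of this sketch.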

[NLP-40] Chitrarth: Bridging Vision and Language for a Billion People

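[Quick Read]: This paper addresses the fact that existing multimodal foundation models are trained mainly on English or high-resource European language data, which limits their applicability to medium- and low-resource languages. To address this limitation, it proposes Chitrarth (Chitra: Image; Artha: Meaning), an inclusive Vision-Language Model (VLM) that targets linguistic diversity and visual reasoning across 10 prominent Indian languages. The key to the solution is integrating a state-of-the-art multilingual Large Language Model (LLM) with a vision module and training primarily on multilingual image-text data. The paper also introduces BharatBench, a framework for comprehensively evaluating VLMs across Indian languages. The model achieves state-of-the-art results on low-resource language benchmarks while retaining its efficiency in English.

Link: https://arxiv.org/abs/2502.15392
Authors: Shaharukh Khan, Ayush Tarun, Abhinav Ravi, Ali Faraz, Akshat Patidar, Praveen Kumar Pokala, Anagha Bhangare, Raja Kolla, Chandra Khatri, Shubham Agarwal
Affiliations: Krutrim AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: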

Abstract:Recent multimodal foundation models are primarily trained on English or high resource European language data, which hinders their applicability to other medium and low-resource languages. To address this limitation, we introduce Chitrarth (Chitra: Image; Artha: Meaning), an inclusive Vision-Language Model (VLM), specifically targeting the rich linguistic diversity and visual reasoning across 10 prominent Indian languages. Our model effectively integrates a state-of-the-art (SOTA) multilingual Large Language Model (LLM) with a vision module, primarily trained on multilingual image-text data. Furthermore, we also introduce BharatBench, a comprehensive framework for evaluating VLMs across various Indian languages, ultimately contributing to more diverse and effective AI systems. Our model achieves SOTA results for benchmarks across low resource languages while retaining its efficiency in English. Through our research, we aim to set new benchmarks in multilingual-multimodal capabilities, offering substantial improvements over existing models and establishing a foundation to facilitate future advancements in this arena.

[NLP-41] Identifying Features that Shape Perceived Consciousness in Large Language Model-based AI: A Quantitative Study of Human Responses

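[Quick Read]: This paper quantitatively analyzes which features lead humans to perceive subjective consciousness in text generated by a large language model (LLM). Using 99 passages from conversations with Claude 3 Opus and focusing on eight features (metacognitive self-reflection, logical reasoning, empathy, emotionality, knowledge, fluency, unexpectedness, and subjective expressiveness), the authors surveyed 123 participants and applied regression and clustering analyses to examine how these features influence perceptions of AI consciousness. The key finding is that metacognitive self-reflection and the AI's expression of its own emotions significantly increased perceived consciousness, while a heavy emphasis on knowledge reduced it. Participants clustered into seven subgroups, each with a distinct feature-weighting pattern. The study underscores the multidimensional and individualized nature of perceived AI consciousness and lays a foundation for understanding the psychosocial implications of human-AI interaction.

Link: https://arxiv.org/abs/2502.15365
Authors: Kang Bongsu, Kim Jundong, Yun Tae-Rim, Bae Hyojin, Kim Chang-Eop
Affiliations: Gachon University College of Korean Medicine; Seoul National University College of Medicine
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 11 pages, 3 figures, 4 tables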

Abstract:This study quantitatively examines which features of AI-generated text lead humans to perceive subjective consciousness in large language model (LLM)-based AI systems. Drawing on 99 passages from conversations with Claude 3 Opus and focusing on eight features – metacognitive self-reflection, logical reasoning, empathy, emotionality, knowledge, fluency, unexpectedness, and subjective expressiveness – we conducted a survey with 123 participants. Using regression and clustering analyses, we investigated how these features influence participants’ perceptions of AI consciousness. The results reveal that metacognitive self-reflection and the AI’s expression of its own emotions significantly increased perceived consciousness, while a heavy emphasis on knowledge reduced it. Participants clustered into seven subgroups, each showing distinct feature-weighting patterns. Additionally, higher prior knowledge of LLMs and more frequent usage of LLM-based chatbots were associated with greater overall likelihood assessments of AI consciousness. This study underscores the multidimensional and individualized nature of perceived AI consciousness and provides a foundation for better understanding the psychosocial implications of human-AI interaction.

[NLP-42] Evaluating Social Biases in LLM Reasoning

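[Quick Read]: This paper evaluates the bias that large language models (LLMs) introduce during chain-of-thought reasoning, with particular attention to how that bias is amplified through reasoning steps. The key is a comparison of the 8B and 32B variants of DeepSeek-R1 with their instruction-tuned counterparts on the BBQ dataset, in order to investigate the bias that is elicited and amplified during reasoning.

Link: https://arxiv.org/abs/2502.15361
Authors: Xuyang Wu, Jinming Nian, Zhiqiang Tao, Yi Fang
Affiliations: Santa Clara University; Rochester Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: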

Abstract:In the recent development of AI reasoning, large language models (LLMs) are trained to automatically generate chain-of-thought reasoning steps, which have demonstrated compelling performance on math and coding tasks. However, when bias is mixed within the reasoning process to form strong logical arguments, it could cause even more harmful results and further induce hallucinations. In this paper, we evaluate the 8B and 32B variants of DeepSeek-R1 against their instruction-tuned counterparts on the BBQ dataset, and investigate the bias that is elicited and amplified through reasoning steps. To the best of our knowledge, this empirical study is the first to assess bias issues in LLM reasoning.

[NLP-43] ARS: Automatic Routing Solver with Large Language Models

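[Quick Read]: This paper addresses the difficulty of manually designing solvers for real-world Vehicle Routing Problems (VRPs), which involve complex and varied practical constraints. It introduces RoutBench, a benchmark of 1,000 VRP variants derived from 24 attributes, for evaluating how well automatic routing solvers handle complex constraints. It also proposes the Automatic Routing Solver (ARS), which uses Large Language Model (LLM) agents to enhance a backbone algorithm framework by automatically generating constraint-aware heuristic code from problem descriptions and several representative constraints selected from a database. Experiments show that ARS automatically solves 91.67% of common VRPs and achieves at least a 30% improvement across all benchmarks.

Link: https://arxiv.org/abs/2502.15359
Authors: Kai Li, Fei Liu, Zhenkun Wang, Xialiang Tong, Xiongwei Han, Mingxuan Yuan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: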

Abstract:Real-world Vehicle Routing Problems (VRPs) are characterized by a variety of practical constraints, making manual solver design both knowledge-intensive and time-consuming. Although there is increasing interest in automating the design of routing algorithms, existing research has explored only a limited array of VRP variants and fails to adequately address the complex and prevalent constraints encountered in real-world situations. To fill this gap, this paper introduces RoutBench, a benchmark of 1,000 VRP variants derived from 24 attributes, for evaluating the effectiveness of automatic routing solvers in addressing complex constraints. Along with RoutBench, we present the Automatic Routing Solver (ARS), which employs Large Language Model (LLM) agents to enhance a backbone algorithm framework by automatically generating constraint-aware heuristic code, based on problem descriptions and several representative constraints selected from a database. Our experiments show that ARS outperforms state-of-the-art LLM-based methods and commonly used solvers, automatically solving 91.67% of common VRPs and achieving at least a 30% improvement across all benchmarks.

[NLP-44] AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms

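[Quick Read]: This paper addresses the problem of optimizing the performance of attention mechanisms, particularly their adaptability and efficiency across hardware platforms. Current optimization strategies are often narrowly focused and require extensive manual intervention to accommodate changes in model configuration or hardware environment. The key to the solution is AttentionEngine, a comprehensive framework that decomposes attention computation into modular operations with customizable components and automates kernel optimization through programmable templates combined with a robust cross-platform scheduling strategy. This lets AttentionEngine adapt flexibly to diverse algorithmic requirements, with performance gains of up to 10x on configurations beyond the reach of existing methods.

Link: https://arxiv.org/abs/2502.15349
Authors: Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 15 pages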

Abstract:Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at this https URL.

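[NLP-45] Constructing a Norm for Children's Scientific Drawing: Distribution Features Based on Semantic Similarity of Large Language Models

[Quick Read]: This paper addresses two main problems in research that uses children's drawings: (1) the content of the drawings depends heavily on the task, so ecological validity is low; and (2) the interpretation of drawings relies too much on researchers' subjective impressions. The key to the solution is to use a large language model (LLM) to identify 1,420 children's scientific drawings covering 9 scientific themes and to compute their semantic similarity with the word2vec algorithm. On this basis, the paper examines whether children's drawings on the same theme are represented consistently and attempts to establish a norm for children's scientific drawings as a baseline reference for follow-up research.

Link: https://arxiv.org/abs/2502.15348
Authors: Yi Zhang, Fan Wei, Jingyi Li, Yan Wang, Yanyan Yu, Jianli Chen, Zipo Cai, Xinyu Liu, Wei Wang, Peng Wang, Zhong Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: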

Abstract:The use of children’s drawings to examine their conceptual understanding has been proven to be an effective method, but there are two major problems with previous research: 1. The content of the drawings heavily relies on the task, and the ecological validity of the conclusions is low; 2. The interpretation of drawings relies too much on the subjective feelings of the researchers. To address these issues, this study uses a Large Language Model (LLM) to identify 1420 children’s scientific drawings (covering 9 scientific themes/concepts), and uses the word2vec algorithm to calculate their semantic similarity. The study explores whether there are consistent drawing representations for children on the same theme, and attempts to establish a norm for children’s scientific drawings, providing a baseline reference for follow-up children’s drawing research. The results show that the representation of most drawings is consistent, with most semantic similarity values greater than 0.8. At the same time, it was found that the consistency of the representation is independent of the accuracy (of the LLM’s recognition), indicating the existence of consistency bias. In the subsequent exploration of influencing factors, we used the Kendall rank correlation coefficient to investigate the effects of Sample Size, Abstract Degree, and Focus Points on drawings, and used word frequency statistics to explore whether children represented abstract themes/concepts by reproducing what was taught in class.
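
The consistency measure reported above (most semantic similarity values above 0.8) can be illustrated with a small sketch. It assumes that similarity means cosine similarity between embedding vectors, which is the usual choice with word2vec but is not stated in the abstract. The three-dimensional vectors are made up for illustration and are not word2vec output.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def share_above(sims: list[float], threshold: float = 0.8) -> float:
    """Fraction of similarity values strictly above the threshold."""
    return sum(s > threshold for s in sims) / len(sims)


# Toy embeddings: one theme vector and three recognized-drawing vectors.
theme = [1.0, 0.0, 1.0]
drawings = [[1.0, 0.1, 0.9], [0.9, 0.0, 1.0], [0.0, 1.0, 0.0]]
sims = [cosine_similarity(theme, d) for d in drawings]
consistent = share_above(sims)
```

Here two of the three toy drawings point in nearly the same direction as the theme vector and the third is orthogonal to it, so two thirds of the values clear the 0.8 threshold.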

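[NLP-46] Tokenization is Sensitive to Language Variation

[Quick Read]: This paper investigates how tokenizers behave differently on language variation and how those differences affect downstream large language model (LLM) performance. It considers two types of tasks: tasks where the model should be robust to language variation (such as natural language inference, NLI) and tasks where the model should be sensitive to it (such as authorship verification). The key is to pre-train BERT base models to analyze how tokenizer design choices (fitting corpus, pre-tokenizer, and vocabulary size) affect model performance, and to propose a new approach for estimating tokenizer impact on downstream LLM performance that improves significantly over techniques such as Rényi efficiency. The results show that the best tokenizer depends on the task type, with the pre-tokenizer having the largest impact on performance.

Link: https://arxiv.org/abs/2502.15343
Authors: Anna Wegmann, Dong Nguyen, David Jurgens
Affiliations: Utrecht University; University of Michigan
Categories: Computation and Language (cs.CL)
Comments: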

Abstract:Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: Tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models for the popular Byte-Pair Encoding algorithm to investigate how key algorithmic design choices impact downstream models’ performances: fitting corpus, pre-tokenizer and vocabulary size. We find that the best tokenizer varies on the two task types – with the pre-tokenizer having the biggest impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing significant improvement over techniques like Rényi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance.

[NLP-47] Stepwise Informativeness Search for Improving LLM Reasoning

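[Quick Read]: This paper addresses the tendency of large language models (LLMs) to lose focus in long-context reasoning: in multi-step reasoning they may overlook information from earlier steps and generate unreliable, redundant rationales. The key to the solution is stepwise informativeness search, an inference-time tree search framework with two selection heuristics: grounding-guided selection and novelty-guided selection. In addition, a self-grounding strategy prompts the LLM to explicitly reference relevant prior steps as premises at each reasoning step, yielding more accurate and concise step-by-step rationales.

Link: https://arxiv.org/abs/2502.15335
Authors: Siyuan Wang, Enda Zhao, Zhongyu Wei, Xiang Ren
Affiliations: University of Southern California; Tsinghua University; Fudan University
Categories: Computation and Language (cs.CL)
Comments: Preprint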

Abstract:Advances in Large Language Models (LLMs) have significantly improved multi-step reasoning through generating free-text rationales. However, recent studies show that LLMs tend to lose focus over the middle of long contexts. This raises concerns that as reasoning progresses, LLMs may overlook information in earlier steps when decoding subsequent steps, leading them to generate unreliable and redundant rationales. To address this, we propose guiding LLMs to generate more accurate and concise step-by-step rationales by (1) proactively referencing information from underutilized prior steps, and (2) minimizing redundant information between new and existing steps. We introduce stepwise informativeness search, an inference-time tree search framework incorporating two selection heuristics: grounding-guided selection, which prioritizes steps paying higher attention to underutilized steps; and novelty-guided selection, which encourages steps with novel conclusions. During rationale generation, we use a self-grounding strategy that prompts LLMs to explicitly reference relevant prior steps to provide premises before deduction at each step. Experimental results on four reasoning datasets demonstrate that our approach improves reasoning accuracy by generating higher-quality rationales with reduced errors and redundancy.

[NLP-48] Detecting Future-related Contexts of Entity Mentions

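[Quick Read]: This paper addresses the automatic identification of whether an entity is implicitly referenced in a future context, which has applications in decision making, planning, and trend forecasting. Responding to the growing need for automated temporal analysis in information processing, it builds a novel dataset of 19,540 sentences centered on popular entities from Wikipedia, comprising future-related and non-future-related contexts in which those entities appear. The key contribution is an evaluation of several language models, including large language models, on distinguishing future-oriented content in the absence of explicit temporal references.

Link: https://arxiv.org/abs/2502.15332
Authors: Puneet Prashar, Krishna Mohan Shukla, Adam Jatowt
Affiliations: Rajiv Gandhi Institute of Petroleum Technology; University of Innsbruck
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: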

Abstract:The ability to automatically identify whether an entity is referenced in a future context can have multiple applications, including decision making, planning, and trend forecasting. This paper focuses on detecting implicit future references in entity-centric texts, addressing the growing need for automated temporal analysis in information processing. We first present a novel dataset of 19,540 sentences built around popular entities sourced from Wikipedia, which consists of future-related and non-future-related contexts in which those entities appear. As a second contribution, we evaluate the performance of several language models, including Large Language Models (LLMs), on the task of distinguishing future-oriented content in the absence of explicit temporal references.

[NLP-49] SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention

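[Quick Read]: This paper addresses the effective compression of the key-value (KV) cache for efficient inference with Large Language Models (LLMs). The key to the solution is SVDq, a mixed-precision quantization method for the key (K) cache based on Singular Value Decomposition (SVD). The method first transforms the K cache into latent channels using an SVD basis representation, then applies importance-aware quantization and compression to those channels so that higher precision is allocated to the more significant ones. The authors prove theoretically that the resulting quantization error is much lower (one tenth or less) than that of per-channel key quantization in the original space. Experiments show that SVDq achieves an equivalent key cache precision as low as 1.25 bits and, combined with key sparsity, a key compression ratio of up to 410x while maintaining comparable model performance.

Link: https://arxiv.org/abs/2502.15304
Authors: Hong Yankun, Li Xing, Zhen Hui-Ling, Yu Xianzhi, Liu Wulong, Yuan Mingxuan
Affiliations: Huawei Noah’s Ark Lab
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: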

Abstract:For the efficient inference of Large Language Models (LLMs), the effective compression of key-value (KV) cache is essential. Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD) - based mixed precision quantization method for K cache. Initially, K cache is transformed into latent channels using SVD basis representations. Since the values in latent channels decay rapidly and become negligible after only a few latent channels, our method then incorporates importance-aware quantization and compression for latent channels. This enables the effective allocation of higher precision to more significant channels. Theoretically, we prove that SVDq results in quantization errors (x0.1 or even lower) that are much lower than those of per-channel key quantization in the original space. Our findings based on RULER and LongBench benchmarks demonstrate that SVDq can achieve an equivalent key cache precision as low as 1.25-bit. When combined with key sparsity, it can reach a key compression ratio of up to 410x for attention computation, all while maintaining comparable model performance. Notably, our method is nearly lossless for LongBench datasets. This indicates that SVDq enables high-precision low-bit quantization, providing a more efficient solution for KV cache compression in LLMs.
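
A short worked calculation may help connect the two headline numbers, the 1.25-bit equivalent precision and the 410x ratio. Everything specific below is an assumption made for illustration and is not taken from the paper: the 128-channel bit schedule, the 16-bit baseline, and the 32x sparsity factor. The sketch only shows how a mixed-precision schedule averages out and how quantization and sparsity ratios multiply.

```python
def average_bits(allocation: list[tuple[int, int]]) -> float:
    """Average bits per channel for a list of (num_channels, bits) groups."""
    total_channels = sum(n for n, _ in allocation)
    total_bits = sum(n * b for n, b in allocation)
    return total_bits / total_channels


# Hypothetical schedule for 128 latent channels ordered by singular value:
# more bits for the leading channels, none for the negligible tail.
schedule = [(16, 4), (32, 2), (32, 1), (48, 0)]
avg = average_bits(schedule)  # (64 + 64 + 32 + 0) / 128

baseline_bits = 16   # assumption: 16-bit keys before quantization
sparsity_ratio = 32  # assumption: only 1 in 32 keys is kept

quant_ratio = baseline_bits / avg
total_ratio = quant_ratio * sparsity_ratio
```

Under these assumptions the schedule averages 1.25 bits per channel, quantization alone gives 12.8x, and the combined ratio is about 409.6x, which is close to the reported 410x.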

[NLP-50] Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

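[Quick Read]: This paper addresses the high GPU memory usage and reduced system efficiency that large language models (LLMs) incur from storing a large KV cache during multi-round conversations. The key to the solution is Round Attention, a novel round-level attention mechanism that recalls and computes only the KV cache of the most relevant conversation rounds, saving 55% of memory usage without compromising model performance.

Link: https://arxiv.org/abs/2502.15294
Authors: Yaohua Tang, Zhicheng Hu, Kun Cheng, Fan Mo, Qiheng Lv, Hua Wang, Zhi Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: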

Abstract:The increasing context window size in large language models (LLMs) has improved their ability to handle complex, long-text tasks. However, as the conversation rounds continue, it is required to store a large amount of KV cache in GPU memory, which significantly affects the efficiency and even availability of the model serving systems. This paper analyzes dialogue data from real users and discovers that the LLM inference manifests a watershed layer, after which the distribution of round-level attention shows notable similarity. We propose Round Attention, a novel round-level attention mechanism that only recalls and computes the KV cache of the most relevant rounds. The experiments show that our method saves 55% memory usage without compromising model performance.
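
The idea of recalling only the most relevant rounds can be sketched as a top-k selection over per-round scores. This is a toy illustration under stated assumptions, not the paper's mechanism: the function names are hypothetical, the scores stand in for round-level attention mass measured at the watershed layer, and strings stand in for cached key and value tensors.

```python
def select_rounds(round_scores: list[float], k: int) -> list[int]:
    """Return indices of the k highest-scoring rounds, in chronological order."""
    if k <= 0:
        return []
    ranked = sorted(range(len(round_scores)),
                    key=lambda i: round_scores[i], reverse=True)
    return sorted(ranked[:k])


def gather_kv(kv_by_round: list[list[str]], keep: list[int]) -> list[str]:
    """Concatenate the cached entries of the kept rounds only."""
    return [entry for i in keep for entry in kv_by_round[i]]


# Toy cache: each round holds placeholder KV entries.
kv_by_round = [["r0a", "r0b"], ["r1a"], ["r2a", "r2b"], ["r3a"]]
scores = [0.10, 0.55, 0.05, 0.30]  # hypothetical round-level attention mass
keep = select_rounds(scores, k=2)
active = gather_kv(kv_by_round, keep)
```

Keeping the selected rounds in chronological order preserves the original token order of the entries that remain in the cache.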

[NLP-51] Analyzing the Inner Workings of Transformers in Compositional Generalization NAACL2025

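[Quick Read]: This paper addresses the unclear internal mechanisms behind the compositional generalization abilities of neural models. The key is to find a subnetwork that contributes to generalization performance and to run causal analyses of how the model uses syntactic features. The model depends on syntactic features to output the correct answer, but the subnetwork, which generalizes much better than the whole model, relies on a non-compositional algorithm in addition to syntactic features. The subnetwork's generalization performance also improves relatively slowly during training compared with its in-distribution performance, and the non-compositional solution is acquired in the early stages of training.

Link: https://arxiv.org/abs/2502.15277
Authors: Ryoma Kumon, Hitomi Yanaka
Affiliations: The University of Tokyo
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025 main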

Abstract:The compositional generalization abilities of neural models have been sought after for human-like linguistic competence. The popular method to evaluate such abilities is to assess the models’ input-output behavior. However, that does not reveal the internal mechanisms, and the underlying competence of such models in compositional generalization remains unclear. To address this problem, we explore the inner workings of a Transformer model by finding an existing subnetwork that contributes to the generalization performance and by performing causal analyses on how the model utilizes syntactic features. We find that the model depends on syntactic features to output the correct answer, but that the subnetwork with much better generalization performance than the whole model relies on a non-compositional algorithm in addition to the syntactic features. We also show that the subnetwork improves its generalization performance relatively slowly during the training compared to the in-distribution one, and the non-compositional solution is acquired in the early stages of the training.

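[NLP-52] A Training-free LLM-based Approach to General Chinese Character Error Correction

[Quick Read]: This paper addresses General Chinese Character Error Correction (C2EC), which covers all three types of character errors, including the missing and redundant characters that existing datasets often overlook. The key to the solution is extending a training-free, prompt-free Chinese spelling correction (CSC) method to C2EC by using Levenshtein distance to handle length changes and adding a prompt-based large language model (LLM) to improve performance. Experiments show that the method lets a 14B-parameter LLM match models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.

Link: https://arxiv.org/abs/2502.15266
Authors: Houquan Zhou, Bo Zhang, Zhenghua Li, Ming Yan, Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 12 figures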

Abstract:Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated. This issue limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from CCTC and Lemon datasets. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to be on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.
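
Since the abstract relies on Levenshtein distance to handle length changes, a standard dynamic-programming version is sketched below. It shows that substituted, missing, and redundant characters each cost one edit. How the paper plugs this distance into its decoding procedure is not described in the abstract, so the sketch covers the distance only, and the example sentences are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


correct = "我喜欢吃苹果"
substituted = levenshtein("我喜欢吃平果", correct)   # wrong character
missing = levenshtein("我喜吃苹果", correct)         # one character dropped
redundant = levenshtein("我喜欢欢吃苹果", correct)   # one character repeated
```

The two-row formulation keeps memory linear in the length of the second string, which is enough for sentence-length inputs.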

[NLP-53] Retrieval-Augmented Speech Recognition Approach for Domain Challenges

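[Quick Read]: This paper addresses the domain mismatch that speech recognition systems face in real-world applications, especially when domain-specific data is unavailable. It proposes a retrieval-augmented speech recognition method based on a large language model (LLM). The key is to improve recognition with domain-specific textual data supplied at inference time rather than relying on such data during training. Drawing on Retrieval-Augmented Generation (RAG), the method efficiently accesses locally available domain-specific documents, which makes handling domain mismatch convenient and effective. Experiments show significantly improved recognition accuracy and state-of-the-art results on the CSJ dataset.

Link: https://arxiv.org/abs/2502.15264
Authors: Peng Shen, Xugang Lu, Hisashi Kawai
Affiliations: National Institute of Information and Communications Technology
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: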

Abstract:Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces a LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to utilize textual information provided in prompts for LLM decoder to improve speech recognition performance. Benefiting from the advantages of the RAG retrieval mechanism, our approach efficiently accesses locally available domain-specific documents, ensuring a convenient and effective process for solving domain mismatch problems. Experiments conducted on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results on the CSJ dataset, even without relying on the full training data.
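
The inference-time retrieval step can be sketched roughly as follows. The abstract does not specify the retriever or the prompt format, so everything here is a stand-in: word-level Jaccard overlap replaces the actual retrieval mechanism, a first-pass transcript is assumed to serve as the query, and the prompt template is hypothetical. English toy documents are used because whitespace tokenization would not work for the Japanese CSJ data.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two strings (0.0 if both are empty)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents with the highest overlap with the query."""
    return sorted(docs, key=lambda d: jaccard(query, d), reverse=True)[:k]


def build_prompt(query: str, docs: list[str], k: int = 1) -> str:
    """Hypothetical prompt template: retrieved text first, then the instruction."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Domain context:\n{context}\n\nTranscribe the audio using the context above."


docs = [
    "acute myocardial infarction treatment guidelines",
    "quarterly earnings and revenue report",
]
first_pass = "treatment of acute myocardial infarction"  # assumed first-pass transcript
top = retrieve(first_pass, docs, k=1)
prompt = build_prompt(first_pass, docs, k=1)
```

In a real system the retrieved text would be passed to the LLM decoder alongside the audio representation; the sketch stops at prompt construction.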

[NLP-54] Corrections Meet Explanations: A Unified Framework for Explainable Grammatical Error Correction

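[Quick read]: This paper addresses the explainability challenge facing Grammatical Error Correction (GEC) systems, especially those serving language learners. Existing work mostly explains grammatical errors that were extracted in advance and overlooks the relationship between explanations and corrections. To fill this gap, the paper introduces EXGEC, a unified explainable GEC framework that combines the explanation and correction tasks generatively, on the view that the two tasks reinforce each other. The paper also detects significant noise in the EXPECT dataset and proposes a denoised alternative, EXPECT-denoised, for more objective training and evaluation. Across several NLP models, EXGEC models outperform single-task baselines on both tasks.

Link: https://arxiv.org/abs/2502.15261
Authors: Jingheng Ye,Shang Qin,Yinghui Li,Hai-Tao Zheng,Shen Wang,Qingsong Wen
Affiliations: Tsinghua University; Peng Cheng Laboratory; Squirrel Ai Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 2 figures, and 9 tables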

Abstract:Grammatical Error Correction (GEC) faces a critical challenge concerning explainability, notably when GEC systems are designed for language learners. Existing research predominantly focuses on explaining grammatical errors extracted in advance, thus neglecting the relationship between explanations and corrections. To address this gap, we introduce EXGEC, a unified explainable GEC framework that integrates explanation and correction tasks in a generative manner, advocating that these tasks mutually reinforce each other. Experiments have been conducted on EXPECT, a recent human-labeled dataset for explainable GEC, comprising around 20k samples. Moreover, we detect significant noise within EXPECT, potentially compromising model training and evaluation. Therefore, we introduce an alternative dataset named EXPECT-denoised, ensuring a more objective framework for training and evaluation. Results on various NLP models (BART, T5, and Llama3) show that EXGEC models surpass single-task baselines in both tasks, demonstrating the effectiveness of our approach.

[NLP-55] LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design DATE2025

[Quick read]: This paper addresses the difficulty of accelerating Mamba inference: scattered activation outliers and complex computation dependencies make existing large language model (LLM) accelerators inefficient. The proposed solution, LightMamba, co-designs a quantization algorithm and an FPGA accelerator architecture for efficient Mamba inference. LightMamba first applies an FPGA-friendly post-training quantization algorithm that uses rotation-assisted quantization and power-of-two state space model (SSM) quantization to reduce most computation to 4-bit. It then adds an FPGA accelerator that partially unrolls the Mamba computation and, through computation reordering and fine-grained tiling and fusion, substantially improves hardware utilization and memory efficiency.

Link: https://arxiv.org/abs/2502.15260
Authors: Renjie Wei,Songqiang Xu,Linfeng Zhong,Zebin Yang,Qingyu Guo,Yuan Wang,Runsheng Wang,Meng Li
Affiliations: Institute for Artificial Intelligence, School of Integrated Circuits, Peking University, Beijing, China; Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China; Institute of Electronic Design Automation, Peking University, Wuxi, China; School of Software and Microelectronics, Peking University, Beijing, China; School of Electronic and Computer Engineering, Peking University, Shenzhen, China
Categories: Computation and Language (cs.CL)
Comments: Accepted by DATE 2025

Abstract:State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65x to 6.06x higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43x that of the GPU baseline.
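To make the idea of a power-of-two scale concrete, here is a minimal sketch, assuming a simple symmetric per-tensor scheme. It is not LightMamba's actual algorithm, and the names `pot_quantize` and `pot_dequantize` are hypothetical. Restricting the scale to 2**exp means rescaling can be done with a bit shift in hardware instead of a multiplier.

```python
import math


def pot_quantize(values, bits=4):
    """Symmetric quantization with a scale restricted to a power of two.

    Returns (integer codes, exponent) where the scale is 2.0 ** exponent.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for signed 4-bit codes
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # Smallest exponent whose scale still covers the largest magnitude.
    exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exp
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, exp


def pot_dequantize(codes, exp):
    """Map integer codes back to real values (a shift by exp in hardware)."""
    return [c * 2.0 ** exp for c in codes]


codes, exp = pot_quantize([14.0, -6.0, 2.0])
print(codes, exp, pot_dequantize(codes, exp))
```

The trade-off is that a power-of-two scale can be up to 2x coarser than an unconstrained scale, which is the price paid for shift-only rescaling.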

[NLP-56] A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation

[Quick read]: This paper addresses user privacy protection during remote use of cloud-based large language models (LLMs). Its key contribution is the first detailed definition of a general pseudonymization framework applicable to cloud-based LLMs; experiments indicate that the framework strikes an optimal balance between privacy protection and utility.

Link: https://arxiv.org/abs/2502.15233
Authors: Shilong Hou,Ruilin Shang,Zi Long,Xianghua Fu,Yin Chen
Affiliations: College of Application and Technology, Shenzhen University; College of Big Data and Internet, Shenzhen Technology University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: under review

Abstract:An increasing number of companies have begun providing services that leverage cloud-based large language models (LLMs), such as ChatGPT. However, this development raises substantial privacy concerns, as users’ prompts are transmitted to and processed by the model providers. Among the various privacy protection methods for LLMs, those implemented during the pre-training and fine-tuning phases fail to mitigate the privacy risks associated with the remote use of cloud-based LLMs by users. On the other hand, methods applied during the inference phase are primarily effective in scenarios where the LLM’s inference does not rely on privacy-sensitive information. In this paper, we outline the process of remote user interaction with LLMs and, for the first time, propose a detailed definition of a general pseudonymization framework applicable to cloud-based LLMs. The experimental results demonstrate that the proposed framework strikes an optimal balance between privacy protection and utility. The code for our method is available to the public at this https URL.
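The general shape of a pseudonymization round trip can be sketched as follows: replace sensitive spans with placeholders before the prompt leaves the device, then map the placeholders back in the response. This is an illustrative sketch under simplifying assumptions (a regex for e-mail addresses and a caller-supplied name list), not the framework defined in the paper; `pseudonymize` and `restore` are hypothetical names.

```python
import re

# Deliberately simple e-mail pattern; real detectors are far more thorough.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(text, known_names=()):
    """Replace e-mails and known names with placeholders; return (text, mapping)."""
    mapping = {}   # placeholder -> original value
    counters = {}  # entity kind -> number of placeholders issued so far

    def placeholder(kind, value):
        # Reuse the placeholder if this value was already seen.
        for ph, original in mapping.items():
            if original == value:
                return ph
        counters[kind] = counters.get(kind, 0) + 1
        ph = f"<{kind}_{counters[kind]}>"
        mapping[ph] = value
        return ph

    text = EMAIL_RE.sub(lambda m: placeholder("EMAIL", m.group(0)), text)
    for name in known_names:
        if name in text:
            text = text.replace(name, placeholder("NAME", name))
    return text, mapping


def restore(text, mapping):
    """Put the original values back into text returned by the remote model."""
    for ph, original in mapping.items():
        text = text.replace(ph, original)
    return text


masked, mapping = pseudonymize(
    "Alice Chen asked bob@example.com to call Alice Chen back.", ["Alice Chen"]
)
print(masked)
print(restore(masked, mapping))
```

Only `masked` would be sent to the cloud model; the mapping stays on the client, so the provider never sees the original values.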

[NLP-57] Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews

[Quick read]: This paper investigates what real users think of mainstream Large Language Models (LLMs). It presents CLUE, an LLM-powered interviewer that conducts user experience interviews immediately after users interact with LLMs and automatically gathers user opinions from large volumes of interview logs. The key is that CLUE captures feedback in the moment, surfacing users' specific views and needs for different LLMs, such as divided opinions on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality.

Link: https://arxiv.org/abs/2502.15226
Authors: Mengqiao Liu,Tevin Wang,Cassandra A. Cohen,Sarah Li,Chenyan Xiong
Affiliations: Amazon; School of Computer Science, Carnegie Mellon University; University of California, Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews, right after users interact with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then be interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, for example, the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our collected chat-and-interview logs will be released.

[NLP-58] A BERT Based Hybrid Recommendation System For Academic Collaboration

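[Quick read]: This paper addresses the problem that conventional networking approaches in large academic institutions become inefficient as the institution grows. It proposes a personalized recommendation system for academia that uses a hybrid model combining Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers (BERT) to connect students and faculty with similar interests, and applies Affinity Propagation clustering to relabel the unlabelled dataset. The hybrid model achieves an optimal balance between diversity and relevance and has been implemented as a mobile application that dynamically recommends relevant academic profiles based on users' skills and collaboration interests.

Link: https://arxiv.org/abs/2502.15223
Authors: Sangeetha N,Harish Thangaraj,Varun Vashisht,Eshaan Joshi,Kanishka Verma,Diya Katariya
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: International Conference on Intelligent Systems and Security - 2024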

Abstract:Universities serve as a hub for academic collaboration, promoting the exchange of diverse ideas and perspectives among students and faculty through interdisciplinary dialogue. However, as universities expand in size, conventional networking approaches via student chapters, class groups, and faculty committees become cumbersome. To address this challenge, an academia-specific profile recommendation system is proposed to connect like-minded stakeholders within any university community. This study evaluates three techniques: Term Frequency-Inverse Document Frequency (TF-IDF), Bidirectional Encoder Representations from Transformers (BERT), and a hybrid approach to generate effective recommendations. Due to the unlabelled nature of the dataset, Affinity Propagation cluster-based relabelling is performed to understand the grouping of similar profiles. The hybrid model demonstrated superior performance, evidenced by its similarity score, Silhouette score, Davies-Bouldin index, and Normalized Discounted Cumulative Gain (NDCG), achieving an optimal balance between diversity and relevance in recommendations. Furthermore, the optimal model has been implemented as a mobile application, which dynamically suggests relevant profiles based on users’ skills and collaboration interests, incorporating contextual understanding. The potential impact of this application is significant, as it promises to enhance networking opportunities within large academic institutions through the deployment of intelligent recommendation systems.

[NLP-59] ESPnet-SpeechLM: An Open Speech Language Model Toolkit

[Quick read]: This paper addresses the accessibility and efficiency of developing Speech Language Models (SpeechLMs). The key is the ESPnet-SpeechLM toolkit, which standardizes speech processing tasks as universal sequential modeling problems and provides a unified workflow covering data preprocessing, pre-training, inference, and task evaluation. Users can easily define task templates and configure key settings, and highly configurable modules at every stage of the workflow provide flexibility, efficiency, and scalability.

Link: https://arxiv.org/abs/2502.15218
Authors: Jinchuan Tian,Jiatong Shi,William Chen,Siddhant Arora,Yoshiki Masuyama,Takashi Maekaku,Yihan Wu,Junyi Peng,Shikhar Bharadwaj,Yiwen Zhao,Samuele Cornell,Yifan Peng,Xiang Yue,Chao-Han Huck Yang,Graham Neubig,Shinji Watanabe
Affiliations: Carnegie Mellon University; Mitsubishi Electric Research Laboratories; LY Corporation; Renmin University of China; Brno University of Technology; NVIDIA
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: this https URL.

[NLP-60] The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning

[Quick read]: This survey reviews how Large Language Models (LLMs) and Vision-Language Models (VLMs) are integrated into Reinforcement Learning (RL) to overcome key challenges, including lack of prior knowledge, long-horizon planning, and reward design. Its key contribution is a taxonomy that groups these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. This gives a framework for integrating LLMs and VLMs into RL and advances approaches that unify natural language and visual understanding with sequential decision-making.

Link: https://arxiv.org/abs/2502.15214
Authors: Sheila Schoepp,Masoud Jafaripour,Yingyue Cao,Tianpei Yang,Fatemeh Abdollahi,Shadan Golestan,Zahin Sufiyan,Osmar R. Zaiane,Matthew E. Taylor
Affiliations: University of Alberta; Nanjing University; Alberta Machine Intelligence Institute
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 4 figures

Abstract:Reinforcement learning (RL) has shown impressive results in sequential decision-making tasks. Meanwhile, Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged, exhibiting impressive capabilities in multimodal understanding and reasoning. These advances have led to a surge of research integrating LLMs and VLMs into RL. In this survey, we review representative works in which LLMs and VLMs are used to overcome key challenges in RL, such as lack of prior knowledge, long-horizon planning, and reward design. We present a taxonomy that categorizes these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. We conclude by exploring open problems, including grounding, bias mitigation, improved representations, and action advice. By consolidating existing research and identifying future directions, this survey establishes a framework for integrating LLMs and VLMs into RL, advancing approaches that unify natural language and visual understanding with sequential decision-making.

[NLP-61] PairBench: A Systematic Framework for Selecting Reliable Judge VLMs

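[Quick read]: This paper addresses how well large Vision Language Models (VLMs), when used as automated evaluators, compare data pairs as instructed in the prompt. It proposes PairBench, a framework that systematically evaluates VLMs as similarity tools across modalities and scenarios using four key metrics, helping to characterize how these models behave in practice and where their limits lie.

Link: https://arxiv.org/abs/2502.15210
Authors: Aarash Feizi,Sai Rajeswar,Adriana Romero-Soriano,Reihaneh Rabbany,Spandana Gella,Valentina Zantedeschi,João Monteiro
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: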

Abstract:As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator’s desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.
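One of the four desiderata, consistency irrespective of pair order, can be illustrated with a small check. This is a hedged sketch and not PairBench's metric definition: it assumes similarity scores in [0, 1] and a hypothetical `score(a, b)` callable standing in for a judge model.

```python
def order_consistency(score, pairs):
    """1.0 means score(a, b) always equals score(b, a); lower means less symmetric.

    Assumes score returns values in [0, 1].
    """
    gaps = [abs(score(a, b) - score(b, a)) for a, b in pairs]
    return 1.0 - sum(gaps) / len(gaps)


def symmetric_judge(a, b):
    # Toy judge: Jaccard overlap of character sets, symmetric by construction.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def biased_judge(a, b):
    # Toy judge whose score depends on which item comes first.
    return len(a) / (len(a) + len(b))


pairs = [("aa", "bbbbbb"), ("cat", "cart")]
print(order_consistency(symmetric_judge, pairs))
print(order_consistency(biased_judge, pairs))
```

A judge that scores far below 1.0 on such a check is sensitive to presentation order, which is the failure the abstract reports for the majority of VLMs.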

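[NLP-62] Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing

[Quick read]: This paper studies the long-term behavior of Large Language Models (LLMs) during iterative text generation and shows that it converges to stable periodic states. The key is applying dynamical systems theory to the iterative mapping that an LLM performs. In successive paraphrasing experiments, LLMs would in principle explore diverse paraphrases, yet in practice they converge to stable periodic states such as 2-period attractor cycles, which limits linguistic diversity. The paper attributes this to the self-reinforcing nature of LLMs, which favor and amplify certain textual forms over iterations; the pattern persists even with increased generation randomness or with alternating prompts and LLMs. The findings highlight inherent constraints on LLM generative capability and offer a new perspective for studying their expressive potential.

Link: https://arxiv.org/abs/2502.15208
Authors: Zhilin Wang,Yafu Li,Jianhao Yan,Yu Cheng,Yue Zhang
Affiliations: Shanghai AI Laboratory; Westlake University; Zhejiang University; Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: 9 pages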

Abstract:Dynamical systems theory provides a framework for analyzing iterative processes and evolution over time. Within such systems, repetitive transformations can lead to stable configurations, known as attractors, including fixed points and limit cycles. Applying this perspective to large language models (LLMs), which iteratively map input text to output text, provides a principled approach to characterizing long-term behaviors. Successive paraphrasing serves as a compelling testbed for exploring such dynamics, as paraphrases re-express the same underlying meaning with linguistic variation. Although LLMs are expected to explore a diverse set of paraphrases in the text space, our study reveals that successive paraphrasing converges to stable periodic states, such as 2-period attractor cycles, limiting linguistic diversity. This phenomenon is attributed to the self-reinforcing nature of LLMs, as they iteratively favour and amplify certain textual forms over others. This pattern persists with increasing generation randomness or alternating prompts and LLMs. These findings underscore inherent constraints in LLM generative capability, while offering a novel dynamical systems perspective for studying their expressive potential.
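The notion of a periodic attractor can be made concrete with a small cycle detector for any deterministic iterated map. This is a generic sketch, not the paper's analysis: `step` stands in for one paraphrasing call, and the toy lookup table below is a hypothetical replacement for an LLM.

```python
def find_cycle(step, start, max_iters=100):
    """Iterate state = step(state) and report the first repeated state.

    Returns (transient_length, period), or None if no state repeats
    within max_iters steps. States must be hashable (e.g. strings).
    """
    first_seen = {}
    state = start
    for t in range(max_iters):
        if state in first_seen:
            return first_seen[state], t - first_seen[state]
        first_seen[state] = t
        state = step(state)
    return None


# Toy "paraphraser": after one step it alternates between two phrasings.
toy_paraphrase = {
    "The cat sat on the mat.": "A cat was sitting on the mat.",
    "A cat was sitting on the mat.": "On the mat sat a cat.",
    "On the mat sat a cat.": "A cat was sitting on the mat.",
}

print(find_cycle(toy_paraphrase.get, "The cat sat on the mat."))
```

With sampling-based generation the map is stochastic, so exact string repeats are an idealization; a practical analysis would compare successive outputs with a similarity measure instead of equality.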

[NLP-63] TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

[Quick read]: This paper addresses optimizing the total throughput of batch speculative decoding in multi-request settings. The key is that TETRIS actively selects, for each request, the draft tokens most likely to be accepted when verified in parallel, which reduces the number of rejected tokens and makes better use of computing resources. This is particularly relevant to service providers with limited inference capacity who need fast inference in large language models (LLMs). Compared with baseline speculative decoding, TETRIS shows a higher acceptance rate and more effective resource utilization.

Link: https://arxiv.org/abs/2502.15197
Authors: Zhaoxuan Wu,Zijian Zhou,Arun Verma,Alok Prakash,Daniela Rus,Bryan Kian Hsiang Low
Affiliations: Singapore-MIT Alliance for Research and Technology; Dept. of Computer Science, National University of Singapore; CSAIL, Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures, 5 tables

Abstract:We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such an effective resource utilization to achieve fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to a more efficient batch inference in LLMs.
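One way to picture batch-level draft token selection is a greedy pick under a fixed verification budget. The sketch below is an assumption-laden illustration rather than the TETRIS algorithm itself: it assumes each draft token comes with an estimated acceptance probability, that a token can only be accepted if every earlier token in the same request is accepted, and that `budget` is the number of draft tokens the verifier can check per step. The function name is hypothetical.

```python
import heapq


def select_draft_tokens(draft_probs, budget):
    """Choose how many draft tokens to verify per request.

    draft_probs[r][i] is the estimated acceptance probability of the i-th
    draft token of request r. Returns a list of prefix lengths, one per request.
    """
    keep = [0] * len(draft_probs)
    heap = []  # max-heap via negated cumulative acceptance probability
    for r, probs in enumerate(draft_probs):
        if probs:
            heapq.heappush(heap, (-probs[0], r, probs[0]))
    for _ in range(budget):
        if not heap:
            break
        _, r, cum = heapq.heappop(heap)
        keep[r] += 1
        nxt = keep[r]
        if nxt < len(draft_probs[r]):
            # Next token only counts if the whole prefix before it is accepted.
            new_cum = cum * draft_probs[r][nxt]
            heapq.heappush(heap, (-new_cum, r, new_cum))
    return keep


print(select_draft_tokens([[0.9, 0.8, 0.1], [0.5, 0.5], [0.95]], budget=4))
```

Because the cumulative probability never increases along a request, the greedy pick always selects a prefix of each request's draft, so no token is verified without its predecessors.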

[NLP-64] Scale-Free Graph-Language Models

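[Quick read]: This paper addresses two problems: graph generation relies on artificial assumptions, and text embedding requires extensive data annotation. The key solution is a novel Graph-Language Model (GLM) that integrates graph generation and text embedding in a unified framework. For graph generation, it uses the scale-free property of real edge distributions as a structural prior and finds that a simple k-nearest neighbor (KNN) graph approximates this property effectively. For text embedding, it develops a graph-based pseudo-labeler that uses scale-free graphs to provide complementary supervision for language model finetuning.

Link: https://arxiv.org/abs/2502.15189
Authors: Jianglin Lu,Yixuan Liu,Yitian Zhang,Yun Fu
Affiliations: Department of Electrical and Computer Engineering, Northeastern University; Network Science Institute, Northeastern University; Khoury College of Computer Science, Northeastern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: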

Abstract:Graph-language models (GLMs) have demonstrated great potential in graph-based semi-supervised learning. A typical GLM consists of two key stages: graph generation and text embedding, which are usually implemented by inferring a latent graph and finetuning a language model (LM), respectively. However, the former often relies on artificial assumptions about the underlying edge distribution, while the latter requires extensive data annotations. To tackle these challenges, this paper introduces a novel GLM that integrates graph generation and text embedding within a unified framework. Specifically, for graph generation, we leverage an inherent characteristic of real edge distribution–the scale-free property–as a structural prior. We unexpectedly find that this natural property can be effectively approximated by a simple k-nearest neighbor (KNN) graph. For text embedding, we develop a graph-based pseudo-labeler that utilizes scale-free graphs to provide complementary supervision for improved LM finetuning. Extensive experiments on representative datasets validate our findings on the scale-free structural approximation of KNN graphs and demonstrate the effectiveness of integrating graph generation and text embedding with a real structural prior. Our code is available at this https URL.
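To see what the degree structure of a KNN graph looks like, one can build the directed graph and count in-degrees: every node has exactly k outgoing edges, but incoming edges concentrate on a few hub nodes. The sketch below is a brute-force illustration on toy points with squared Euclidean distance, not the paper's construction; `knn_in_degrees` is a hypothetical name.

```python
def knn_in_degrees(points, k):
    """Build a directed KNN graph by brute force and return each node's in-degree.

    points is a list of equal-length coordinate tuples; distance ties are
    broken by the lower node index.
    """
    in_deg = [0] * len(points)
    for i, p in enumerate(points):
        # (squared distance, index) for every other node, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        for _, j in dists[:k]:
            in_deg[j] += 1
    return in_deg


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
print(knn_in_degrees(points, k=1))
```

Plotting the histogram of these in-degrees on log-log axes for a real embedding set is the usual quick check for a heavy-tailed, scale-free-like distribution.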

[NLP-65] BP-GPT: Auditory Neural Decoding Using fMRI-prompted LLM

[Quick read]: This paper addresses decoding language information, in particular semantic information, from functional magnetic resonance imaging (fMRI) signals. Existing methods are typically not end-to-end and avoid using a large language model (LLM) in the fMRI-to-text mapping, which leaves room to explore LLMs in auditory decoding. The key solution is a novel method, Brain Prompt GPT (BP-GPT), which uses the brain representation extracted from fMRI as a prompt so that GPT-2 can decode fMRI signals into stimulus text. By introducing a text prompt and aligning the fMRI prompt to it, BP-GPT extracts a more robust brain prompt and facilitates decoding with the pre-trained LLM. Experiments indicate that using brain representations as prompts to drive an LLM for auditory neural decoding is feasible and effective.

Link: https://arxiv.org/abs/2502.15172
Authors: Xiaoyu Chen,Changde Du,Che Liu,Yizhe Wang,Huiguang He
Affiliations: NeuBCI Group, State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.07840

Abstract:Decoding language information from brain signals represents a vital research area within brain-computer interfaces, particularly in the context of deciphering the semantic information from the fMRI signal. Although existing work uses LLM to achieve this goal, their method does not use an end-to-end approach and avoids the LLM in the mapping of fMRI-to-text, leaving space for the exploration of the LLM in auditory decoding. In this paper, we introduce a novel method, the Brain Prompt GPT (BP-GPT). By using the brain representation that is extracted from the fMRI as a prompt, our method can utilize GPT-2 to decode fMRI signals into stimulus text. Further, we introduce the text prompt and align the fMRI prompt to it. By introducing the text prompt, our BP-GPT can extract a more robust brain prompt and promote the decoding of pre-trained LLM. We evaluate our BP-GPT on the open-source auditory semantic decoding dataset and achieve a significant improvement up to 4.61 on METEOR and 2.43 on BERTScore across all the subjects compared to the state-of-the-art method. The experimental results demonstrate that using brain representation as a prompt to further drive LLM for auditory neural decoding is feasible and effective. The code is available at this https URL.

[NLP-66] mStyleDistance: Multilingual Style Embeddings and their Evaluation

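[Quick read]: This paper addresses multilingual style embeddings. Only English style embeddings have been available so far; the paper proposes Multilingual StyleDistance (mStyleDistance), a model trained with synthetic data and contrastive learning. The model is trained on data from nine languages, and a multilingual STEL-or-Content benchmark is built to assess embedding quality. The authors also apply the embeddings to an authorship verification task involving different languages. mStyleDistance embeddings outperform existing models on the multilingual style benchmarks and generalize well to unseen features and languages.

Link: https://arxiv.org/abs/2502.15168
Authors: Justin Qiu,Jiacheng Zhu,Ajay Patel,Marianna Apidianaki,Chris Callison-Burch
Affiliations: University of Pennsylvania
Categories: Computation and Language (cs.CL)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.12757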

Abstract:Style embeddings are useful for stylistic analysis and style transfer; however, only English style embeddings have been made available. We introduce Multilingual StyleDistance (mStyleDistance), a multilingual style embedding model trained using synthetic data and contrastive learning. We train the model on data from nine languages and create a multilingual STEL-or-Content benchmark (Wegmann et al., 2022) that serves to assess the embeddings’ quality. We also employ our embeddings in an authorship verification task involving different languages. Our results show that mStyleDistance embeddings outperform existing models on these multilingual style benchmarks and generalize well to unseen features and languages. We make our model publicly available at this https URL .

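[NLP-67] Extreme Speech Classification in the Era of LLMs: Exploring Open-Source and Proprietary Models

[Quick read]: This paper addresses the challenge of classifying extreme speech online, specifically by using Large Language Models (LLMs) to classify it automatically. The key is that fine-tuning on domain-specific data significantly improves pre-trained LLMs and helps them adapt to linguistic and contextual nuances. GPT-based models outperform Llama models in zero-shot settings, but the gap disappears after fine-tuning.

Link: https://arxiv.org/abs/2502.15155
Authors: Sarthak Mahajan,Nimmi Rangaswamy
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to 7th International Conference on information systems and management science (ISMS), 2024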

Abstract:In recent years, widespread internet adoption and the growth in userbase of various social media platforms have led to an increase in the proliferation of extreme speech online. While traditional language models have demonstrated proficiency in distinguishing between neutral text and non-neutral text (i.e. extreme speech), categorizing the diverse types of extreme speech presents significant challenges. The task of extreme speech classification is particularly nuanced, as it requires a deep understanding of socio-cultural contexts to accurately interpret the intent of the language used by the speaker. Even human annotators often disagree on the appropriate classification of such content, emphasizing the complex and subjective nature of this task. The use of human moderators also presents a scaling issue, necessitating the need for automated systems for extreme speech classification. The recent launch of ChatGPT has drawn global attention to the potential applications of Large Language Models (LLMs) across a diverse variety of tasks. Trained on vast and diverse corpora, and demonstrating the ability to effectively capture and encode contextual information, LLMs emerge as highly promising tools for tackling this specific task of extreme speech classification. In this paper, we leverage the Indian subset of the extreme speech dataset from Maronikolakis et al. (2022) to develop an effective classification framework using LLMs. We evaluate open-source Llama models against closed-source OpenAI models, finding that while pre-trained LLMs show moderate efficacy, fine-tuning with domain-specific data significantly enhances performance, highlighting their adaptability to linguistic and contextual nuances. Although GPT-based models outperform Llama models in zero-shot settings, the performance gap disappears after fine-tuning.

[NLP-68] Investigating the Adaptive Robustness with Knowledge Conflicts in LLM-based Multi-Agent Systems

[Quick read]: This paper investigates how robust multi-agent systems (MASs) based on Large Language Models (LLMs) are when facing mild or task-critical knowledge conflicts, using four comprehensive metrics designed for the purpose. The key finding is that these systems not only withstand knowledge conflicts but also show some self-repairing capability: they reduce reliance on the conflicting knowledge and adopt alternative solution paths to stay stable. Robustness remains high even under knowledge conflicts, although the self-repairing capability has intrinsic limits.

Link: https://arxiv.org/abs/2502.15153
Authors: Tianjie Ju,Bowen Wang,Hao Fei,Mong-Li Lee,Wynne Hsu,Yun Li,Qianren Wang,Pengzhou Cheng,Zongru Wu,Zhuosheng Zhang,Gongshen Liu
Affiliations: Shanghai Jiao Tong University; National University of Singapore; Cognitive AI Lab
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi-agent systems (MASs). However, the robustness of these LLM-based MASs, especially under knowledge conflicts, remains unclear. In this paper, we design four comprehensive metrics to investigate the robustness of MASs when facing mild or task-critical knowledge conflicts. We first analyze mild knowledge conflicts introduced by heterogeneous agents and find that they do not harm system robustness but instead improve collaborative decision-making. Next, we investigate task-critical knowledge conflicts by synthesizing knowledge conflicts and embedding them into one of the agents. Our results show that these conflicts have surprisingly little to no impact on MAS robustness. Furthermore, we observe that MASs demonstrate certain self-repairing capabilities by reducing their reliance on knowledge conflicts and adopting alternative solution paths to maintain stability. Finally, we conduct ablation studies on the knowledge conflict number, agent number, and interaction rounds, finding that the self-repairing capability of MASs has intrinsic limits, and all findings hold consistently across various factors. Our code is publicly available at this https URL.

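[NLP-69] Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision NAACL2025

[Quick read]: This paper addresses the inconsistent quality of hidden concepts discovered with instruction-following large language models (LLMs), especially when the data is noisy or beyond the LLM's knowledge. The key solution is Instruct-LF, a goal-oriented latent factor discovery system that combines the LLM's instruction-following ability with statistical models to handle large, noisy datasets. Instruct-LF uses LLMs to propose fine-grained, goal-related properties from documents, estimates their presence across the dataset, and applies gradient-based optimization to uncover hidden factors, each represented by a cluster of co-occurring properties. The approach improves downstream task performance and is preferred in human evaluation.

Link: https://arxiv.org/abs/2502.15147
Authors: Zhouhang Xie,Tushar Khot,Bhavana Dalvi Mishra,Harshit Surana,Julian McAuley,Peter Clark,Bodhisattwa Prasad Majumder
Affiliations: University of California, San Diego; Allen Institute for AI
Categories: Computation and Language (cs.CL)
Comments: NAACL 2025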

Abstract:Instruction-following LLMs have recently allowed systems to discover hidden concepts from a collection of unstructured documents based on a natural language description of the purpose of the discovery (i.e., goal). Still, the quality of the discovered concepts remains mixed, as it depends heavily on LLM’s reasoning ability and drops when the data is noisy or beyond LLM’s knowledge. We present Instruct-LF, a goal-oriented latent factor discovery system that integrates LLM’s instruction-following ability with statistical models to handle large, noisy datasets where LLM reasoning alone falls short. Instruct-LF uses LLMs to propose fine-grained, goal-related properties from documents, estimates their presence across the dataset, and applies gradient-based optimization to uncover hidden factors, where each factor is represented by a cluster of co-occurring properties. We evaluate latent factors produced by Instruct-LF on movie recommendation, text-world navigation, and legal document categorization tasks. These interpretable representations improve downstream task performance by 5-52% over the best baselines and were preferred 1.8 times as often as the best alternative, on average, in human evaluation.

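[NLP-70] Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns

[Quick read]: This paper examines how the likelihood that large language models (LLMs) assign to distractors in multiple-choice questions (MCQs) relates to the options students actually choose. Using a collected MCQ dataset with real student response distributions, it asks two core questions: (1) do the distractors students select more often correspond to those to which LLMs assign higher generation likelihood, and (2) when an LLM answers incorrectly, does it tend to pick the wrong option that most students also pick? The results show a moderate correlation between LLM-assigned probabilities and student selections, and when LLMs err they tend to choose the wrong answers that commonly mislead students, a pattern consistent across small and large language models. The key point is that LLMs show potential for identifying confusing distractors, which opens opportunities for automatically generating high-quality distractors and for improving educational assessment development.

Link: https://arxiv.org/abs/2502.15140
Authors: Naiming Liu,Shashank Sonkar,Richard G. Baraniuk
Affiliations: Rice University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: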

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various educational tasks, yet their alignment with human learning patterns, particularly in predicting which incorrect options students are most likely to select in multiple-choice questions (MCQs), remains underexplored. Our work investigates the relationship between LLM generation likelihood and student response distributions in MCQs with a specific focus on distractor selections. We collect a comprehensive dataset of MCQs with real-world student response distributions to explore two fundamental research questions: (1). RQ1 - Do the distractors that students more frequently select correspond to those that LLMs assign higher generation likelihood to? (2). RQ2 - When an LLM selects an incorrect choice, does it choose the same distractor that most students pick? Our experiments reveal moderate correlations between LLM-assigned probabilities and student selection patterns for distractors in MCQs. Additionally, when LLMs make mistakes, they are more likely to select the same incorrect answers that commonly mislead students, which is a pattern consistent across both small and large language models. Our work provides empirical evidence that despite LLMs’ strong performance on generating educational content, there remains a gap between LLM’s underlying reasoning process and human cognitive processes in identifying confusing distractors. Our findings also have significant implications for educational assessment development. The smaller language models could be efficiently utilized for automated distractor generation as they demonstrate similar patterns in identifying confusing answer choices as larger language models. This observed alignment between LLMs and student misconception patterns opens new opportunities for generating high-quality distractors that complement traditional human-designed distractors.
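RQ1 amounts to a rank-correlation question. As a hedged sketch (the abstract does not state which correlation measure the authors use; Spearman's rho is assumed here, and the numbers are made up for illustration), the following computes the rank correlation between LLM-assigned probabilities and student selection rates for the distractors of one question.

```python
def average_ranks(xs):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical numbers for the three distractors of a single question.
llm_probs = [0.5, 0.3, 0.2]
student_rates = [0.6, 0.1, 0.3]
print(spearman(llm_probs, student_rates))
```

Averaging this value over all questions gives a single alignment score; with only three distractors per question, individual values are coarse, so the aggregate is what carries meaning.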

[NLP-71] Chain-of-Rank: Enhancing Large Language Models for Domain-Specific RAG in Edge Device NAACL2025

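[Quick read]: This paper addresses the performance degradation of domain-specific retrieval-augmented generation (RAG) in resource-constrained environments such as edge devices, where small-scale large language models (LLMs) find complex reasoning processes such as chain-of-thought (CoT) computationally expensive and hard to learn. The key solution is Chain of Rank (CoR), which shifts the focus from intricate, lengthy reasoning to simply ranking the reliability of the input external documents. This reduces computational complexity while maintaining high accuracy, making the method particularly suited to resource-constrained environments.

Link: https://arxiv.org/abs/2502.15134
Authors: Juntae Lee,Jihwan Bang,Seunghan Yang,Kyuhong Shim,Simyung Chang
Affiliations: Qualcomm AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NAACL 2025 (Findings)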

Abstract:Retrieval-augmented generation (RAG) with large language models (LLMs) is especially valuable in specialized domains, where precision is critical. To further specialize LLMs to a target domain, domain-specific RAG has recently been developed by allowing the LLM to access the target domain early via finetuning. Domain-specific RAG makes even more sense in resource-constrained environments like edge devices, as they should perform a specific task (e.g. personalization) reliably using only small-scale LLMs. While domain-specific RAG is well-aligned with edge devices in this respect, it often relies on widely-used reasoning techniques like chain-of-thought (CoT). The reasoning step is useful for understanding the given external knowledge, yet it is computationally expensive and difficult for small-scale LLMs to learn. To tackle this, we propose Chain of Rank (CoR), which shifts the focus from intricate lengthy reasoning to simple ranking of the reliability of input external documents. CoR thereby reduces computational complexity while maintaining high accuracy, making it particularly suited for resource-constrained environments. We attain state-of-the-art (SOTA) results on benchmarks and analyze its efficacy.

[NLP-72] CoT-ICL Lab: A Petri Dish for Studying Chain-of-Thought Learning from In-Context Demonstrations

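[Quick read]: This paper addresses how to understand and exploit chain-of-thought (CoT) in-context learning (ICL) in language models. The key contribution is CoT-ICL Lab, a framework and methodology that gives fine-grained control over the complexity of in-context examples by decoupling the causal structure from the underlying token processing functions. Training decoder-only transformers with up to 700M parameters, the study finds that CoT accelerates accuracy gains across model sizes. In particular, model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallower models match the performance of deeper ones. In addition, limiting the diversity of token processing functions during training improves causal structure learning via ICL. Overall, CoT-ICL Lab provides a simple yet powerful testbed for theoretical and empirical study of ICL and CoT in language models.

Link: https://arxiv.org/abs/2502.15132
Authors: Vignesh Kothapalli,Hamed Firooz,Maziar Sanjabi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 27 figures, 3 tables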

Abstract:We introduce CoT-ICL Lab, a framework and methodology to generate synthetic tokenized datasets and systematically study chain-of-thought (CoT) in-context learning (ICL) in language models. CoT-ICL Lab allows fine-grained control over the complexity of in-context examples by decoupling (1) the causal structure involved in chain token generation from (2) the underlying token processing functions. We train decoder-only transformers (up to 700M parameters) on these datasets and show that CoT accelerates the accuracy transition to higher values across model sizes. In particular, we find that model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallow models match deeper model performance. Additionally, limiting the diversity of token processing functions throughout training improves causal structure learning via ICL. We also interpret these transitions by analyzing transformer embeddings and attention maps. Overall, CoT-ICL Lab serves as a simple yet powerful testbed for theoretical and empirical insights into ICL and CoT in language models.

[NLP-73] Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps

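[Quick read]: This paper investigates the in-context learning and reasoning abilities of decoder-only transformer-based language models of different sizes and training data, focusing on the marked performance improvement beyond a specific parameter threshold (about 1.6 billion parameters). The key finding is that fine-tuning with task-specific exemplars substantially improves the reasoning performance of sub-threshold models, enabling them to generate correct chain-of-thought (CoT) even when the prompt contains no additional exemplars. An analysis of attention maps further shows that models able to generate correct CoTs exhibit higher token-level attention scores on the relevant correct tokens, offering interpretability insights into the reasoning process. Together, these findings advance the understanding of reasoning in decoder-only transformer models.

Link: https://arxiv.org/abs/2502.15120
Authors: Yen-Che Hsiao,Abhishek Dutta
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: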

Abstract:This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data, including GPT2, SmolLM2, OpenELM, TinyLlama, Stable LM, and Gemma 2. We identify a critical parameter threshold (~1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning. Specifically, models above this threshold achieve better success rates in chain-of-thought (CoT) prompting for deductive reasoning tasks, especially those requiring longer reasoning chains, such as proof by contradiction and disjunction elimination. To address limitations in sub-threshold models, we demonstrate that fine-tuning with task-specific exemplars substantially enhances reasoning performance, enabling accurate CoT generation even without additional exemplars in the prompt for tasks with shorter reasoning chains. Finally, our analysis of attention maps reveals that models capable of generating correct CoTs exhibit higher token-level attention scores on subsequent correct tokens and the correct parts of speech, providing interpretability insights into reasoning processes. These findings collectively advance understanding of reasoning capabilities in decoder-only transformer-based models. The code is available at: this https URL.

[NLP-74] Social Genome: Grounded Social Reasoning Abilities of Multimodal Models

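[Quick read]: This paper addresses the lack of evaluation of fine-grained, evidence-grounded social reasoning in multimodal models. The key contribution is the Social Genome benchmark, which contains 272 videos of interactions and 1,486 human-annotated reasoning traces comprising 5,777 reasoning steps that draw on visual, verbal, and vocal cues as well as external knowledge. It is the first modeling challenge to study the use of external knowledge in social reasoning, and it measures performance by holistically evaluating the semantic and structural quality of model-generated social reasoning traces.

Link: https://arxiv.org/abs/2502.15109
Authors: Leena Mathur,Marian Qian,Paul Pu Liang,Louis-Philippe Morency
Affiliations: Carnegie Mellon University; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under Review, 22 pages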

Abstract:Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce Social Genome, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models. Social Genome contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferences about these interactions. These traces contain 5,777 reasoning steps that reference evidence from visual cues, verbal cues, vocal cues, and external knowledge (contextual knowledge external to videos). Social Genome is also the first modeling challenge to study external knowledge in social reasoning. Social Genome computes metrics to holistically evaluate semantic and structural qualities of model-generated social reasoning traces. We demonstrate the utility of Social Genome through experiments with state-of-the-art models, identifying performance gaps and opportunities for future research to improve the grounded social reasoning abilities of multimodal models.

[NLP-75] LUME: LLM Unlearning with Multitask Evaluations

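[Quick read]: This paper addresses removing copyrighted, sensitive, or private content from large language models (LLMs) without full retraining. The key contribution is a multi-task unlearning benchmark (LUME) with three tasks: unlearning synthetically generated creative short novels, unlearning synthetic biographies containing sensitive information, and unlearning a collection of public biographies. Using two fine-tuned LLMs with 1B and 7B parameters as target models, the paper evaluates several recently proposed unlearning algorithms to characterize their behavior and limitations.

Link: https://arxiv.org/abs/2502.15097
Authors: Anil Ramakrishna,Yixin Wan,Xiaomeng Jin,Kai-Wei Chang,Zhiqi Bu,Bhanukiran Vinzamuri,Volkan Cevher,Mingyi Hong,Rahul Gupta
Affiliations: Amazon AGI; UCLA; UIUC; EPFL; University of Minnesota
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: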

Abstract:Unlearning aims to remove copyrighted, sensitive, or private content from large language models (LLMs) without a full retraining. In this work, we develop a multi-task unlearning benchmark (LUME) which features three tasks: (1) unlearn synthetically generated creative short novels, (2) unlearn synthetic biographies with sensitive information, and (3) unlearn a collection of public biographies. We further release two fine-tuned LLMs of 1B and 7B parameter sizes as the target models. We conduct detailed evaluations of several recently proposed unlearning algorithms and present results on carefully crafted metrics to understand their behavior and limitations.

[NLP-76] Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models

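[Quick read]: This paper studies the use of large language models (LLMs) to both evaluate and greenwash corporate climate disclosures. It has three parts: first, it applies the LLM-as-a-Judge (LLMJ) methodology to score company-submitted reports on emissions reduction targets and progress; second, it probes how an LLM behaves when prompted to greenwash a response under accuracy and length constraints; finally, it tests the robustness of LLMJ scoring against responses that may have been greenwashed by an LLM. Two LLMJ scoring systems, numerical rating and pairwise comparison, both distinguish high-performing companies from others, with pairwise comparison proving more robust against LLM-greenwashed responses.

Link: https://arxiv.org/abs/2502.15094
Authors: Marianne Chuang,Gabriel Chuang,Cheryl Chuang,John Chuang
Affiliations: UC Santa Cruz; Columbia University; UC Berkeley
Categories: Computation and Language (cs.CL); Applications (stat.AP)
Comments: 16 pages, 12 figures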

Abstract:We study the use of large language models (LLMs) to both evaluate and greenwash corporate climate disclosures. First, we investigate the use of the LLM-as-a-Judge (LLMJ) methodology for scoring company-submitted reports on emissions reduction targets and progress. Second, we probe the behavior of an LLM when it is prompted to greenwash a response subject to accuracy and length constraints. Finally, we test the robustness of the LLMJ methodology against responses that may be greenwashed using an LLM. We find that two LLMJ scoring systems, numerical rating and pairwise comparison, are effective in distinguishing high-performing companies from others, with the pairwise comparison system showing greater robustness against LLM-greenwashed responses.

[NLP-77] Optimizing Singular Spectrum for Large Language Model Compression

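[Quick read]: This paper addresses the deployment difficulties caused by the high parameter complexity of large language models (LLMs). The key contribution is SoCo, a framework that learns to rescale the decomposed components of the singular spectrum, assigning importance scores in a data-driven manner through a progressive training process that moves from coarse compression to fine-grained sparsification. This balances aggressive model compression against performance preservation: components are pruned adaptively, and the amplified importance scores of the remaining components compensate for the loss of the pruned ones.

Link: https://arxiv.org/abs/2502.15092
Authors: Dengjie Li,Tiancheng Shen,Yao Zhou,Baisong Yang,Zhongying Liu,Masheng Yang,Bernard Ghanem,Yibo Yang,Yujie Zhong,Ming-Hsuan Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: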

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, yet prohibitive parameter complexity often hinders their deployment. Existing singular value decomposition (SVD) based compression methods simply treat singular values as importance scores of decomposed components. However, the importance ordering given by singular values does not necessarily correlate with the performance of a downstream task. In this work, we introduce SoCo (Singular spectrum optimization for large language model Compression), a novel compression framework that learns to rescale the decomposed components of SVD in a data-driven manner. Concretely, we employ a learnable diagonal matrix to assign importance scores for the singular spectrum and develop a three-stage training process that progressively refines these scores from initial coarse compression to fine-grained sparsification, thereby striking an effective balance between aggressive model compression and performance preservation. Thanks to the learnable singular spectrum, SoCo adaptively prunes components according to the sparsified importance scores, rather than relying on the fixed order of singular values. More importantly, the remaining components with amplified importance scores can compensate for the loss of the pruned ones. Experimental evaluations across multiple LLMs and benchmarks demonstrate that SoCo surpasses the state-of-the-art methods in model compression.

[NLP-78] Analyze the Neurons, not the Embeddings: Understanding When and Where LLM Representations Align with Humans

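[Quick read]: This paper examines how well the representations learned by modern large language models (LLMs) align with human representations. The key contribution is a novel approach that adopts a method from activation steering research to identify the neurons responsible for specific concepts (e.g., 'cat') and then analyzes the corresponding activation patterns. The approach shows that LLM representations align closely with human representations inferred from behavioral data, and that this alignment exceeds that of the word embeddings central to prior work. It also gives a more granular view of how LLMs represent concepts.

Link: https://arxiv.org/abs/2502.15090
Authors: Masha Fedzechkina,Eleonora Gualdoni,Sinead Williamson,Katherine Metcalf,Skyler Seto,Barry-John Theobald
Affiliations: Apple
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: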

Abstract:Modern large language models (LLMs) achieve impressive performance on some tasks, while exhibiting distinctly non-human-like behaviors on others. This raises the question of how well the LLM’s learned representations align with human representations. In this work, we introduce a novel approach to the study of representation alignment: we adopt a method from research on activation steering to identify neurons responsible for specific concepts (e.g., ‘cat’) and then analyze the corresponding activation patterns. Our findings reveal that LLM representations closely align with human representations inferred from behavioral data. Notably, this alignment surpasses that of word embeddings, which have been center stage in prior work on human and model alignment. Additionally, our approach enables a more granular view of how LLMs represent concepts. Specifically, we show that LLMs organize concepts in a way that reflects hierarchical relationships interpretable to humans (e.g., ‘animal’-‘dog’).

[NLP-79] Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

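[Quick read]: This paper addresses the evaluation of large language model (LLM) safety under user-specific safety standards, for which no benchmark dataset currently exists. To fill this gap, it introduces U-SAFEBENCH, a benchmark for assessing the user-specific aspect of LLM safety. The evaluation finds that current LLMs fail to act safely when user-specific safety standards are considered, which is a new finding. The key remedy proposed is a simple chain-of-thought-based fix, which is shown to improve user-specific safety.

Link: https://arxiv.org/abs/2502.15086
Authors: Yeonjun In,Wonjoong Kim,Kanghoon Yoon,Sungchul Kim,Mehrab Tanjim,Kibum Kim,Chanyoung Park
Affiliations: KAIST; Adobe Research
Categories: Computation and Language (cs.CL)
Comments: Under review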

Abstract:As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety, but they define safety largely by general standards and overlook user-specific standards. However, safety standards for LLMs may vary with user-specific profiles rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user-specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SAFEBENCH, the first benchmark designed to assess the user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals that current LLMs fail to act safely when considering user-specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought, demonstrating its effectiveness in improving user-specific safety. Our benchmark and code are available at this https URL.

[NLP-80] UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning

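[Quick read]: This paper addresses how to balance information removal against preserving a model's other abilities when deleting specified data from pretrained models such as large language models (LLMs). The key contribution is UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework that selectively prunes outliers from the forget set to reduce model degradation, mitigating collateral damage during unlearning. The approach achieves a better balance between deletion efficacy and model preservation.

Link: https://arxiv.org/abs/2502.15082
Authors: Vaidehi Patil,Elias Stengel-Eskin,Mohit Bansal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code: this https URL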

Abstract:User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or “forgetting” a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model’s other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model’s representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. We evaluate UPCORE across three standard unlearning methods consistently achieving a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. We find that UPCORE improves both standard metrics and AUC, benefitting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
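
The pruning idea in this abstract (damage correlates with the variance of the model's representations on the forget set, so outliers are removed) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 2-D vectors stand in for model representations, and dropping the points farthest from the centroid is a simple stand-in for whatever outlier detector the authors actually use, which the abstract does not specify.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def variance(points):
    """Mean squared distance of the points to their centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points) / len(points)


def prune_outliers(points, keep_fraction=0.8):
    """Indices of the points kept after dropping those farthest from the centroid."""
    c = centroid(points)
    n_keep = max(1, int(len(points) * keep_fraction))
    ranked = sorted(range(len(points)), key=lambda i: sq_dist(points[i], c))
    return sorted(ranked[:n_keep])


# Hypothetical forget-set representations; the last point is an obvious outlier.
forget_reprs = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.0, 0.0], [5.0, 5.0]]
core_idx = prune_outliers(forget_reprs, keep_fraction=0.8)
core = [forget_reprs[i] for i in core_idx]
print(core_idx, variance(forget_reprs), variance(core))
```

The retained coreset has a much lower variance than the full forget set, which is the property the abstract links to reduced collateral damage; the unlearning method itself would then be run on the coreset only.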

[NLP-81] Can Hallucination Correction Improve Video-Language Alignment?

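[Quick read]: This paper addresses the tendency of large vision-language models to generate hallucinated descriptions that are inconsistent with their visual inputs. The key contribution is HACA, a self-training framework that learns to correct inconsistencies in descriptions that do not match the video content, thereby strengthening the model's ability to align video and textual representations for spatio-temporal reasoning.

Link: https://arxiv.org/abs/2502.15079
Authors: Lingjun Zhao,Mingyang Xie,Paola Cascante-Bonilla,Hal Daumé III,Kwonjoon Lee
Affiliations: University of Maryland, College Park; Stony Brook University; Honda Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: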

Abstract:Large Vision-Language Models often generate hallucinated content that is not grounded in their visual inputs. While prior work focuses on mitigating hallucinations, we instead explore leveraging hallucination correction as a training objective to improve video-language alignment. We introduce HACA, a self-training framework that learns to correct hallucinations in descriptions that do not align with the video content. By identifying and correcting inconsistencies, HACA enhances the model’s ability to align video and textual representations for spatio-temporal reasoning. Our experimental results show consistent gains in video-caption binding and text-to-video retrieval tasks, demonstrating that hallucination correction-inspired tasks serve as an effective strategy for improving vision and language alignment.

[NLP-82] Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson’s Disease

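[Quick read]: This paper addresses the effectiveness of large language models (LLMs) in identifying rare diseases. Existing clinical decision support systems are of limited use because they lack knowledge of common disorders and are difficult to use. The paper proposes RareScale, which combines the knowledge of expert systems and LLMs: an expert system and an LLM are used jointly to simulate rare disease chats, and this data trains a rare disease candidate predictor. The candidates then serve as additional inputs that help a black-box LLM make the final differential diagnosis. The key is balancing rare and common diagnoses, which improves the baseline Top-5 accuracy of black-box LLMs by over 17%, with candidate generation performance as high as 88.8%.

Link: https://arxiv.org/abs/2502.15069
Authors: Elliot Schumacher,Dhruv Naik,Anitha Kannan
Affiliations: Curai Health
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: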

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. However, their effectiveness in identifying rarer diseases, which are inherently more challenging to diagnose, remains an open question. Rare disease performance is critical with the increasing use of LLMs in healthcare settings. This is especially true if a primary care physician needs to make a rarer prognosis from only a patient conversation so that they can take the appropriate next step. To that end, several clinical decision support systems are designed to support providers in rare disease identification. Yet their utility is limited due to their lack of knowledge of common disorders and difficulty of use. In this paper, we propose RareScale to combine the knowledge of LLMs with expert systems. We jointly use an expert system and an LLM to simulate rare disease chats. This data is used to train a rare disease candidate predictor model. Candidates from this smaller model are then used as additional inputs to a black-box LLM to make the final differential diagnosis. Thus, RareScale allows for a balance between rare and common diagnoses. We present results on over 575 rare diseases, beginning with Abdominal Actinomycosis and ending with Wilson’s Disease. Our approach significantly improves the baseline performance of black-box LLMs by over 17% in Top-5 accuracy. We also find that our candidate generation performance is high (e.g. 88.8% on gpt-4o generated chats).
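
The headline metric in this abstract is Top-5 accuracy: a case counts as correct when the true diagnosis appears among the first five entries of the ranked differential. A minimal sketch of that metric follows; the function name, the example diagnoses, and the ranked lists are all hypothetical and are not taken from the paper's data.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k=5):
    """Fraction of cases whose gold diagnosis appears in the first k predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("need one ranked list per gold label")
    hits = sum(gold in preds[:k]
               for preds, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)


# Hypothetical ranked differentials for three simulated chats.
ranked = [
    ["influenza", "Wilson's disease", "hepatitis A"],
    ["appendicitis", "Crohn's disease", "abdominal actinomycosis"],
    ["migraine", "tension headache", "sinusitis"],
]
gold = ["Wilson's disease", "abdominal actinomycosis", "cluster headache"]

top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Widening `k` can only keep or raise the score, which is why rare diagnoses that a model ranks second or third still count toward Top-5 accuracy.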

[NLP-83] Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation ALT AAAI’25

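[Quick read]: This paper addresses hallucination in multimodal large language models (MLLMs) on vision and text tasks, especially in healthcare, where accuracy of detail is critical. The key contribution is Visual RAG (V-RAG), a retrieval-augmented generation framework that incorporates both text and visual data from retrieved images. Experiments on the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation datasets show that V-RAG improves the accuracy of entity probing, which in turn is used to correct hallucinations and generate more clinically accurate X-ray reports, yielding a higher RadGraph-F1 score.

Link: https://arxiv.org/abs/2502.15040
Authors: Yun-Wei Chu,Kai Zhang,Christopher Malon,Martin Renqiang Min
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: GenAI4Health - AAAI '25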

Abstract:Multimodal Large Language Models (MLLMs) have shown impressive performance in vision and text tasks. However, hallucination remains a major challenge, especially in fields like healthcare where details are critical. In this work, we show how MLLMs may be enhanced to support Visual RAG (V-RAG), a retrieval-augmented generation framework that incorporates both text and visual data from retrieved images. On the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation datasets, we show that Visual RAG improves the accuracy of entity probing, which asks whether a medical entity is grounded by an image. We show that the improvements extend both to frequent and rare entities, the latter of which may have less positive training data. Downstream, we apply V-RAG with entity probing to correct hallucinations and generate more clinically accurate X-ray reports, obtaining a higher RadGraph-F1 score.

[NLP-84] InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

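[Quick read]: This paper addresses the failure of existing benchmarks to adequately test the interactive intelligence of Large Multimodal Models (LMMs) with human users, which is essential for developing general-purpose AI assistants. It proposes InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. It also introduces InterFeedback-Bench, which evaluates the interactive intelligence of 10 different open-source LMMs on two representative datasets, MMMU-Pro and MathVerse. The key is that this framework and its evaluation method systematically test how well LMMs handle human feedback.

Link: https://arxiv.org/abs/2502.15027
Authors: Henry Hengyuan Zhao,Wenqi Pei,Yifei Tao,Haiyang Mei,Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 18 pages, 10 figures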

Abstract:Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art LMMs (such as OpenAI-o1) correct their results through human feedback less than 50% of the time. Our findings point to the need for methods that can enhance the LMMs’ capability to interpret and benefit from feedback.

[NLP-85] A Meta-Evaluation of Style and Attribute Transfer Metrics

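[Quick read]: This paper addresses the evaluation of content preservation in style and attribute transfer. Existing evaluation relies mainly on lexical or semantic similarity metrics, which are not designed to distinguish style changes from content changes. The key contribution is a new zero-shot evaluation method that measures content preservation using the likelihood of the next token, together with the finding that such evaluation must be conditional on the style shift. The method aims to reflect human judgment more faithfully and to ensure fair evaluation of style transfer methods.

Link: https://arxiv.org/abs/2502.15022
Authors: Amalie Brogaard Pauli,Isabelle Augenstein,Ira Assent
Affiliations: Department of Computer Science, Aarhus University; Department of Computer Science, University of Copenhagen
Categories: Computation and Language (cs.CL)
Comments: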

Abstract:LLMs make it easy to rewrite text in any style, be it more polite, persuasive, or more positive. We present a large-scale study of evaluation metrics for style and attribute transfer with a focus on content preservation, meaning that content not attributed to the style shift is preserved. The de facto evaluation approach uses lexical or semantic similarity metrics, often between source sentences and rewrites. While these metrics are not designed to distinguish between style and content differences, empirical meta-evaluation shows a reasonable correlation to human judgment. In fact, recent works find that LLMs prompted as evaluators are only comparable to semantic similarity metrics, even though intuitively, the LLM approach should better fit the task. To investigate this discrepancy, we benchmark 8 metrics for evaluating content preservation on existing datasets and additionally construct a new test set that better aligns with the meta-evaluation aim. Indeed, we then find that the empirical conclusion aligns with the intuition: content preservation metrics for style/attribute transfer must be conditional on the style shift. To support this, we propose a new efficient zero-shot evaluation method using the likelihood of the next token. We hope our meta-evaluation can foster more research on evaluating content preservation metrics and help ensure fair evaluation of methods for conducting style transfer.

[NLP-86] Using tournaments to calculate AUROC for zero-shot classification with LLMs

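[Quick read]: This paper addresses the difficulty of evaluating large language models (LLMs) on zero-shot classification tasks, in particular the difficulty of comparing them fairly with supervised classifiers given their lack of a modifiable decision boundary. The key idea is to convert binary classification tasks into pairwise comparison tasks and to score instances from the resulting relative rankings with the Elo rating system, inducing a confidence ordering over the instances in a dataset. The paper also evaluates scheduling algorithms for minimizing the number of comparisons, showing that the proposed algorithm improves classification performance while providing more information than traditional zero-shot classification.

Link: https://arxiv.org/abs/2502.15018
Authors: Wonjin Yoon,Ian Bulovic,Timothy A. Miller
Affiliations: Boston Children’s Hospital; Harvard Medical School
Categories: Computation and Language (cs.CL)
Comments: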

Abstract:Large language models perform surprisingly well on many zero-shot classification tasks, but are difficult to fairly compare to supervised classifiers due to the lack of a modifiable decision boundary. In this work, we propose and evaluate a method that converts binary classification tasks into pairwise comparison tasks, obtaining relative rankings from LLMs. Repeated pairwise comparisons can be used to score instances using the Elo rating system (used in chess and other competitions), inducing a confidence ordering over instances in a dataset. We evaluate scheduling algorithms for their ability to minimize comparisons, and show that our proposed algorithm leads to improved classification performance, while also providing more information than traditional zero-shot classification.
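
The pipeline described in this abstract (pairwise comparisons, Elo scoring, then a ranking-based metric such as AUROC) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `judge` is a hypothetical stand-in for an LLM comparison call and simply consults toy latent scores, the K-factor of 16 is an arbitrary choice, and the schedule is a plain repeated round-robin rather than the comparison-minimizing scheduler the paper proposes.

```python
import itertools
import random


def elo_update(r_a, r_b, a_wins, k=16.0):
    """Standard Elo update for one game between players rated r_a and r_b."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def judge(i, j, strength):
    """Hypothetical stand-in for an LLM call: True if instance i looks more positive than j."""
    return strength[i] > strength[j]


labels = [1, 1, 1, 0, 0, 0]
strength = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # toy latent scores, not real model output
ratings = [1000.0] * len(labels)

rng = random.Random(0)
pairs = list(itertools.combinations(range(len(labels)), 2))
for _ in range(20):  # several round-robin passes over all pairs
    rng.shuffle(pairs)
    for i, j in pairs:
        ratings[i], ratings[j] = elo_update(ratings[i], ratings[j],
                                            judge(i, j, strength))

score = auroc(ratings, labels)
print(f"AUROC from Elo ratings: {score:.3f}")
```

Because the Elo ratings give every instance a continuous score, a decision threshold can be swept over them, which is what makes a ranking metric like AUROC computable for a zero-shot classifier.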

[NLP-87] Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models

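[Quick read]: This paper addresses verbatim reproduction of copyrighted content by language models. Existing approaches rely on either complete concept removal or simple output filtering, which are either too aggressive or of limited effect. The key contribution is Obliviate, a novel post-training technique that selectively prevents verbatim reproduction of specific text while preserving semantic understanding. Obliviate selects tokens within memorized sequences and modifies the model's probability distribution, reducing verbatim reproduction of copyrighted content without harming overall performance: in the experiments, performance on standard benchmarks stays within 1% of baseline.

Link: https://arxiv.org/abs/2502.15010
Authors: Mark Russinovich,Ahmed Salem
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: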

Abstract:Recent copyright agreements between AI companies and content creators have highlighted the need for precise control over language models’ ability to reproduce copyrighted content. While existing approaches rely on either complete concept removal through unlearning or simple output filtering, we propose Obliviate, a novel post-training technique that selectively prevents verbatim reproduction of specific text while preserving semantic understanding. Obliviate operates by selecting tokens within memorized sequences and modifying the model’s probability distribution to prevent exact reproduction while maintaining contextual understanding. We evaluate Obliviate on multiple large language models (LLaMA-3.1 8B, LLaMA-3.1-instruct 8B, Qwen-2.5-7B, and Yi-1.5 6B) across both synthetic memorization tasks and organic copyright content. Our results demonstrate that Obliviate achieves orders of magnitude reduction, e.g., 100x, in verbatim memorization while maintaining model performance within 1% of baseline on standard benchmarks (HellaSwag, MMLU, TruthfulQA, and Winogrande). This makes Obliviate particularly suitable for practical deployment scenarios where companies need to efficiently address copyright concerns in pretrained models without compromising their general capabilities.

[NLP-88] Contextualizing Search Queries In-Context Learning for Conversational Rewriting with LLMs

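[Quick read]: This paper addresses conversational query rewriting in low-resource settings, where traditional supervised methods struggle for lack of labeled data. The key contribution is Prompt-Guided In-Context Learning, which uses carefully designed prompts combining task descriptions, input/output format specifications, and a small set of examples to guide pre-trained Large Language Models (LLMs) to generate context-independent queries without explicit fine-tuning.

Link: https://arxiv.org/abs/2502.15009
Authors: Raymond Wilson,Chase Carter,Cole Graham
Affiliations: National Energy University
Categories: Computation and Language (cs.CL)
Comments: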

Abstract:Conversational query rewriting is crucial for effective conversational search, yet traditional supervised methods require substantial labeled data, which is scarce in low-resource settings. This paper introduces Prompt-Guided In-Context Learning, a novel approach that leverages the in-context learning capabilities of Large Language Models (LLMs) for few-shot conversational query rewriting. Our method employs carefully designed prompts, incorporating task descriptions, input/output format specifications, and a small set of illustrative examples, to guide pre-trained LLMs to generate context-independent queries without explicit fine-tuning. Extensive experiments on benchmark datasets, TREC and Taskmaster-1, demonstrate that our approach significantly outperforms strong baselines, including supervised models and contrastive co-training methods, across various evaluation metrics such as BLEU, ROUGE-L, Success Rate, and MRR. Ablation studies confirm the importance of in-context examples, and human evaluations further validate the superior fluency, relevance, and context utilization of our generated rewrites. The results highlight the potential of prompt-guided in-context learning as an efficient and effective paradigm for low-resource conversational query rewriting, reducing the reliance on extensive labeled data and complex training procedures.
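
The prompt structure this abstract describes (task description, input/output format specification, a few illustrative examples, then the new conversation) can be sketched as plain string assembly. The function name, field names, wording, and example conversations below are all hypothetical; the abstract does not give the authors' actual prompt template, and no model call is made here.

```python
def build_rewrite_prompt(task_description, format_spec, examples, history, query):
    """Assemble a few-shot prompt for conversational query rewriting."""
    parts = [task_description, format_spec]
    for ex in examples:
        # Each in-context example shows a conversation, the raw query, and its rewrite.
        parts.append(
            "Conversation:\n" + "\n".join(ex["history"])
            + f"\nCurrent query: {ex['query']}\nRewritten query: {ex['rewrite']}"
        )
    # The final block leaves the rewrite empty for the model to complete.
    parts.append(
        "Conversation:\n" + "\n".join(history)
        + f"\nCurrent query: {query}\nRewritten query:"
    )
    return "\n\n".join(parts)


# Hypothetical inputs for illustration only.
examples = [{
    "history": ["User: I'm visiting Paris next week.", "System: Sounds great!"],
    "query": "What about the weather tomorrow?",
    "rewrite": "What is the weather in Paris tomorrow?",
}]
prompt = build_rewrite_prompt(
    task_description="Rewrite the current query so it can be understood without the conversation.",
    format_spec="Answer with a single self-contained query.",
    examples=examples,
    history=["User: Tell me about the Eiffel Tower.", "System: It is a landmark in Paris."],
    query="How tall is it?",
)
print(prompt)
```

The resulting string would be sent to a pre-trained LLM as-is; since no fine-tuning is involved, changing the task only requires swapping the description and examples.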

[NLP-89] LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers NAACL2025

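[Quick read]: This paper quantifies how Large Language Models (LLMs) encode and store contextual information and shows that tokens regarded as minor (e.g., determiners and punctuation) in fact carry important context. The key contribution is the LLM-Microscope toolkit, which assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions, and measures the intrinsic dimensionality of representations, revealing how seemingly trivial tokens matter for long-range understanding.

Link: https://arxiv.org/abs/2502.15007
Authors: Anton Razzhigaev,Matvey Mikhalchuk,Temurbek Rahmatullaev,Elizaveta Goncharova,Polina Druzhinina,Ivan Oseledets,Andrey Kuznetsov
Affiliations: AIRI; Skoltech; HSE University; Lomonosov Moscow State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: accepted to NAACL 2025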

点击查看摘要

Abstract:We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens – especially stopwords, articles, and commas – consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis also shows a strong correlation between contextualization and linearity, where linearity measures how closely the transformation from one layer’s embeddings to the next can be approximated by a single linear mapping. These findings underscore the hidden importance of filler tokens in maintaining context. For further exploration, we present LLM-Microscope, an open-source toolkit that assesses token-level nonlinearity, evaluates contextual memory, visualizes intermediate layer contributions (via an adapted Logit Lens), and measures the intrinsic dimensionality of representations. This toolkit illuminates how seemingly trivial tokens can be critical for long-range understanding.

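[NLP-90] A Socratic RAG Approach to Connect Natural Language Queries on Research Topics with Knowledge Organization Systems AAAI2025

Quick read: This paper addresses how to map users’ natural language queries about research topics to precise, machine-interpretable semantic entities, bridging “little semantics” (domain-specific knowledge organization structures) with “big semantics” (broad bibliometric repositories) so that complex academic taxonomies become more accessible. The key to the solution is an agent that combines Retrieval Augmented Generation (RAG) with Socratic dialogue to align a user’s intuitive understanding of research topics with established Knowledge Organization Systems. The approach is illustrated with a sample application, CollabNext, a person-centric knowledge graph connecting people, organizations, and research topics, with an intentional focus on Historically Black Colleges and Universities (HBCUs) and emerging researchers to raise the visibility of people historically rendered invisible in the science system.

Link: https://arxiv.org/abs/2502.15005
Authors: Lew Lefton,Kexin Rong,Chinar Dankhara,Lila Ghemri,Firdous Kausar,A. Hannibal Hamdallahi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 6 pages, 2 figures, AAAI 2025 Workshop on A Translational Institute for Knowledge Axiomatization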

Abstract:In this paper, we propose a Retrieval Augmented Generation (RAG) agent that maps natural language queries about research topics to precise, machine-interpretable semantic entities. Our approach combines RAG with Socratic dialogue to align a user’s intuitive understanding of research topics with established Knowledge Organization Systems (KOSs). The proposed approach will effectively bridge “little semantics” (domain-specific KOS structures) with “big semantics” (broad bibliometric repositories), making complex academic taxonomies more accessible. Such agents have the potential for broad use. We illustrate with a sample application called CollabNext, which is a person-centric knowledge graph connecting people, organizations, and research topics. We further describe how the application design has an intentional focus on HBCUs and emerging researchers to raise visibility of people historically rendered invisible in the current science system.

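[NLP-91] Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries

Quick read: This paper addresses the evaluation of how Large Language Models (LLMs) handle emotional boundaries. The key to the solution is an open-source benchmark and evaluation framework that scores responses against seven key patterns (direct refusal, apology, explanation, deflection, acknowledgment, boundary setting, and emotional awareness) across six languages. Using a dataset of 1156 prompts, the framework systematically evaluates three leading LLMs (GPT-4o, Claude-3.5 Sonnet, and Mistral-large).

Link: https://arxiv.org/abs/2502.14975
Authors: David Noever,Grant Rosario
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: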

Abstract:We present an open-source benchmark and evaluation framework for assessing emotional boundary handling in Large Language Models (LLMs). Using a dataset of 1156 prompts across six languages, we evaluated three leading LLMs (GPT-4o, Claude-3.5 Sonnet, and Mistral-large) on their ability to maintain appropriate emotional boundaries through pattern-matched response analysis. Our framework quantifies responses across seven key patterns: direct refusal, apology, explanation, deflection, acknowledgment, boundary setting, and emotional awareness. Results demonstrate significant variation in boundary-handling approaches, with Claude-3.5 achieving the highest overall score (8.69/10) and producing longer, more nuanced responses (86.51 words on average). We identified a substantial performance gap between English (average score 25.62) and non-English interactions ( 0.22), with English responses showing markedly higher refusal rates (43.20% vs. 1% for non-English). Pattern analysis revealed model-specific strategies, such as Mistral’s preference for deflection (4.2%) and consistently low empathy scores across all models ( 0.06). Limitations include potential oversimplification through pattern matching, lack of contextual understanding in response analysis, and binary classification of complex emotional responses. Future work should explore more nuanced scoring methods, expand language coverage, and investigate cultural variations in emotional boundary expectations. Our benchmark and methodology provide a foundation for systematic evaluation of LLM emotional intelligence and boundary-setting capabilities.
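Pattern-matched response analysis of the kind the abstract describes can be sketched with regular expressions. The seven pattern names come from the abstract; the keyword lists in `PATTERNS` are hypothetical stand-ins, since the benchmark's actual patterns and scoring weights are not given here.

```python
import re
from typing import Dict

# Pattern names follow the abstract; the regexes themselves are illustrative guesses.
PATTERNS: Dict[str, str] = {
    "direct_refusal": r"\bI (?:can't|cannot|won't|am unable to)\b",
    "apology": r"\b(?:sorry|apologi[sz]e)\b",
    "explanation": r"\b(?:because|as an AI)\b",
    "deflection": r"\b(?:instead|how about|let's talk about)\b",
    "acknowledgment": r"\bI (?:understand|hear you|appreciate)\b",
    "boundary_setting": r"\b(?:boundar(?:y|ies)|not appropriate)\b",
    "emotional_awareness": r"\b(?:feel(?:ing|ings)?|lonely|upset)\b",
}


def match_patterns(response: str) -> Dict[str, bool]:
    """Report, for each named pattern, whether it occurs in the response."""
    return {
        name: bool(re.search(regex, response, flags=re.IGNORECASE))
        for name, regex in PATTERNS.items()
    }
```

English-only keyword lists like these would match almost nothing in other languages, which is one way a gap between English and non-English scores could arise from the method itself; the abstract lists oversimplification through pattern matching among its limitations.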

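[NLP-92] Lost in Space: Optimizing Tokens for Grammar-Constrained Decoding

Quick read: This paper examines how different structured output formats affect the performance of large language models (LLMs) on natural language processing (NLP) tasks. Its key finding, from comparative experiments, is that performance improves by 5%-10% when models are instructed to return tokens with leading whitespace, and that format-based differences are largest for smaller models. In addition, all models classify most accurately when instructed to output real numbers. These findings yield best practices for researchers using language models as zero-shot classifiers with structured output.

Link: https://arxiv.org/abs/2502.14969
Authors: Sil Hamilton,David Mimno
Affiliations: Cornell University
Categories: Computation and Language (cs.CL)
Comments: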

Abstract:General-purpose language models are trained to produce varied natural language outputs, but for some tasks like annotation or classification we need more specific output formats. LLM systems increasingly support structured output, sampling tokens according to a grammar, which enforces a format but which can also reduce performance. We ask whether there are systematic differences between grammars that appear semantically similar to humans. To answer this question, we test four popular model families with five token formats on four NLP benchmarks. All models perform most accurately when instructed to classify with real numbers. Performance also improves by 5%-10% when models are instructed to return tokens incorporating leading whitespace, which we find can help models avoid structural deficiencies in subword token representations. Format-based differences are largest for smaller models that are often used for local laptop-scale inference. We present best practices for researchers using language models as zero-shot classifiers with structured output.

[NLP-93] KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding ACL2025

Quick read: This paper addresses the unique challenges of Arabic optical character recognition (OCR), including cursive script, right-to-left text flow, and complex typographic and calligraphic features. The key contribution is KITAB-Bench, a comprehensive Arabic OCR benchmark of 8,809 samples across 9 major domains and 36 sub-domains, covering handwritten text, structured tables, and 21 chart types for business intelligence. Experiments show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (such as EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER).

Link: https://arxiv.org/abs/2502.14949
Authors: Ahmed Heakl,Abdullah Sohail,Mukul Ranjan,Rania Hossam,Ghazi Ahmed,Mohamed El-Geish,Omar Maher,Zhiqiang Shen,Fahad Khan,Salman Khan
Affiliations: MBZUAI; Monta AI; Linköping University; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, ACL 2025

Abstract:With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
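Character Error Rate, the metric quoted above, is conventionally the character-level edit distance between the OCR output and the reference, divided by the reference length. The sketch below uses that standard definition; whether KITAB-Bench applies any text normalization before scoring is not stated in the abstract, so none is applied here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Because Python strings are sequences of code points, the same code handles Arabic text; for example, dropping one letter from a four-letter reference word gives a CER of 0.25. Lower is better, and the value can exceed 1.0 when the hypothesis is much longer than the reference.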

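[NLP-94] GenAI vs. Human Fact-Checkers: Accurate Ratings, Flawed Rationales

Quick read: This paper evaluates the capabilities and limits of generative AI (GenAI) models in rating and reasoning about the credibility of information. Multiple GenAI models are evaluated on tasks that involve rating, and giving reasons about, the credibility of content posted to Facebook by subnational U.S. politicians. GPT-4o, one of the most used AI models in consumer applications, outperforms the other models, but all models show only moderate agreement with human coders. Crucially, even when GenAI models correctly identify low-credibility content, their reasoning relies heavily on linguistic features and “hard” criteria such as level of detail, source reliability, and language formality, rather than on an understanding of veracity. The study also compares summarized and full content inputs and finds that summarized content can improve efficiency without sacrificing accuracy. GenAI may help human fact-checkers scale misinformation detection, but the results caution against relying solely on these models.

Link: https://arxiv.org/abs/2502.14943
Authors: Yuehong Cassandra Tai,Khushi Navin Patni,Nicholas Daniel Hemauer,Bruce Desmarais,Yu-Ru Lin
Affiliations: Penn State University; University of Pittsburgh
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for publication in the 17th ACM Web Science Conference 2025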

Abstract:Despite recent advances in understanding the capabilities and limits of generative artificial intelligence (GenAI) models, we are just beginning to understand their capacity to assess and reason about the veracity of content. We evaluate multiple GenAI models across tasks that involve the rating of, and perceived reasoning about, the credibility of information. The information in our experiments comes from content that subnational U.S. politicians post to Facebook. We find that GPT-4o, one of the most used AI models in consumer applications, outperforms other models, but all models exhibit only moderate agreement with human coders. Importantly, even when GenAI models accurately identify low-credibility content, their reasoning relies heavily on linguistic features and “hard” criteria, such as the level of detail, source reliability, and language formality, rather than an understanding of veracity. We also assess the effectiveness of summarized versus full content inputs, finding that summarized content holds promise for improving efficiency without sacrificing accuracy. While GenAI has the potential to support human fact-checkers in scaling misinformation detection, our results caution against relying solely on these models.

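[NLP-95] Learning to Retrieve and Reason on Knowledge Graph through Active Self-Reflection

Quick read: This paper addresses the weak performance of large language models (LLMs) when reasoning over structured graph knowledge, and the lack of feedback mechanisms for reflection and correction along the whole reasoning path. The key is an end-to-end trained Active self-Reflection framework for knowledge Graph reasoning (ARG), in which the model uses special tokens to actively decide whether knowledge retrieval is needed, performs reflective critique based on the retrieved knowledge, and reasons iteratively over the structured graph. The resulting reasoning paths are highly interpretable, and the model outperforms existing baselines on knowledge graph reasoning tasks.

Link: https://arxiv.org/abs/2502.14932
Authors: Han Zhang,Langshi Zhou,Hanfang Yang
Affiliations: Center for Applied Statistics, Renmin University of China; School of Statistics, Renmin University of China
Categories: Computation and Language (cs.CL)
Comments: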

Abstract:Extensive research has investigated the integration of large language models (LLMs) with knowledge graphs to enhance the reasoning process. However, understanding how models perform reasoning utilizing structured graph knowledge remains underexplored. Most existing approaches rely on LLMs or retrievers to make binary judgments regarding the utilization of knowledge, which is too coarse. Meanwhile, there is still a lack of feedback mechanisms for reflection and correction throughout the entire reasoning path. This paper proposes an Active self-Reflection framework for knowledge Graph reasoning (ARG), introducing for the first time an end-to-end training approach to achieve iterative reasoning grounded on structured graphs. Within the framework, the model leverages special tokens to actively determine whether knowledge retrieval is necessary, performs reflective critique based on the retrieved knowledge, and iteratively reasons over the knowledge graph. The reasoning paths generated by the model exhibit high interpretability, enabling deeper exploration of the model’s understanding of structured knowledge. Ultimately, the proposed model achieves outstanding results compared to existing baselines in knowledge graph reasoning tasks.

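[NLP-96] A Tale of Two Structures: Do LLMs Capture the Fractal Complexity of Language?

Quick read: This paper investigates whether large language models (LLMs) can replicate the fractal characteristics of natural language and identifies the conditions (such as temperature setting and prompting method) under which they may fail. The key observation is that the fractal parameters of natural language fall within a narrow range while those of LLM output vary widely, which suggests that fractal parameters may help detect a non-trivial portion of LLM-generated texts. Extensive experiments show the findings are robust across architectures (e.g., Gemini 1.0 Pro, Mistral-7B, and Gemma-2B), and the authors release a dataset of over 240,000 articles generated by various LLMs with different decoding temperatures and prompting methods, together with the corresponding human-generated texts.

Link: https://arxiv.org/abs/2502.14924
Authors: Ibrahim Alabdulmohsin,Andreas Steiner
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: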

Abstract:Language exhibits a fractal structure in its information-theoretic complexity (i.e. bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions (such as temperature setting and prompting method) under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs’ output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of the architecture; e.g. Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.

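[NLP-97] AI Thinking as a Meaning-Centered Framework: Reimagining Language Technologies Through Community Agency

Quick read: This paper addresses the failure of current language technologies to handle the complex sociocultural dimensions of linguistic preservation. The key is a meaning-centered framework that shifts technological development from creating tools for communities to co-creating solutions with them, on the view that meaningful solutions emerge from the interplay of cultural understanding, community agency, and technological innovation. The proposal also includes a five-layer technological ecosystem in which communities keep control over the representation of their linguistic and cultural knowledge. This systematic integration of community needs, cultural preservation, and advanced capabilities could change how linguistic diversity is preserved in the digital age.

Link: https://arxiv.org/abs/2502.14923
Authors: Jose F Quesada
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: LT4All 2025. Language Technologies for All - 2025. Advancing Humanism through Language Technologies. Paris (FR), UNESCO Headquarters, 24-26 February 2025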

Abstract:While language technologies have advanced significantly, current approaches fail to address the complex sociocultural dimensions of linguistic preservation. AI Thinking proposes a meaning-centered framework that would transform technological development from creating tools FOR communities to co-creating solutions WITH them. This approach recognizes that meaningful solutions emerge through the interplay of cultural understanding, community agency, and technological innovation. The proposal articulates a holistic methodology and a five-layer technological ecosystem where communities maintain control over their linguistic and cultural knowledge representation. This systematic integration of community needs, cultural preservation, and advanced capabilities could revolutionize how we approach linguistic diversity preservation in the digital age.

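[NLP-98] SIFT: Grounding LLM Reasoning in Contexts via Stickers

Quick read: This paper addresses misinterpretation of context during the reasoning of Large Language Models (LLMs), an issue that affects models from smaller ones such as Llama3.2-3B-Instruct to cutting-edge ones such as DeepSeek-R1 and can lead to calculation errors. It proposes a post-training approach called Stick to the Facts (SIFT), which uses additional inference-time compute to ground LLM reasoning in the context. At the core of SIFT is the Sticker, which the model generates itself to explicitly emphasize the key information in the context. SIFT produces two predictions, one from the original query and one from the query augmented with the Sticker; if they differ, the Sticker is refined sequentially through forward optimization and inverse generation so that the reasoning is more faithful. The approach yields consistent improvements across model sizes and benchmarks; notably, it raises the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to 85.67%, a new state of the art in the open-source community.

Link: https://arxiv.org/abs/2502.14922
Authors: Zihao Zeng,Xuyao Huang,Boxiu Li,Zhijie Deng
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: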

Abstract:This paper identifies the misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase “10 dollars per kilo,” LLMs might not recognize that “per” means “for each,” leading to calculation errors. We introduce a novel, post-training approach called Stick to the Facts (SIFT) to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the Sticker, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions – one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via forward optimization (to better align the extracted facts with the query) and inverse generation (to conform with the model’s inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to 85.67%, establishing a new state-of-the-art in the open-source community. The code is available at this https URL.
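The control flow in the abstract (predict twice, compare, refine the Sticker on disagreement) can be sketched as a small loop. Everything named here is hypothetical: `predict`, `make_sticker`, and `refine_sticker` stand in for LLM calls, `refine_sticker` collapses forward optimization and inverse generation into one step, and returning the Sticker-grounded answer when the round budget runs out is an assumption, not something the abstract specifies.

```python
from typing import Callable


def sift_answer(
    query: str,
    predict: Callable[[str], str],              # hypothetical: LLM answer for a prompt
    make_sticker: Callable[[str], str],         # hypothetical: LLM extracts key facts
    refine_sticker: Callable[[str, str], str],  # hypothetical: one refinement step
    max_rounds: int = 3,
) -> str:
    """Compare the plain and Sticker-grounded predictions; refine on disagreement."""
    plain = predict(query)
    sticker = make_sticker(query)
    for _ in range(max_rounds):
        grounded = predict(query + "\n" + sticker)
        if grounded == plain:
            # Both views of the problem agree, so accept the answer.
            return grounded
        # Disagreement: the Sticker (or the plain reading) is off, so refine it.
        sticker = refine_sticker(query, sticker)
    # Assumption: fall back to the answer grounded in the final Sticker.
    return predict(query + "\n" + sticker)
```

Each round costs extra model calls, which matches the abstract's framing of SIFT as spending more inference-time compute in exchange for more faithful reasoning.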

[NLP-99] The Canary’s Echo: Auditing Privacy Risks of LLM-Generated Synthetic Text

Quick read: This paper studies how much information about training samples can be inferred from synthetic data generated by Large Language Models (LLMs), and designs data-based membership inference attacks (MIAs) to assess privacy risk when only the synthetic data is released. The key is to exploit the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix, which leave detectable traces in the synthetic data, strengthening data-based MIAs and giving a better assessment of privacy risk.

Link: https://arxiv.org/abs/2502.14921
Authors: Matthieu Meeus,Lukas Wutschitz,Santiago Zanella-Béguelin,Shruti Tople,Reza Shokri
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:How much information about training samples can be gleaned from synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we design membership inference attacks (MIAs) that target data used to fine-tune pre-trained LLMs that are then used to synthesize data, particularly when the adversary does not have access to the fine-tuned model but only to the synthetic data. We show that such data-based MIAs do significantly better than a random guess, meaning that synthetic data leaks information about the training data. Further, we find that canaries crafted to maximize vulnerability to model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released. Such out-of-distribution canaries have limited influence on the model’s output when prompted to generate useful, in-distribution synthetic data, which drastically reduces their vulnerability. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix that leave detectable traces in synthetic data. This enhances the power of data-based MIAs and provides a better assessment of the privacy risks of releasing synthetic data generated by LLMs.
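The canary shape described above, an in-distribution prefix followed by a high-perplexity suffix, can be illustrated with a toy constructor. This is only a sketch under loud assumptions: the function `make_canary` is hypothetical, whitespace-separated words stand in for tokens, and uniformly random words stand in for "high perplexity", whereas the paper would measure perplexity with an actual language model.

```python
import random
from typing import List


def make_canary(
    in_distribution_text: str,
    vocabulary: List[str],
    prefix_words: int,
    suffix_words: int,
    seed: int,
) -> str:
    """Join a natural-looking prefix with a random (hard to predict) suffix."""
    rng = random.Random(seed)
    # Prefix: the first few words of a realistic, in-distribution record.
    prefix = in_distribution_text.split()[:prefix_words]
    # Suffix: uniformly sampled words, used here as a crude proxy for high perplexity.
    suffix = [rng.choice(vocabulary) for _ in range(suffix_words)]
    return " ".join(prefix + suffix)
```

The intuition given in the abstract is that a fully out-of-distribution canary has little influence on in-distribution synthetic output, while a natural prefix lets the memorized suffix leave detectable traces in that output.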

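[NLP-100] MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs

Quick read: This paper addresses the challenges of automatic International Classification of Diseases (ICD) coding for Chinese electronic medical records (EMRs). The main problems are the difficulty of extracting disease code-related information from Chinese EMRs, and the failure of previous methods to use disease-based multi-axial knowledge or to associate it with the corresponding clinical evidence. The key is a new framework, MKE-Coder, which combines multi-axial knowledge with evidence verification. MKE-Coder first identifies candidate codes for the diagnosis and categorizes each into knowledge under four coding axes, then retrieves the corresponding clinical evidence and filters credible evidence with a scoring model. Finally, an inference module based on a masked language modeling strategy checks the validity of each candidate code and provides recommendations accordingly.

Link: https://arxiv.org/abs/2502.14916
Authors: Xinxin You,Xien Liu,Xue Yang,Ziyi Wang,Ji Wu
Affiliations: Tsinghua University; Tsinghua-iFlytek Joint Laboratory, Iflytek
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: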

Abstract:The task of automatically coding the International Classification of Diseases (ICD) in the medical field has been well-established and has received much attention. Automatic coding of the ICD in the medical field has been successful in English but faces challenges when dealing with Chinese electronic medical records (EMRs). The first issue lies in the difficulty of extracting disease code-related information from Chinese EMRs, primarily due to the concise writing style and specific internal structure of the EMRs. The second problem is that previous methods have failed to leverage the disease-based multi-axial knowledge and lack of association with the corresponding clinical evidence. This paper introduces a novel framework called MKE-Coder: Multi-axial Knowledge with Evidence verification in ICD coding for Chinese EMRs. Initially, we identify candidate codes for the diagnosis and categorize each of them into knowledge under four coding axes. Subsequently, we retrieve corresponding clinical evidence from the comprehensive content of EMRs and filter credible evidence through a scoring model. Finally, to ensure the validity of the candidate code, we propose an inference module based on the masked language modeling strategy. This module verifies that all the axis knowledge associated with the candidate code is supported by evidence and provides recommendations accordingly. To evaluate the performance of our framework, we conduct experiments using a large-scale Chinese EMR dataset collected from various hospitals. The experimental results demonstrate that MKE-Coder exhibits significant superiority in the task of automatic ICD coding based on Chinese EMRs. In the practical evaluation of our method within simulated real coding scenarios, it has been demonstrated that our approach significantly aids coders in enhancing both their coding accuracy and speed.

[NLP-101] What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs

Quick read: This paper addresses the limitations of existing visual captioning benchmarks, which mainly evaluate short descriptions with outdated metrics. The key contribution is CV-CapBench (Comprehensive Visual Caption Benchmark), which systematically evaluates caption quality across 6 views and 13 dimensions and introduces precision, recall, and hit rate metrics for each dimension to assess both correctness and coverage. Experiments reveal significant capability gaps in leading multimodal large language models, particularly in dynamic and knowledge-intensive dimensions.

Link: https://arxiv.org/abs/2502.14914
Authors: Zhihang Liu,Chen-Wei Xie,Bin Wen,Feiwu Yu,Jixuan Chen,Boqiang Zhang,Nianzu Yang,Pandeng Li,Yun Zheng,Hongtao Xie
Affiliations: University of Science and Technology of China; Alibaba Group; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have rendered traditional visual captioning benchmarks obsolete, as they primarily evaluate short descriptions with outdated metrics. While recent benchmarks address these limitations by decomposing captions into visual elements and adopting model-based evaluation, they remain incomplete, overlooking critical aspects while providing vague, non-explanatory scores. To bridge this gap, we propose CV-CapBench, a Comprehensive Visual Caption Benchmark that systematically evaluates caption quality across 6 views and 13 dimensions. CV-CapBench introduces precision, recall, and hit rate metrics for each dimension, uniquely assessing both correctness and coverage. Experiments on leading MLLMs reveal significant capability gaps, particularly in dynamic and knowledge-intensive dimensions. These findings provide actionable insights for future research. The code and data will be released.
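How precision and recall separate correctness from coverage can be shown with a toy set-based calculation. This sketch assumes a caption has already been decomposed into a set of visual elements for one dimension; the function `precision_recall` is hypothetical, and the benchmark's hit rate metric is left out because its definition is not given in the abstract.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Precision: share of mentioned elements that are correct.
    Recall: share of reference elements that the caption covers."""
    if not predicted or not reference:
        raise ValueError("both element sets must be non-empty")
    matched = len(predicted & reference)
    return matched / len(predicted), matched / len(reference)
```

A caption that mentions {dog, ball, car} against a reference of {dog, ball, grass, sky} scores precision 2/3 (one hallucinated element) and recall 1/2 (two elements missed), so the two numbers penalize different failure modes.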

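[NLP-102] OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment

Quick read: This paper addresses the factors that limit multi-agent collaborative Large Language Models (LLMs) on the Text-to-SQL task: incompleteness of the framework, failure to follow instructions, and model hallucination. It proposes OpenSearch-SQL, which divides the task into four main modules (Preprocessing, Extraction, Generation, and Refinement) plus an Alignment module based on a consistency alignment mechanism. The Alignment module aligns the inputs and outputs of the agents, reducing instruction-following failures and hallucination. The authors also design an intermediate language called SQL-Like, optimize the structured chain of thought (CoT) based on it, and develop a dynamic few-shot strategy in the form of self-taught Query-CoT-SQL. These methods significantly improve LLM performance on Text-to-SQL.

Link: https://arxiv.org/abs/2502.14913
Authors: Xiangjin Xie,Guangwei Xu,Lingyan Zhao,Ruijie Guo
Affiliations: Alibaba Cloud
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 15 pages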

Abstract:Although multi-agent collaborative Large Language Models (LLMs) have achieved significant breakthroughs in the Text-to-SQL task, their performance is still constrained by various factors. These factors include the incompleteness of the framework, failure to follow instructions, and model hallucination problems. To address these problems, we propose OpenSearch-SQL, which divides the Text-to-SQL task into four main modules: Preprocessing, Extraction, Generation, and Refinement, along with an Alignment module based on a consistency alignment mechanism. This architecture aligns the inputs and outputs of agents through the Alignment module, reducing failures in instruction following and hallucination. Additionally, we designed an intermediate language called SQL-Like and optimized the structured CoT based on SQL-Like. Meanwhile, we developed a dynamic few-shot strategy in the form of self-taught Query-CoT-SQL. These methods have significantly improved the performance of LLMs in the Text-to-SQL task. In terms of model selection, we directly applied the base LLMs without any post-training, thereby simplifying the task chain and enhancing the framework’s portability. Experimental results show that OpenSearch-SQL achieves an execution accuracy(EX) of 69.3% on the BIRD development set, 72.28% on the test set, and a reward-based validity efficiency score (R-VES) of 69.36%, with all three metrics ranking first at the time of submission. These results demonstrate the comprehensive advantages of the proposed method in both effectiveness and efficiency.
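Execution accuracy (EX), the headline metric above, checks whether a predicted query returns the same result as the gold query when both are run against the database. The sketch below uses Python's built-in `sqlite3` module; comparing result rows as sets and counting a failing query as a miss are assumptions made for this illustration, and the official BIRD evaluation script may differ in detail.

```python
import sqlite3
from typing import Iterable, Tuple


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries run and return the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # Assumption: a predicted query that fails to execute counts as wrong.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return set(predicted_rows) == set(gold_rows)


def execution_accuracy(conn: sqlite3.Connection, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose execution results match."""
    pairs = list(pairs)
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)
```

Because the comparison is on results rather than SQL text, two differently written queries can both count as correct, which is why EX is preferred over exact string match for this task.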

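[NLP-103] Universal Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery

Quick read: This paper addresses the generation of universal semantic embeddings of chemical elements to advance materials inference and discovery. The key is ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers, which captures latent knowledge and contextual relationships specific to alloys. Used as elemental descriptors, these semantic embeddings significantly outperform traditional empirical descriptors on multiple downstream tasks, including predicting mechanical and transformation properties, classifying phase structures, and optimizing materials properties via Bayesian optimization.

Link: https://arxiv.org/abs/2502.14912
Authors: Yunze Jia,Yuehui Xian,Yangyang Xu,Pengfei Dang,Xiangdong Ding,Jun Sun,Yumei Zhou,Dezhen Xue
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 5 figures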

Abstract:We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framework leverages ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers, to capture latent knowledge and contextual relationships specific to alloys. These semantic embeddings serve as robust elemental descriptors, consistently outperforming traditional empirical descriptors with significant improvements across multiple downstream tasks. These include predicting mechanical and transformation properties, classifying phase structures, and optimizing materials properties via Bayesian optimization. Applications to titanium alloys, high-entropy alloys, and shape memory alloys demonstrate up to 23% gains in prediction accuracy. Our results show that ElementBERT surpasses general-purpose BERT variants by encoding specialized alloy knowledge. By bridging contextual insights from scientific literature with quantitative inference, our framework accelerates the discovery and optimization of advanced materials, with potential applications extending beyond alloys to other material classes.

[NLP-104] Batayan: A Filipino NLP benchmark for evaluating Large Language Models ACL2025

Quick read: This paper addresses the under-evaluation of Large Language Models (LLMs) on low-resource languages such as Filipino. The key contribution is Batayan, a holistic benchmark that systematically evaluates LLMs on three key natural language processing (NLP) competencies: understanding, reasoning, and generation. Batayan consolidates eight tasks, and its rigorous, native-speaker-driven annotation process ensures fluency and authenticity with respect to Filipino’s complex morphological and syntactic structures, alleviating the translationese bias found in existing Filipino corpora.

Link: https://arxiv.org/abs/2502.14911
Authors: Jann Railey Montalan,Jimson Paulo Layacan,David Demitri Africa,Richell Isaiah Flores,Michael T. Lopez II,Theresa Denise Magsajo,Anjanette Cayabyab,William Chandra Tjhi
Affiliations: AI Singapore; National University of Singapore; Ateneo de Manila University; University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to ACL 2025

Abstract:Recent advances in large language models (LLMs) have demonstrated remarkable capabilities on widely benchmarked high-resource languages; however, linguistic nuances of under-resourced languages remain unexplored. We introduce Batayan, a holistic Filipino benchmark designed to systematically evaluate LLMs across three key natural language processing (NLP) competencies: understanding, reasoning, and generation. Batayan consolidates eight tasks, covering both Tagalog and code-switched Taglish utterances. Our rigorous, native-speaker-driven annotation process ensures fluency and authenticity to the complex morphological and syntactic structures of Filipino, alleviating a pervasive translationese bias in existing Filipino corpora. We report empirical results on a variety of multilingual LLMs, highlighting significant performance gaps that signal the under-representation of Filipino in pretraining corpora, the unique hurdles in modeling Filipino’s rich morphology and construction, and the importance of explicit Filipino language support and instruction tuning. Moreover, we discuss the practical challenges encountered in dataset construction and propose principled solutions for building culturally and linguistically-faithful resources in under-represented languages. We also provide a public benchmark and leaderboard as a clear foundation for iterative, community-driven progress in Filipino NLP.

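[NLP-105] EvoP: Robust LLM Inference via Evolutionary Pruning

Quick read: This paper addresses the difficulty of deploying large language models (LLMs) in resource-constrained environments. Existing structured pruning methods ease the problem by removing redundant structures, but their heuristic pruning strategies lead to suboptimal performance and they ignore data characteristics. The key is EvoP, an evolutionary pruning framework that comprises a cluster-based calibration dataset sampling (CCDS) strategy for building a more diverse calibration dataset and an evolutionary pruning pattern searching (EPPS) method for finding the optimal pruning pattern, yielding better performance and efficiency.

Link: https://arxiv.org/abs/2502.14910
Authors: Shangyu Wu,Hongchao Du,Ying Xiong,Shuai Chen,Tei-wei Kuo,Nan Guan,Chun Jason Xue
Affiliations: CityU; MBZUAI; Baidu; National Taiwan University; CityU; MBZUAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: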

Abstract:Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing structured pruning methods address this issue by removing redundant structures (e.g., elements, channels, layers) from the model. However, these methods employ a heuristic pruning strategy, which leads to suboptimal performance. Besides, they also ignore the data characteristics when pruning the model. To overcome these limitations, we propose EvoP, an evolutionary pruning framework for robust LLM inference. EvoP first presents a cluster-based calibration dataset sampling (CCDS) strategy for creating a more diverse calibration dataset. EvoP then introduces an evolutionary pruning pattern searching (EPPS) method to find the optimal pruning pattern. Compared to existing structured pruning techniques, EvoP achieves the best performance while maintaining the best efficiency. Experiments across different LLMs and different downstream tasks validate the effectiveness of the proposed EvoP, making it a practical and scalable solution for deploying LLMs in real-world applications.

[NLP-106] KOALA: Knowledge Conflict Augmentations for Robustness in Vision Language Models

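[Quick read]: This paper investigates how conflicts between information sources affect vision language models (VLMs) in multimodal settings, and proposes \segsub, a framework for studying and improving the robustness of VLMs against three types of knowledge conflict: parametric, source, and counterfactual. The key to the approach is applying targeted perturbations to image sources to evaluate VLMs under each conflict type; the authors find that fine-tuning significantly improves reasoning over counterfactual samples.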

Link: https://arxiv.org/abs/2502.14908
Authors: Peter Carragher,Nikitha Rao,Abhinand Jha,R Raghav,Kathleen M. Carley
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The robustness of large language models (LLMs) against knowledge conflicts in unimodal question answering systems has been well studied. However, the effect of conflicts in information sources on vision language models (VLMs) in multimodal settings has not yet been explored. In this work, we propose \segsub, a framework that applies targeted perturbations to image sources to study and improve the robustness of VLMs against three different types of knowledge conflicts, namely parametric, source, and counterfactual conflicts. Contrary to prior findings that showed that LLMs are sensitive to parametric conflicts arising from textual perturbations, we find VLMs are largely robust to image perturbation. On the other hand, VLMs perform poorly on counterfactual examples (30% accuracy) and fail to reason over source conflicts (1% accuracy). We also find a link between hallucinations and image context, with GPT-4o prone to hallucination when presented with highly contextualized counterfactual examples. While challenges persist with source conflicts, finetuning models significantly improves reasoning over counterfactual samples. Our findings highlight the need for VLM training methodologies that enhance their reasoning capabilities, particularly in addressing complex knowledge conflicts between multimodal sources.

[NLP-107] GneissWeb: Preparing High Quality Data for LLMs at Scale

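[Quick read]: This paper addresses the shortage of LLM training data that is sufficient in both quality and quantity. The key contribution is the GneissWeb dataset, which contains around 10 trillion tokens and, through sharded exact substring deduplication and a carefully constructed ensemble of quality filters, achieves a favorable trade-off between data quality and quantity. Experiments show that models trained on GneissWeb outperform models trained on state-of-the-art open datasets across multiple benchmarks.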

Link: https://arxiv.org/abs/2502.14907
Authors: Hajar Emami Gohari,Swanand Ravindra Kadhe,Syed Yousaf Shah,Constantin Adam,Abdulhamid Adebayo,Praneet Adusumilli,Farhan Ahmed,Nathalie Baracaldo Angel,Santosh Borse,Yuan-Chi Chang,Xuan-Hong Dang,Nirmit Desai,Ravital Eres,Ran Iwamoto,Alexei Karve,Yan Koyfman,Wei-Han Lee,Changchang Liu,Boris Lublinsky,Takuyo Ohko,Pablo Pesce,Maroun Touma,Shiqiang Wang,Shalisha Witherspoon,Herbert Woisetschlager,David Wood,Kun-Lung Wu,Issei Yoshida,Syed Zawad,Petros Zerfos,Yi Zhou,Bishwaranjan Bhattacharjee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM’s ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage points advantage over those trained on FineWeb-V1.1.0.

[NLP-108] Beyond Words: Exploring Cultural Value Sensitivity in Multimodal Models

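[Quick read]: This paper examines how well large multimodal models align with cultural values across cultural contexts. Its central question is whether images can serve as reliable proxies for culture, and how cultural values become embedded in a model through the integration of visual and textual data. The study finds that, like LLMs, large vision-language models (VLMs) are sensitive to cultural values, but how well they align with those values depends heavily on context. Although images give VLMs the potential to improve value understanding, this alignment varies significantly across contexts, which highlights the complexity and underexplored challenges of aligning multimodal models.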

Link: https://arxiv.org/abs/2502.14906
Authors: Srishti Yadav,Zhi Zhang,Daniel Hershcovich,Ekaterina Shutova
Affiliations: Dept. of Computer Science, University of Copenhagen, Denmark; ILLC, University of Amsterdam, Netherlands
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Investigating value alignment in Large Language Models (LLMs) based on cultural context has become a critical area of research. However, similar biases have not been extensively explored in large vision-language models (VLMs). As the scale of multimodal models continues to grow, it becomes increasingly important to assess whether images can serve as reliable proxies for culture and how these values are embedded through the integration of both visual and textual data. In this paper, we conduct a thorough evaluation of multimodal models at different scales, focusing on their alignment with cultural values. Our findings reveal that, much like LLMs, VLMs exhibit sensitivity to cultural values, but their performance in aligning with these values is highly context-dependent. While VLMs show potential in improving value understanding through the use of images, this alignment varies significantly across contexts, highlighting the complexities and underexplored challenges in the alignment of multimodal models.

[NLP-109] Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence

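[Quick read]: This paper addresses strict adherence to a structured schema in LLM generation. The key is to leverage the LLM's reasoning capabilities: a novel training pipeline combines synthetic reasoning dataset construction with custom reward functions to train the model's structured reasoning skills under Group Relative Policy Optimization (GRPO). Specifically, the authors first perform R1 reinforcement learning on a 20K-sample unstructured-to-structured dataset to establish core reasoning abilities, then perform supervised fine-tuning on a separate 10K-sample reasoning dataset to refine schema adherence for downstream tasks. Despite the relatively modest training scope, the method shows robust performance in enforcing schema consistency.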

Link: https://arxiv.org/abs/2502.14905
Authors: Bhavik Agarwal,Ishan Joshi,Viktoria Rojkova
Affiliations: MasterControl AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains structured reasoning skills of a 1.5B parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. Subsequently, we performed supervised fine-tuning on a separate 10K reasoning sample dataset, focusing on refining schema adherence for downstream tasks. Despite the relatively modest training scope, requiring approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT, our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.
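
The abstract mentions custom reward functions under GRPO without giving their form. The following is a minimal sketch of what a schema-adherence reward and a group-relative advantage could look like. The schema, the field-matching rule, and the function names `schema_reward` and `group_relative_advantages` are all assumptions made for illustration, not the authors' implementation; the advantage is the usual GRPO-style normalization of each reward by the mean and standard deviation of its sampling group.

```python
import json
import statistics

# Hypothetical schema (an assumption): required field name -> expected Python type.
SCHEMA = {"name": str, "age": int}


def schema_reward(completion: str, schema: dict) -> float:
    """Return 0.0 for output that is not a JSON object, otherwise the
    fraction of schema fields that are present with the expected type."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    matched = sum(1 for key, typ in schema.items() if isinstance(obj.get(key), typ))
    return matched / len(schema)


def group_relative_advantages(rewards: list) -> list:
    """Normalize rewards within one group of sampled completions:
    (reward - group mean) / group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored the same, so none is preferred.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# One group of completions sampled for the same prompt.
group = ['{"name": "Ana", "age": 30}', '{"name": "Ana"}', "not json"]
rewards = [schema_reward(c, SCHEMA) for c in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the fully valid completion receives a positive advantage and the unparsable one a negative advantage, which is the signal a policy-gradient update would use.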

[NLP-110] PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths

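[Quick read]: This paper addresses the redundancy of retrieved information in existing graph-based retrieval-augmented generation (RAG) methods and proposes a new way to structure prompts. The key contribution is PathRAG, which retrieves key relational paths from the indexing graph and converts them into textual form for prompting LLMs. PathRAG uses flow-based pruning to reduce redundant information and path-based prompting to guide LLMs toward more logical and coherent responses. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions.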

Link: https://arxiv.org/abs/2502.14902
Authors: Boyu Chen,Zirui Guo,Zidan Yang,Yuluo Chen,Junze Chen,Zhenghao Liu,Chuan Shi,Cheng Yang
Affiliations: Beijing University of Posts and Telecommunications; University of Hong Kong; Northeastern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-augmented generation (RAG) improves the response quality of large language models (LLMs) by retrieving knowledge from external databases. Typical RAG approaches split the text database into chunks, organizing them in a flat structure for efficient searches. To better capture the inherent dependencies and structured relationships across the text database, researchers propose to organize textual information into an indexing graph, known as graph-based RAG. However, we argue that the limitation of current graph-based RAG methods lies in the redundancy of the retrieved information, rather than its insufficiency. Moreover, previous methods use a flat structure to organize retrieved information within the prompts, leading to suboptimal performance. To overcome these limitations, we propose PathRAG, which retrieves key relational paths from the indexing graph, and converts these paths into textual form for prompting LLMs. Specifically, PathRAG effectively reduces redundant information with flow-based pruning, while guiding LLMs to generate more logical and coherent responses with path-based prompting. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions. The code is available at the following link: this https URL

[NLP-111] Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models

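[Quick read]: This paper addresses the poor readability of digitally archived 19th-century journalism caused by low-quality or missing optical character recognition (OCR). The key is performing OCR with Pixtral 12B, a pre-trained image-to-text language model, which achieves a median character error rate of 1%, five times lower than the next best model. The resulting dataset features improved article identification and text quality, with text classified into four types and seventeen topics.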

Link: https://arxiv.org/abs/2502.14901
Authors: Jonathan Bourne
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments:

Abstract:Oscar Wilde said, “The difference between literature and journalism is that journalism is unreadable, and literature is not read.” Unfortunately, the digitally archived journalism of Oscar Wilde’s 19th century often has no or poor quality Optical Character Recognition (OCR), reducing the accessibility of these archives and making them unreadable both figuratively and literally. This paper helps address the issue by performing OCR on “The Nineteenth Century Serials Edition” (NCSE), an 84k-page collection of 19th-century English newspapers and periodicals, using Pixtral 12B, a pre-trained image-to-text language model. The OCR capability of Pixtral was compared to 4 other OCR approaches, achieving a median character error rate of 1%, 5x lower than the next best model. The resulting NCSE v2.0 dataset features improved article identification, high-quality OCR, and text classified into four types and seventeen topics. The dataset contains 1.4 million entries, and 321 million words. Example use cases demonstrate analysis of topic similarity, readability, and event tracking. NCSE v2.0 is freely available to encourage historical and sociological research. As a result, 21st-century readers can now share Oscar Wilde’s disappointment with 19th-century journalistic standards, reading the unreadable from the comfort of their own computers.
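
The headline metric above is the character error rate (CER). As a rough sketch of how such a figure is commonly computed (an assumption about the general metric, not a description of the paper's evaluation code), CER is the character-level edit distance between the OCR output and a reference transcription, divided by the reference length; the median is then taken over documents. The sample strings below are made up.

```python
import statistics


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of an OCR hypothesis against a reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical (reference, OCR output) pairs.
pairs = [
    ("journalism", "journalisn"),
    ("literature", "literature"),
    ("unreadable", "unreadab1e"),
]
median_cer = statistics.median(cer(ref, hyp) for ref, hyp in pairs)
```

Reporting the median rather than the mean keeps a few badly degraded pages from dominating the summary figure.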

[NLP-112] Can AI mimic the human ability to define neologisms?

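[Quick read]: This paper examines how well AI can define Greek neologisms formed through different word formation processes. It compares human and ChatGPT definitions for three types of Greek neologisms (blends, compounds, and derivatives), measures their agreement, and argues that more advanced semantic networks and contextual learning mechanisms need to be integrated into AI models to improve their interpretation of complex word formations, especially compounds.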

Link: https://arxiv.org/abs/2502.14900
Authors: Georgios P. Georgiou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:One ongoing debate in linguistics is whether Artificial Intelligence (AI) can effectively mimic human performance in language-related tasks. While much research has focused on various linguistic abilities of AI, little attention has been given to how it defines neologisms formed through different word formation processes. This study addresses this gap by examining the degree of agreement between human and AI-generated responses in defining three types of Greek neologisms: blends, compounds, and derivatives. The study employed an online experiment in which human participants selected the most appropriate definitions for neologisms, while ChatGPT received identical prompts. The results revealed fair agreement between human and AI responses for blends and derivatives but no agreement for compounds. However, when considering the majority response among humans, agreement with AI was high for blends and derivatives. These findings highlight the complexity of human language and the challenges AI still faces in capturing its nuances. In particular, they suggest a need for integrating more advanced semantic networks and contextual learning mechanisms into AI models to improve their interpretation of complex word formations, especially compounds.

[NLP-113] Retrieval-augmented systems can be dangerous medical communicators

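[Quick read]: This paper addresses the risk that answers to medical queries from generative AI can be misleading. Even when techniques such as retrieval-augmented generation and citation grounding reduce hallucinations and improve accuracy, these systems can still mislead by decontextualizing facts, omitting critical sources, and reinforcing patient misconceptions or biases. The proposed remedy is to incorporate communication pragmatics and improve comprehension of source documents; these recommendations extend beyond the medical domain.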

Link: https://arxiv.org/abs/2502.14898
Authors: Lionel Wong,Ayman Ali,Raymond Xiong,Shannon Zeijang Shen,Yoon Kim,Monica Agrawal
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Preprint

Abstract:Patients have long sought health information online, and increasingly, they are turning to generative AI to answer their health-related queries. Given the high stakes of the medical domain, techniques like retrieval-augmented generation and citation grounding have been widely promoted as methods to reduce hallucinations and improve the accuracy of AI-generated responses and have been widely adopted into search engines. This paper argues that even when these methods produce literally accurate content drawn from source documents sans hallucinations, they can still be highly misleading. Patients may derive significantly different interpretations from AI-generated outputs than they would from reading the original source material, let alone consulting a knowledgeable clinician. Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems. In particular, we highlight how these models tend to decontextualize facts, omit critical relevant sources, and reinforce patient misconceptions or biases. We propose a series of recommendations – such as the incorporation of communication pragmatics and enhanced comprehension of source documents – that could help mitigate these issues and extend beyond the medical domain.

[NLP-114] Revisiting Financial Sentiment Analysis: A Language Model Approach

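[Quick read]: This paper addresses the reliance of financial sentiment analysis (FSA) on subjectively annotated sentiment labels for predicting market movements. Inferring market impact from human-annotated sentiment is inherently challenging. The paper therefore proposes a market-derived labeling approach that labels tweets according to the ensuing short-term price trend, so that the language model directly captures the relationship between textual signals and market dynamics. The key is this market-driven labeling strategy, combined with prompt-tuning to incorporate market and temporal context, which significantly improves short-term trend prediction accuracy.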

Link: https://arxiv.org/abs/2502.14897
Authors: Hamid Moradi-Kamali,Mohammad-Hossein Rajabi-Ghozlou,Mahdi Ghazavi,Ali Soltani,Amirreza Sattarzadeh,Reza Entezari-Maleki
Affiliations: School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments: 13 pages, 6 figures

Abstract:Financial Sentiment Analysis (FSA) traditionally relies on human-annotated sentiment labels to infer investor sentiment and forecast market movements. However, inferring the potential market impact of words based on human-perceived intentions is inherently challenging. We hypothesize that the historical market reactions to words offer a more reliable indicator of their potential impact on markets than subjective sentiment interpretations by human annotators. To test this hypothesis, a market-derived labeling approach is proposed to assign tweet labels based on ensuing short-term price trends, enabling the language model to capture the relationship between textual signals and market dynamics directly. A domain-specific language model was fine-tuned on these labels, achieving up to an 11% improvement in short-term trend prediction accuracy over traditional sentiment-based benchmarks. Moreover, by incorporating market and temporal context through prompt-tuning, the proposed context-aware language model demonstrated an accuracy of 89.6% on a curated dataset of 227 impactful Bitcoin-related news events with significant market impacts. Aggregating daily tweet predictions into trading signals, our method outperformed traditional fusion models (which combine sentiment-based and price-based predictions). It challenged the assumption that sentiment-based signals are inferior to price-based predictions in forecasting market movements. Backtesting these signals across three distinct market regimes yielded robust Sharpe ratios of up to 5.07 in trending markets and 3.73 in neutral markets. Our findings demonstrate that language models can serve as effective short-term market predictors. This paradigm shift underscores the untapped capabilities of language models in financial decision-making and opens new avenues for market prediction applications.
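
To make the idea of market-derived labeling concrete, here is a small sketch that labels a tweet by the price move that followed it and computes an annualized Sharpe ratio for a series of strategy returns. The label names, the 0.5% threshold, the horizon implied by `price_after`, and the 365-period annualization are illustrative assumptions; the abstract does not specify the paper's actual thresholds, horizons, or backtesting setup.

```python
import math
import statistics


def market_label(price_at_post: float, price_after: float, threshold: float = 0.005) -> str:
    """Label a tweet from the relative price change over a short horizon.
    The threshold and the three label names are assumptions for illustration."""
    change = price_after / price_at_post - 1.0
    if change > threshold:
        return "bullish"
    if change < -threshold:
        return "bearish"
    return "neutral"


def sharpe_ratio(returns: list, periods_per_year: int = 365, risk_free: float = 0.0) -> float:
    """Annualized Sharpe ratio: mean excess return over its sample standard
    deviation, scaled by the square root of the number of periods per year."""
    excess = [r - risk_free for r in returns]
    std = statistics.stdev(excess)
    if std == 0:
        raise ValueError("returns have zero variance")
    return statistics.fmean(excess) / std * math.sqrt(periods_per_year)


labels = [
    market_label(100.0, 101.0),  # price rose 1% after the tweet
    market_label(100.0, 99.0),   # price fell 1%
    market_label(100.0, 100.2),  # move smaller than the threshold
]
daily_returns = [0.01, 0.02, 0.03]  # made-up daily strategy returns
ratio = sharpe_ratio(daily_returns)
```

The point of the labeling function is that no human judgment of sentiment enters the label: the same tweet text would be labeled differently if the market had reacted differently.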

[NLP-115] Automating Customer Needs Analysis: A Comparative Study of Large Language Models in the Travel Industry

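[Quick read]: This paper addresses the extraction of travel customer needs from TripAdvisor posts. It compares several open-source and proprietary LLMs, including GPT-4, Gemini, and the open-source Mistral 7B, on this specialized task. Each model's ability to accurately identify and summarize customer needs is measured with BERTScore, ROUGE, and BLEU. The study finds that the open-source Mistral 7B performs comparably to larger closed models while offering affordability and customization benefits, and that selecting an LLM for customer needs analysis requires weighing model size, resource requirements, and performance metrics.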

Link: https://arxiv.org/abs/2404.17975
Authors: Simone Barandoni,Filippo Chiarello,Lorenzo Cascone,Emiliano Marrale,Salvatore Puccio
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have emerged as powerful tools for many tasks, such as extracting valuable insights from vast amounts of textual data. In this study, we conduct a comparative analysis of LLMs for the extraction of travel customer needs from TripAdvisor posts. Leveraging a diverse range of models, including both open-source and proprietary ones such as GPT-4 and Gemini, we aim to elucidate their strengths and weaknesses in this specialized domain. Through an evaluation process involving metrics such as BERTScore, ROUGE, and BLEU, we assess the performance of each model in accurately identifying and summarizing customer needs. Our findings highlight the efficacy of open-source LLMs, particularly Mistral 7B, in achieving comparable performance to larger closed models while offering affordability and customization benefits. Additionally, we underscore the importance of considering factors such as model size, resource requirements, and performance metrics when selecting the most suitable LLM for customer needs analysis tasks. Overall, this study contributes valuable insights for businesses seeking to leverage advanced NLP techniques to enhance customer experience and drive operational efficiency in the travel industry.

Computer Vision

[CV-0] ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval

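[Quick read]: This paper aims to improve text-to-image retrieval. It introduces a new framework, Enhanced Language-Image Pre-training (ELIP), which uses the text query to predict a set of visual prompts that condition the ViT image encoding. ELIP can be applied to the commonly used CLIP/SigLIP models and to the state-of-the-art BLIP-2 architecture, and global hard sample mining together with the selection and curation of a large-scale dataset makes training feasible with limited computing resources. Experiments show that the enhanced network significantly boosts CLIP/SigLIP performance and outperforms the state-of-the-art BLIP-2 model on text-to-image retrieval.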

Link: https://arxiv.org/abs/2502.15682
Authors: Guanqi Zhan,Yuanpei Liu,Kai Han,Weidi Xie,Andrew Zisserman
Affiliations: VGG, University of Oxford; The University of Hong Kong; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP/SigLIP and the state-of-the-art BLIP-2 architectures. To train the architecture with limited computing resources, we develop a ‘student friendly’ best practice involving global hard sample mining, and selection and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. Benefiting from the novel architecture and data curation, experiments show our enhanced network significantly boosts CLIP/SigLIP performance and outperforms the state-of-the-art BLIP-2 model on text-to-image retrieval.

[CV-1] One-step Diffusion Models with f-Divergence Distribution Matching

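[Quick read]: This paper addresses the slow iterative sampling of diffusion models, which is a particular obstacle for interactive applications. The key contribution is a new $f$-divergence minimization framework, $f$-distill, which generalizes the distribution matching approach to different $f$-divergences with different trade-offs between mode coverage and training variance. The authors derive the gradient of the $f$-divergence between the teacher and student distributions and show that it can be expressed as the product of their score differences and a weighting function determined by their density ratio. With a less mode-seeking divergence, this weighting naturally emphasizes samples with higher density in the teacher distribution. Experiments show that alternative $f$-divergences, such as forward-KL and Jensen-Shannon, outperform existing variational score distillation methods on image generation tasks.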

Link: https://arxiv.org/abs/2502.15681
Authors: Yilun Xu,Weili Nie,Arash Vahdat
Affiliations: NVIDIA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation speed, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher’s distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching which is known to be mode seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution, when using a less mode-seeking divergence. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: this https URL
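
For readers unfamiliar with $f$-divergences, the sketch below shows how reverse-KL, forward-KL, and Jensen-Shannon all arise from one formula, $D_f(p \,\|\, q) = \sum_x q(x)\, f(p(x)/q(x))$, by changing the convex function $f$. It uses small discrete distributions with `p` standing in for the teacher and `q` for the student; this is a textbook illustration under those assumptions, not the paper's training objective or gradient estimator, and the example probabilities are made up.

```python
import math


def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)) for strictly positive p and q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def kl(a, b):
    """KL(a || b) computed directly, used to check the generator functions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


def f_reverse_kl(r):
    # Yields KL(q || p), the mode-seeking divergence.
    return -math.log(r)


def f_forward_kl(r):
    # Yields KL(p || q), which penalizes missing teacher modes.
    return r * math.log(r)


def f_jensen_shannon(r):
    # Yields the Jensen-Shannon divergence between p and q.
    return 0.5 * (r * math.log(r) - (r + 1.0) * math.log((r + 1.0) / 2.0))


p = [0.5, 0.3, 0.2]  # stand-in for the teacher distribution
q = [0.2, 0.5, 0.3]  # stand-in for the student distribution
m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]

reverse_kl = f_divergence(p, q, f_reverse_kl)
forward_kl = f_divergence(p, q, f_forward_kl)
js = f_divergence(p, q, f_jensen_shannon)
js_direct = 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Each generator reproduces the directly computed divergence, which is why a single framework can cover all three objectives by swapping $f$.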

[CV-2] BOSS: Benchmark for Observation Space Shift in Long-Horizon Task

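[Quick read]: This paper addresses Observation Space Shift (OSS), a problem that visual-servoing robots face in long-horizon tasks: the sequential execution of preceding skills shifts the observation space and degrades the performance of subsequent skills. The paper introduces BOSS (a Benchmark for Observation Space Shift) to validate OSS and evaluate its impact on long-horizon tasks. The authors also investigate a potential solution, training each skill on a larger and more visually diverse set of demonstrations, and find that it is not sufficient to resolve OSS.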

Link: https://arxiv.org/abs/2502.15679
Authors: Yue Yang,Linfeng Zhao,Mingyu Ding,Gedas Bertasius,Daniel Szafir
Affiliations: The University of North Carolina at Chapel Hill; Northeastern University
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robotics has long sought to develop visual-servoing robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway for achieving this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To validate OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: “Single Predicate Shift”, “Accumulated Predicate Shift”, and “Skill Chaining”, each designed to assess a different aspect of OSS’s negative effect. We evaluated several recent popular IL algorithms on BOSS, including three Behavioral Cloning methods and the Visual Language Action model OpenVLA. Even on the simplest challenge, we observed average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate a potential solution to OSS that scales up the training data for each skill with a larger and more visually diverse set of demonstrations, with our results showing it is not sufficient to resolve OSS. The project page is: this https URL

[CV-3] VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

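[Quick read]: This paper explores the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM). The key is to capture the semantics and dynamics of driving scenes through video pre-training and to use the learned representations to generate driving trajectories. Together the two models form a complete perception-to-action pipeline. The results indicate that video-based pre-training holds promise for autonomous driving, and they highlight the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluation.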

Link: https://arxiv.org/abs/2502.15672
Authors: Florent Bartoccioni,Elias Ramzi,Victor Besnier,Shashanka Venkataramanan,Tuan-Hung Vu,Yihong Xu,Loick Chambon,Spyros Gidaris,Serkan Odabas,David Hurych,Renaud Marlet,Alexandre Boulch,Mickael Chen,Éloi Zablocki,Andrei Bursuc,Eduardo Valle,Matthieu Cord
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Code and model: this https URL, project page: this https URL

Abstract:We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at this https URL

[CV-4] Logit Disagreement: OoD Detection with Bayesian Neural Networks ECCV2024

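[Quick read]: This paper addresses the performance of Bayesian neural networks (BNNs) in uncertainty quantification and out-of-distribution detection. The key is a new way to estimate epistemic uncertainty in BNNs: measuring the disagreement between corrected versions of the pre-softmax quantities (logits) as an estimate of epistemic uncertainty for BNNs under mean field variational inference. The proposed scores outperform mutual information across a range of out-of-distribution detection experiments and perform on par with the Bayesian benchmark, predictive entropy.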

Link: https://arxiv.org/abs/2502.15648
Authors: Kevin Raina
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Presented at ECCV 2024 Workshop: 3rd Workshop on Uncertainty Quantification for Computer Vision

Abstract:Bayesian neural networks (BNNs), which estimate the full posterior distribution over model parameters, are well-known for their role in uncertainty quantification and its promising application in out-of-distribution (OoD) detection. Amongst other uncertainty measures, BNNs provide a state-of-the-art estimation of predictive entropy (total uncertainty) which can be decomposed as the sum of mutual information and expected entropy. In the context of OoD detection the estimation of predictive uncertainty in the form of the predictive entropy score confounds aleatoric and epistemic uncertainty, the latter being hypothesized to be high for OoD points. Despite these justifications, the mutual information score has been shown to perform worse than predictive entropy. Taking inspiration from Bayesian variational autoencoder (BVAE) literature, this work proposes to measure the disagreement between a corrected version of the pre-softmax quantities, otherwise known as logits, as an estimate of epistemic uncertainty for Bayesian NNs under mean field variational inference. The three proposed epistemic uncertainty scores demonstrate marked improvements over mutual information on a range of OoD experiments, with equal performance otherwise. Moreover, the epistemic uncertainty scores perform on par with the Bayesian benchmark predictive entropy on a range of MNIST and CIFAR10 experiments.
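
The decomposition mentioned in the abstract, predictive entropy equals mutual information plus expected entropy, can be checked numerically. The sketch below computes the three quantities from class probabilities produced by several posterior samples. It illustrates only this standard decomposition, which serves as the baseline; it is not the paper's logit-disagreement scores, and the example probabilities are made up.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_decomposition(sample_probs):
    """sample_probs: one class-probability vector per posterior sample.
    Returns (predictive entropy, expected entropy, mutual information)."""
    n = len(sample_probs)
    mean_probs = [sum(column) / n for column in zip(*sample_probs)]
    total = entropy(mean_probs)                            # total uncertainty
    expected = sum(entropy(p) for p in sample_probs) / n   # aleatoric part
    return total, expected, total - expected               # remainder is epistemic


# Posterior samples that agree: little epistemic uncertainty.
agree = uncertainty_decomposition([[0.9, 0.1], [0.9, 0.1]])
# Posterior samples that confidently disagree: high epistemic uncertainty.
disagree = uncertainty_decomposition([[1.0, 0.0], [0.0, 1.0]])
```

When the samples agree, mutual information is near zero even though each prediction carries some entropy; when they confidently disagree, all of the predictive entropy is attributed to mutual information.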
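The decomposition mentioned in the abstract (predictive entropy as the sum of mutual information and expected entropy) can be sketched in a few lines. This is a minimal illustration of that standard decomposition only, not the paper's logit-disagreement scores; the function names and the toy posterior samples are assumptions made for the example.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def uncertainty_decomposition(sample_probs):
    """sample_probs: one softmax vector per posterior sample (hypothetical input)."""
    n = len(sample_probs)
    k = len(sample_probs[0])
    mean_p = [sum(p[c] for p in sample_probs) / n for c in range(k)]
    total = entropy(mean_p)                                # predictive entropy
    aleatoric = sum(entropy(p) for p in sample_probs) / n  # expected entropy
    epistemic = total - aleatoric                          # mutual information
    return total, aleatoric, epistemic

# Toy inputs (assumed): posterior samples that agree vs. samples that disagree.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

print(uncertainty_decomposition(agree))
print(uncertainty_decomposition(disagree))
```

When the samples agree, the mutual information term is zero and all uncertainty is aleatoric; when they disagree, the mutual information term becomes positive.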

[CV-5] Para-Lane: Multi-Lane Dataset Registering Parallel Scans for Benchmarking Novel View Synthesis

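[Quick read]: This paper addresses the evaluation of end-to-end autonomous driving systems in cross-lane scenarios. To that end, it presents a new multi-lane dataset built for novel driving view synthesis from real-world scans. The dataset comprises 25 groups of associated sequences, covering 16,000 front-view images, 64,000 surround-view images, and 16,000 LiDAR frames, with every frame labeled to separate moving objects from static elements. The paper also provides a method for solving and assessing the quality of multi-sensor poses for multi-modal data alignment, which supports building such datasets in real-world settings.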

Link: https://arxiv.org/abs/2502.15635
Authors: Ziqian Ni, Sicong Du, Zhenghua Hou, Chenming Wu, Sheng Yang
Affiliations: Autonomous Driving Lab, CaiNiao Inc., Alibaba Group; Baidu Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:To evaluate end-to-end autonomous driving systems, a simulation environment based on Novel View Synthesis (NVS) techniques is essential, which synthesizes photo-realistic images and point clouds from previously recorded sequences under new vehicle poses, particularly in cross-lane scenarios. Therefore, the development of a multi-lane dataset and benchmark is necessary. While recent synthetic scene-based NVS datasets have been prepared for cross-lane benchmarking, they still lack the realism of captured images and point clouds. To further assess the performance of existing methods based on NeRF and 3DGS, we present the first multi-lane dataset registering parallel scans specifically for novel driving view synthesis dataset derived from real-world scans, comprising 25 groups of associated sequences, including 16,000 front-view images, 64,000 surround-view images, and 16,000 LiDAR frames. All frames are labeled to differentiate moving objects from static elements. Using this dataset, we evaluate the performance of existing approaches in various testing scenarios at different lanes and distances. Additionally, our method provides the solution for solving and assessing the quality of multi-sensor poses for multi-modal data alignment for curating such a dataset in real-world. We plan to continually add new sequences to test the generalization of existing methods across different scenarios. The dataset is released publicly at the project page: this https URL.

[CV-6] RGB-Only Gaussian Splatting SLAM for Unbounded Outdoor Scenes ICRA2025

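[Quick read]: This paper addresses the poor outdoor performance of existing Gaussian Splatting (GS) based methods, which mainly target indoor scenes and rely on RGB-D sensors or pre-trained depth estimation models. Its key solution is OpenGS-SLAM, a Gaussian splatting SLAM method that uses RGB images only. The method first uses a pointmap regression network to generate pointmaps that are consistent between frames for pose estimation. Compared with commonly used depth maps, pointmaps encode spatial relationships and scene geometry across multiple views, which enables more robust camera pose estimation. The estimated camera poses are then integrated with 3DGS rendering in an end-to-end differentiable pipeline, so that camera poses and 3DGS scene parameters are optimized simultaneously, significantly improving tracking accuracy.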

Link: https://arxiv.org/abs/2502.15633
Authors: Sicheng Yu, Chong Cheng, Yifan Zhou, Xiaojun Yang, Hao Wang
Affiliations: The Hong Kong University of Science and Technology (GuangZhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA 2025

Abstract: 3D Gaussian Splatting (3DGS) has become a popular solution in SLAM, as it can produce high-fidelity novel views. However, previous GS-based methods primarily target indoor scenes and rely on RGB-D sensors or pre-trained depth estimation models, hence underperforming in outdoor scenarios. To address this issue, we propose an RGB-only Gaussian splatting SLAM method for unbounded outdoor scenes, OpenGS-SLAM. Technically, we first employ a pointmap regression network to generate consistent pointmaps between frames for pose estimation. Compared to commonly used depth maps, pointmaps include spatial relationships and scene geometry across multiple views, enabling robust camera pose estimation. Then, we propose integrating the estimated camera poses with 3DGS rendering as an end-to-end differentiable pipeline. Our method achieves simultaneous optimization of camera poses and 3DGS scene parameters, significantly enhancing system tracking accuracy. Specifically, we also design an adaptive scale mapper for the pointmap regression network, which provides more accurate pointmap mapping to the 3DGS map representation. Our experiments on the Waymo dataset demonstrate that OpenGS-SLAM reduces tracking error to 9.8% of previous 3DGS methods, and achieves state-of-the-art results in novel view synthesis. Project Page: this https URL

[CV-7] Continual Person Identification using Footstep-Induced Floor Vibrations on Heterogeneous Floor Structures

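[Quick read]: This paper addresses real-time person identification in smart buildings, specifically when data cannot be collected from every person in advance. Existing approaches rely on cameras or wearables, which raise privacy concerns or require people to carry devices. The paper identifies people non-intrusively by analyzing footstep-induced structural vibrations, but this faces high data variability caused by structural heterogeneity and human gait variation, which degrades online identification algorithms. The key to the solution is to quantify and decompose the different sources of variability, then design a feature transformation function that reduces the variability within each person's data and so makes different people's data more separable. Experiments show a 70% reduction in variability and 90% accuracy for online person identification.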

Link: https://arxiv.org/abs/2502.15632
Authors: Yiwen Dong, Hae Young Noh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Applied Physics (physics.app-ph)
Comments:

Abstract:Person identification is important for smart buildings to provide personalized services such as health monitoring, activity tracking, and personnel management. However, previous person identification relies on pre-collected data from everyone, which is impractical in many buildings and public facilities in which visitors are typically expected. This calls for a continual person identification system that gradually learns people’s identities on the fly. Existing studies use cameras to achieve this goal, but they require direct line-of-sight and also have raised privacy concerns in public. Other modalities such as wearables and pressure mats are limited by the requirement of device-carrying or dense deployment. Thus, prior studies introduced footstep-induced structural vibration sensing, which is non-intrusive and perceived as more privacy-friendly. However, this approach has a significant challenge: the high variability of vibration data due to structural heterogeneity and human gait variations, which makes online person identification algorithms perform poorly. In this paper, we characterize the variability in footstep-induced structural vibration data for accurate online person identification. To achieve this, we quantify and decompose different sources of variability and then design a feature transformation function to reduce the variability within each person’s data to make different people’s data more separable. We evaluate our approach through field experiments with 20 people. The results show a 70% variability reduction and a 90% accuracy for online person identification.

[CV-8] WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

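[Quick read]: This paper addresses the extensive labor that highly trained professionals must spend in conventional 3D modeling software to build photorealistic virtual worlds. Its key solution is WorldCraft, a system in which large language model (LLM) agents use procedural generation to create indoor and outdoor scenes populated with objects, while users control individual object attributes and scene layout through intuitive natural language commands. In this framework, a coordinator agent works with two specialized LLM agents to complete scene creation: ForgeIt, which integrates an ever-growing manual through auto-verification to enable precise customization of individual objects, and ArrangeIt, which formulates hierarchical optimization problems to produce layouts that balance ergonomic and aesthetic considerations. The system also includes a trajectory control agent that lets users animate the scene and operate the camera through natural language. Together, these components make scene creation more efficient and flexible for non-professional users.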

Link: https://arxiv.org/abs/2502.15601
Authors: Xinhang Liu, Chi-Keung Tang, Yu-Wing Tai
Affiliations: The Hong Kong University of Science and Technology (HKUST); Dartmouth College
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments:

Abstract:Constructing photorealistic virtual worlds has applications across various fields, but it often requires the extensive labor of highly trained professionals to operate conventional 3D modeling software. To democratize this process, we introduce WorldCraft, a system where large language model (LLM) agents leverage procedural generation to create indoor and outdoor scenes populated with objects, allowing users to control individual object attributes and the scene layout using intuitive natural language commands. In our framework, a coordinator agent manages the overall process and works with two specialized LLM agents to complete the scene creation: ForgeIt, which integrates an ever-growing manual through auto-verification to enable precise customization of individual objects, and ArrangeIt, which formulates hierarchical optimization problems to achieve a layout that balances ergonomic and aesthetic considerations. Additionally, our pipeline incorporates a trajectory control agent, allowing users to animate the scene and operate the camera through natural language interactions. Our system is also compatible with off-the-shelf deep 3D generators to enrich scene assets. Through evaluations and comparisons with state-of-the-art methods, we demonstrate the versatility of WorldCraft, ranging from single-object customization to intricate, large-scale interior and exterior scene designs. This system empowers non-professionals to bring their creative visions to life.

[CV-9] Estimating Vehicle Speed on Roadways Using RNNs and Transformers: A Video-based Approach

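[Quick read]: This paper addresses the limitations of traditional vehicle speed estimation methods such as radar and manual systems, which are constrained by high costs, limited coverage, and potential disruptions. The solution uses Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Transformer models: LSTM and GRU manage long-term dependencies in the temporal sequence of video frames, while the Transformer's self-attention mechanism processes entire sequences in parallel and focuses on the most informative parts of the data. The results indicate that LSTM and GRU outperform basic recurrent neural networks (RNNs) thanks to their gating mechanisms, and that Transformers show strong adaptability and robustness across sequence lengths and complexities, making them suitable for real-time use in diverse traffic conditions.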

Link: https://arxiv.org/abs/2502.15545
Authors: Sai Krishna Reddy Mareddy, Dhanush Upplapati, Dhanush Kumar Antharam
Affiliations: University of North Carolina at Charlotte
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:This project explores the application of advanced machine learning models, specifically Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), and Transformers, to the task of vehicle speed estimation using video data. Traditional methods of speed estimation, such as radar and manual systems, are often constrained by high costs, limited coverage, and potential disruptions. In contrast, leveraging existing surveillance infrastructure and cutting-edge neural network architectures presents a non-intrusive, scalable solution. Our approach utilizes LSTM and GRU to effectively manage long-term dependencies within the temporal sequence of video frames, while Transformers are employed to harness their self-attention mechanisms, enabling the processing of entire sequences in parallel and focusing on the most informative segments of the data. This study demonstrates that both LSTM and GRU outperform basic Recurrent Neural Networks (RNNs) due to their advanced gating mechanisms. Furthermore, increasing the sequence length of input data consistently improves model accuracy, highlighting the importance of contextual information in dynamic environments. Transformers, in particular, show exceptional adaptability and robustness across varied sequence lengths and complexities, making them highly suitable for real-time applications in diverse traffic conditions. The findings suggest that integrating these sophisticated neural network models can significantly enhance the accuracy and reliability of automated speed detection systems, thus promising to revolutionize traffic management and road safety.

[CV-10] Depth-aware Fusion Method based on Image and 4D Radar Spectrum for 3D Object Detection

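[Quick read]: This paper addresses the safety and reliability of autonomous driving in complex environments, in particular the accuracy and robustness of environmental perception in adverse weather. Its key solution is to combine two highly complementary and cost-effective sensors: 4D millimeter-wave radar and camera. Texture-rich images and depth-rich radar data are fused in the Bird's Eye View (BEV) perspective using attention mechanisms, which improves 3D object detection. The paper also proposes GAN-based networks that generate depth images from radar spectra, further improving detection accuracy even when no depth sensor is available.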

Link: https://arxiv.org/abs/2502.15516
Authors: Yue Sun, Yeqiang Qian, Chunxiang Wang, Ming Yang
Affiliations: The Global Institute of Future Technology, Shanghai Jiao Tong University; The Department of Automation, Shanghai Jiao Tong University; Key Laboratory of System Control and Information Processing, Ministry of Education of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Safety and reliability are crucial for the public acceptance of autonomous driving. To ensure accurate and reliable environmental perception, intelligent vehicles must exhibit accuracy and robustness in various environments. Millimeter-wave radar, known for its high penetration capability, can operate effectively in adverse weather conditions such as rain, snow, and fog. Traditional 3D millimeter-wave radars can only provide range, Doppler, and azimuth information for objects. Although the recent emergence of 4D millimeter-wave radars has added elevation resolution, the radar point clouds remain sparse due to Constant False Alarm Rate (CFAR) operations. In contrast, cameras offer rich semantic details but are sensitive to lighting and weather conditions. Hence, this paper leverages these two highly complementary and cost-effective sensors, 4D millimeter-wave radar and camera. By integrating 4D radar spectra with depth-aware camera images and employing attention mechanisms, we fuse texture-rich images with depth-rich radar data in the Bird’s Eye View (BEV) perspective, enhancing 3D object detection. Additionally, we propose using GAN-based networks to generate depth images from radar spectra in the absence of depth sensors, further improving detection accuracy.

[CV-11] Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection

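[Quick read]: This paper addresses the severe performance drop of PETR-based methods under INT8 quantized inference: a degradation of 58.2% in mAP and 36.9% in NDS on the NuScenes dataset. To solve this, it proposes Q-PETR, a quantization-aware position embedding transformation. Q-PETR offers a quantization-friendly and deployment-friendly architecture that preserves PETR's original performance while substantially narrowing the accuracy gap between INT8 and FP32 inference for PETR-series methods. Its key result is that, under standard 8-bit per-tensor post-training quantization, the mAP and NDS drops stay within 1%, and its floating-point performance exceeds that of the original PETR.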

Link: https://arxiv.org/abs/2502.15488
Authors: Jiangyong Yu, Changyong Shu, Dawei Yang, Zichen Yu, Xing Hu, Yan Chen
Affiliations: Houmo AI; Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: PETR-based methods have dominated benchmarks in 3D perception and are increasingly becoming a key component in modern autonomous driving systems. However, their quantization performance significantly degrades when INT8 inference is required, with a degradation of 58.2% in mAP and 36.9% in NDS on the NuScenes dataset. To address this issue, we propose a quantization-aware position embedding transformation for multi-view 3D object detection, termed Q-PETR. Q-PETR offers a quantization-friendly and deployment-friendly architecture while preserving the original performance of PETR. It substantially narrows the accuracy gap between INT8 and FP32 inference for PETR-series methods. Without bells and whistles, our approach reduces the mAP and NDS drop to within 1% under standard 8-bit per-tensor post-training quantization. Furthermore, our method exceeds the performance of the original PETR in terms of floating-point precision. Extensive experiments across a variety of PETR-series models demonstrate its broad generalization.
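To make "8-bit per-tensor post-training quantization" concrete, here is a minimal sketch of generic symmetric per-tensor quantization. It is not Q-PETR's method; the function names and toy tensors are assumptions. It only illustrates why a single scale shared by a whole tensor loses precision when that tensor contains a large outlier.

```python
def quantize_per_tensor(values, num_bits=8):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]

def max_abs_error(values, num_bits=8):
    """Largest round-trip error over the tensor."""
    q, scale = quantize_per_tensor(values, num_bits)
    restored = dequantize(q, scale)
    return max(abs(v - r) for v, r in zip(values, restored))

# Toy tensors (assumed): same small values, with and without an outlier.
well_behaved = [0.01, -0.02, 0.03]
with_outlier = [0.01, -0.02, 0.03, 100.0]

print(max_abs_error(well_behaved))
print(max_abs_error(with_outlier))
```

With the outlier present, the shared scale grows so large that the small values all round to zero, which is the kind of range mismatch that quantization-friendly designs try to avoid.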

[CV-12] Confidence-Based Annotation Of Brain Tumours In Ultrasound

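[Quick read]: This paper addresses the challenge of annotating discrete segmentations of brain tumours in ultrasound, focusing on aleatoric uncertainty along the tumour margin, particularly for diffuse tumours. Its key solution is a sparse confidence method for annotation that incorporates this margin-related uncertainty while minimising interobserver variance through reduced subjectivity, thereby diminishing annotator epistemic uncertainty. The method is based on a protocol designed using computer vision and radiology theory, and experiments support its effectiveness.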

Link: https://arxiv.org/abs/2502.15484
Authors: Alistair Weld, Luke Dixon, Alfie Roddan, Giulio Anichini, Sophie Camp, Stamatia Giannarou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Purpose: An investigation of the challenge of annotating discrete segmentations of brain tumours in ultrasound, with a focus on the issue of aleatoric uncertainty along the tumour margin, particularly for diffuse tumours. A segmentation protocol and method is proposed that incorporates this margin-related uncertainty while minimising the interobserver variance through reduced subjectivity, thereby diminishing annotator epistemic uncertainty. Approach: A sparse confidence method for annotation is proposed, based on a protocol designed using computer vision and radiology theory. Results: Output annotations using the proposed method are compared with the corresponding professional discrete annotation variance between the observers. A linear relationship was measured within the tumour margin region, with a Pearson correlation of 0.8. The downstream application was explored, comparing training using confidence annotations as soft labels with using the best discrete annotations as hard labels. In all evaluation folds, the Brier score was superior for the soft-label trained network. Conclusion: A formal framework was constructed to demonstrate the infeasibility of discrete annotation of brain tumours in B-mode ultrasound. Subsequently, a method for sparse confidence-based annotation is proposed and evaluated. Keywords: Brain tumours, ultrasound, confidence, annotation.
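The Brier score used in the abstract's downstream comparison is the mean squared difference between predicted probabilities and targets, with lower values being better. The sketch below shows the metric on made-up values; the per-pixel confidences, the 0.5 threshold for hard labels, and the predictions are all assumptions for illustration and do not reproduce the paper's evaluation.

```python
def brier_score(predictions, targets):
    """Mean squared difference between predicted probabilities and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Hypothetical per-pixel annotator confidence that a pixel is tumour.
soft_labels = [1.0, 0.7, 0.4, 0.0]
# Hard labels obtained by thresholding the confidence at 0.5 (assumed rule).
hard_labels = [1.0 if c >= 0.5 else 0.0 for c in soft_labels]
# Hypothetical network outputs.
predicted = [0.9, 0.6, 0.5, 0.1]

print(brier_score(predicted, soft_labels))
print(brier_score(predicted, hard_labels))
```

In this toy case the prediction hedges on the two margin pixels, so it scores better against the soft labels than against the thresholded ones.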

[CV-13] On Neural BRDFs: A Thorough Comparison of State-of-the-Art Approaches WACV

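[Quick read]: This paper provides a thorough evaluation of several neural approaches to modeling the bidirectional reflectance distribution function (BRDF), filling the gap left by the lack of a systematic comparison in the literature. It also proposes two extensions: a novel additive combination strategy for neural BRDFs that splits the reflectance into a diffuse and a specular part, and an input mapping that ensures reciprocity exactly by construction, whereas previous approaches only enforce it through soft constraints.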

Link: https://arxiv.org/abs/2502.15480
Authors: Florian Hofherr, Bjoern Haefner, Daniel Cremers
Affiliations: Technical University of Munich; Munich Center for Machine Learning; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

Abstract:The bidirectional reflectance distribution function (BRDF) is an essential tool to capture the complex interaction of light and matter. Recently, several works have employed neural methods for BRDF modeling, following various strategies, ranging from utilizing existing parametric models to purely neural parametrizations. While all methods yield impressive results, a comprehensive comparison of the different approaches is missing in the literature. In this work, we present a thorough evaluation of several approaches, including results for qualitative and quantitative reconstruction quality and an analysis of reciprocity and energy conservation. Moreover, we propose two extensions that can be added to existing approaches: A novel additive combination strategy for neural BRDFs that split the reflectance into a diffuse and a specular part, and an input mapping that ensures reciprocity exactly by construction, while previous approaches only ensure it by soft constraints.
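Reciprocity means the BRDF value is unchanged when the incoming and outgoing directions are swapped. One way to get this exactly "by construction" is to feed the model only features that are themselves invariant under the swap. The sketch below uses a simple lexicographic ordering of the two directions; this is an illustrative choice, not the mapping proposed in the paper, and `toy_brdf` is a hypothetical stand-in for a neural network.

```python
def symmetric_features(w_i, w_o):
    """Order the direction pair so that swapping w_i and w_o gives the same output."""
    lo, hi = sorted([tuple(w_i), tuple(w_o)])
    return lo, hi

def toy_brdf(w_i, w_o):
    """Stand-in for a network: any function of the symmetric features is reciprocal."""
    lo, hi = symmetric_features(w_i, w_o)
    cosine = sum(a * b for a, b in zip(lo, hi))
    return 0.2 + 0.5 * max(0.0, cosine) + 0.1 * lo[2]

# Hypothetical unit directions (x, y, z).
w_in = [0.0, 0.6, 0.8]
w_out = [0.6, 0.0, 0.8]

print(toy_brdf(w_in, w_out))
print(toy_brdf(w_out, w_in))
```

Because the model never sees which direction was "incoming", both calls return the same value regardless of what the function computes afterwards. A soft constraint, by contrast, would only penalize the difference between the two calls during training.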

[CV-14] CondiQuant: Condition Number Based Low-Bit Quantization for Image Super-Resolution

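[Quick read]: This paper addresses the accuracy degradation of low-bit quantization in image super-resolution (SR). It observes that the degradation is mainly attributable to the quantization of activations rather than model weights. Its key solution is CondiQuant, a condition-number-based low-bit post-training quantization method. It formulates the quantization error as the condition number of the weight metrics and designs an efficient proximal gradient descent algorithm that iteratively minimizes the condition number while keeping the output unchanged. The method outperforms existing post-training quantization methods in accuracy without computation overhead and attains the theoretically optimal compression ratio in model parameters.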

Link: https://arxiv.org/abs/2502.15478
Authors: Kai Liu, Dehui Wang, Zhiteng Li, Zheng Chen, Yong Guo, Wenbo Li, Linghe Kong, Yulun Zhang
Affiliations: Shanghai Jiao Tong University; South China University of Technology; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures. Code and models are released at this https URL

Abstract:Low-bit model quantization for image super-resolution (SR) is a longstanding task that is renowned for its surprising compression and acceleration ability. However, accuracy degradation is inevitable when compressing the full-precision (FP) model to ultra-low bit widths (2~4 bits). Experimentally, we observe that the degradation of quantization is mainly attributed to the quantization of activation instead of model weights. In numerical analysis, the condition number of weights could measure how much the output value can change for a small change in the input argument, inherently reflecting the quantization error. Therefore, we propose CondiQuant, a condition number based low-bit post-training quantization for image super-resolution. Specifically, we formulate the quantization error as the condition number of weight metrics. By decoupling the representation ability and the quantization sensitivity, we design an efficient proximal gradient descent algorithm to iteratively minimize the condition number and maintain the output still. With comprehensive experiments, we demonstrate that CondiQuant outperforms existing state-of-the-art post-training quantization methods in accuracy without computation overhead and gains the theoretically optimal compression ratio in model parameters. Our code and model are released at this https URL.
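The condition number the abstract refers to can be computed as the ratio of the largest to the smallest singular value of a matrix. The sketch below does this in closed form for a 2x2 matrix only, to show how an ill-conditioned weight amplifies small input changes; it is not CondiQuant's optimization procedure, and the function name and example matrices are assumptions.

```python
import math

def condition_number_2x2(m):
    """2-norm condition number (sigma_max / sigma_min) of a 2x2 matrix."""
    (a, b), (c, d) = m
    frob_sq = a * a + b * b + c * c + d * d       # sum of squared singular values
    det = a * d - b * c                           # product of singular values (up to sign)
    if det == 0.0:
        return math.inf                           # singular matrix
    disc = max(frob_sq * frob_sq - 4.0 * det * det, 0.0)
    s_max_sq = (frob_sq + math.sqrt(disc)) / 2.0
    s_min_sq = det * det / s_max_sq               # avoids subtractive cancellation
    return math.sqrt(s_max_sq / s_min_sq)

# Hypothetical weight matrices.
well_conditioned = [[1.0, 0.0], [0.0, 1.0]]
ill_conditioned = [[10.0, 0.0], [0.0, 0.1]]

print(condition_number_2x2(well_conditioned))
print(condition_number_2x2(ill_conditioned))
```

A larger condition number means the output can change more, in relative terms, for a small relative change in the input, which is why the abstract ties it to sensitivity to activation quantization error.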

[CV-15] Aligning Task- and Reconstruction-Oriented Communications for Edge Intelligence

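[Quick read]: This paper addresses the shortcomings of existing communication systems in meeting the real-time, task-specific demands of modern AI-driven applications such as autonomous driving and semantic segmentation. Traditional approaches focus on reconstructing information, while emerging task-oriented communications adapt better to specific tasks but typically require joint optimization of the encoder, decoder, and modified inference neural networks, which leads to cross-system redesigns and compatibility issues. The key solution is a new communication framework that extends Information Bottleneck (IB) theory to optimize data transmission by minimizing a task-relevant loss function, while an information reshaper maintains the structure of the original data. It also introduces a joint source-channel coding (JSCC) modulation scheme compatible with classical modulation techniques, which allows AI technologies to be deployed within existing digital infrastructure. Evaluation in an edge-based autonomous driving scenario shows that the framework reduces bits per service by 99.19% compared with existing methods such as JPEG, JPEG2000, and BPG, without compromising task execution.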

Link: https://arxiv.org/abs/2502.15472
Authors: Yufeng Diao, Yichi Zhang, Changyang She, Philip Guodong Zhao, Emma Liying Li
Affiliations: School of Computing Science, University of Glasgow; Department of Computer Science, University of Manchester; James Watt School of Engineering, University of Glasgow; School of Electrical and Information Engineering, University of Sydney
Categories: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted for publication in IEEE Journal on Selected Areas in Communications (JSAC)

Abstract:Existing communication systems aim to reconstruct the information at the receiver side, and are known as reconstruction-oriented communications. This approach often falls short in meeting the real-time, task-specific demands of modern AI-driven applications such as autonomous driving and semantic segmentation. As a new design principle, task-oriented communications have been developed. However, it typically requires joint optimization of encoder, decoder, and modified inference neural networks, resulting in extensive cross-system redesigns and compatibility issues. This paper proposes a novel communication framework that aligns reconstruction-oriented and task-oriented communications for edge intelligence. The idea is to extend the Information Bottleneck (IB) theory to optimize data transmission by minimizing task-relevant loss function, while maintaining the structure of the original data by an information reshaper. Such an approach integrates task-oriented communications with reconstruction-oriented communications, where a variational approach is designed to handle the intractability of mutual information in high-dimensional neural network features. We also introduce a joint source-channel coding (JSCC) modulation scheme compatible with classical modulation techniques, enabling the deployment of AI technologies within existing digital infrastructures. The proposed framework is particularly effective in edge-based autonomous driving scenarios. Our evaluation in the Car Learning to Act (CARLA) simulator demonstrates that the proposed framework significantly reduces bits per service by 99.19% compared to existing methods, such as JPEG, JPEG2000, and BPG, without compromising the effectiveness of task execution.

[CV-16] Game State and Spatio-temporal Action Detection in Soccer using Graph Neural Networks and 3D Convolutional Networks

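[Quick read]: This paper addresses the precise annotation of ball events in soccer matches. Precise and exhaustive annotation from a monocular video stream is currently a tedious and costly manual task. The key solution is a spatio-temporal action detection approach that combines visual and game state information via Graph Neural Networks, trained end-to-end with state-of-the-art 3D CNNs. Integrating game state information improves on purely visual predictions.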

Link: https://arxiv.org/abs/2502.15462
Authors: Jeremie Ochin, Guillaume Devineau, Bogdan Stanciulescu, Sotiris Manitsaris
Affiliations: Centre for Robotics, MINES Paris - PSL, France; Footovision, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Soccer analytics rely on two data sources: the player positions on the pitch and the sequences of events they perform. With around 2000 ball events per game, their precise and exhaustive annotation based on a monocular video stream remains a tedious and costly manual task. While state-of-the-art spatio-temporal action detection methods show promise for automating this task, they lack contextual understanding of the game. Assuming professional players’ behaviors are interdependent, we hypothesize that incorporating surrounding players’ information such as positions, velocity and team membership can enhance purely visual predictions. We propose a spatio-temporal action detection approach that combines visual and game state information via Graph Neural Networks trained end-to-end with state-of-the-art 3D CNNs, demonstrating improved metrics through game state integration.

[CV-17] Memory Helps but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

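[Quick read]: This paper addresses the limited ability of multimodal large language models (MLLMs) to understand streaming videos, in particular how these models can use memories of past events to improve understanding of the current event. Its key contribution is a confabulation-aware memory modification method that mitigates the misinformation introduced when memories rely on predictions of preceding events, improving memory-enhanced event understanding.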

Link: https://arxiv.org/abs/2502.15457
Authors: Gengyuan Zhang, Mingcong Ding, Tong Liu, Yao Zhang, Volker Tresp
Affiliations: Ludwig-Maximilians-Universität München; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Short paper (5 pages)

Abstract: Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos (videos treated as a sequence of visual events) remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as contexts helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.

[CV-18] MVIP – A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition

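[Quick read]: This paper addresses multi-modal and multi-view industrial part recognition. Existing datasets offer a wide range of representations, but industrial applications pose distinct challenges, such as small or large amounts of training data, visually similar parts, and varying object sizes, while requiring a robust top-5 accuracy near 100% under cost and time constraints. The key contribution is the MVIP dataset, which combines calibrated RGBD multi-view data with additional object context such as physical properties, natural language, and super-classes. It is intended to study and advance the transferability of state-of-the-art methods to related downstream tasks, and to foster research on modality fusion, synthetic data generation, and complex data sampling.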

Link: https://arxiv.org/abs/2502.15448
Authors: Paul Koch, Marian Schlüter, Jörg Krüger
Affiliations: Fraunhofer IPK; Technische Universität Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IMPROVE 2025

Abstract:We present MVIP, a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Here we are the first to combine a calibrated RGBD multi-view dataset with additional object context such as physical properties, natural language, and super-classes. The current portfolio of available datasets offers a wide range of representations to design and benchmark related methods. In contrast to existing classification challenges, industrial recognition applications offer controlled multi-modal environments but at the same time have different problems than traditional 2D/3D classification challenges. Frequently, industrial applications must deal with a small amount or increased number of training data, visually similar parts, and varying object sizes, while requiring a robust near 100% top 5 accuracy under cost and time constraints. Current methods tackle such challenges individually, but direct adoption of these methods within industrial applications is complex and requires further research. Our main goal with MVIP is to study and push transferability of various state-of-the-art methods within related downstream tasks towards an efficient deployment of industrial classifiers. Additionally, we intend to push with MVIP research regarding several modality fusion topics, (automated) synthetic data generation, and complex data sampling – combined in a single application-oriented benchmark.

[CV-19] LEAP: Enhancing Vision-Based Occupancy Networks with Lightweight Spatio-Temporal Correlation

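[Quick read]: This paper addresses the limited accuracy of vision-based occupancy networks when reconstructing the surrounding environment, which is caused by occlusions and sparse visual cues. Its key solution is Lightweight Spatio-Temporal Correlation (LEAP), which tokenizes information from recent baseline and motion features into a shared, compact latent space and establishes full correlation through a tri-stream fusion architecture. This significantly improves existing occupancy networks with minimal computational overhead.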

Link: https://arxiv.org/abs/2502.15438
Authors: Fengcheng Yu, Haoran Xu, Canming Xia, Guang Tan
Affiliations: Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-based occupancy networks provide an end-to-end solution for reconstructing the surrounding environment using semantic occupied voxels derived from multi-view images. This technique relies on effectively learning the correlation between pixel-level visual information and voxels. Despite recent advancements, occupancy results still suffer from limited accuracy due to occlusions and sparse visual cues. To address this, we propose a Lightweight Spatio-Temporal Correlation (LEAP) method, which significantly enhances the performance of existing occupancy networks with minimal computational overhead. LEAP can be seamlessly integrated into various baseline networks, enabling a plug-and-play application. LEAP operates in three stages: 1) it tokenizes information from recent baseline and motion features into a shared, compact latent space; 2) it establishes full correlation through a tri-stream fusion architecture; 3) it generates occupancy results that strengthen the baseline’s output. Extensive experiments demonstrate the efficiency and effectiveness of our method, outperforming the latest baseline models. The source code and several demos are available in the supplementary material.

[CV-20] Enhancing Vehicle Make and Model Recognition with 3D Attention Modules

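[Quick read]: This paper addresses the fine-grained classification challenges in vehicle make and model recognition (VMMR), namely inter-class similarity and intra-class variation. Its key solution is an attention module that adds no parameters to the original model and generates three-dimensional attention weights to refine the feature map. The module is integrated into the middle section of a convolutional model to strengthen the focus on critical regions containing distinguishing features. On the Stanford Cars dataset, the method reaches 90.69% accuracy, the highest among the compared models.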

Link: https://arxiv.org/abs/2502.15398
Authors: Narges Semiromizadeh, Omid Nejati Manzari, Shahriar B. Shokouhi, Sattar Mirzakuchaki
Affiliations: School of Electrical Engineering, Iran University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vehicle make and model recognition (VMMR) is a crucial component of the Intelligent Transport System, garnering significant attention in recent years. VMMR has been widely utilized for detecting suspicious vehicles, monitoring urban traffic, and autonomous driving systems. The complexity of VMMR arises from the subtle visual distinctions among vehicle models and the wide variety of classes produced by manufacturers. Convolutional Neural Networks (CNNs), a prominent type of deep learning model, have been extensively employed in various computer vision tasks, including VMMR, yielding remarkable results. As VMMR is a fine-grained classification problem, it primarily faces inter-class similarity and intra-class variation challenges. In this study, we implement an attention module to address these challenges and enhance the model’s focus on critical areas containing distinguishing features. This module, which does not increase the parameters of the original model, generates three-dimensional (3-D) attention weights to refine the feature map. Our proposed model integrates the attention module into two different locations within the middle section of a convolutional model, where the feature maps from these sections offer sufficient information about the input frames without being overly detailed or overly coarse. The performance of our proposed model, along with state-of-the-art (SOTA) convolutional and transformer-based models, was evaluated using the Stanford Cars dataset. Our proposed model achieved the highest accuracy, 90.69%, among the compared models.
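A parameter-free module that assigns one weight to every element of a channel-by-height-by-width feature map can be sketched as follows. The abstract does not name the module it uses; the formula below follows the published SimAM formulation as one known example of this kind, and that choice, the function names, and the toy feature map are assumptions.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def simam_style_attention(feature_map, lam=1e-4):
    """Reweight a C x H x W nested list with per-element weights and no learned parameters."""
    refined = []
    for channel in feature_map:
        flat = [v for row in channel for v in row]
        n = len(flat) - 1                          # needs at least 2 elements per channel
        mu = sum(flat) / len(flat)
        sq_dev = [[(v - mu) ** 2 for v in row] for row in channel]
        variance = sum(d for row in sq_dev for d in row) / n
        refined.append([
            [x * sigmoid(d / (4.0 * (variance + lam)) + 0.5) for x, d in zip(row, drow)]
            for row, drow in zip(channel, sq_dev)
        ])
    return refined

# Toy 1 x 2 x 2 feature map (assumed) with one element that stands out.
features = [[[1.0, 1.0], [1.0, 5.0]]]
print(simam_style_attention(features))
```

Elements that deviate most from their channel mean receive the largest weight, so the distinctive activation keeps a larger share of its magnitude than the uniform background, and no extra weights are added to the model.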

[CV-21] LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models

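[Quick read]: This paper addresses the difficulty large multimodal models (LMMs) have in generating long video descriptions. Although these models can process videos more than an hour long, they struggle to produce outputs of corresponding richness. Using video captioning as a proxy task, the paper finds that open-source LMMs struggle to consistently generate outputs exceeding about 300 words. Its key solution is LongCaption-Agent, a framework that synthesizes long caption data by aggregating multi-level descriptions, which is used to curate a new long-caption dataset, LongCaption-10K. The paper also develops LongCaption-Bench, a benchmark for comprehensively evaluating the quality of long captions generated by LMMs. Incorporating LongCaption-10K into training enables LMMs to generate high-quality captions exceeding 1,000 words.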

链接: https://arxiv.org/abs/2502.15393
作者: Hongchen Wei,Zhihong Tan,Yaosi Hu,Changwen Chen,Zhenzhong Chen
机构: Wuhan University (武汉大学); Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Large multimodal models (LMMs) have shown remarkable performance in video understanding tasks and can even process videos longer than one hour. However, despite their ability to handle long inputs, generating outputs with corresponding levels of richness remains a challenge. In this paper, we explore the issue of long outputs in LMMs using video captioning as a proxy task, and we find that open-source LMMs struggle to consistently generate outputs exceeding about 300 words. Through controlled experiments, we find that the scarcity of paired examples with long-captions during training is the primary factor limiting the model’s output length. However, manually annotating long-caption examples is time-consuming and expensive. To address this, we propose the LongCaption-Agent, a framework that synthesizes long caption data by aggregating multi-level descriptions. Using LongCaption-Agent, we curated a new long-caption dataset, LongCaption-10K. We also develop LongCaption-Bench, a benchmark designed to comprehensively evaluate the quality of long captions generated by LMMs. By incorporating LongCaption-10K into training, we enable LMMs to generate captions exceeding 1,000 words, while maintaining high output quality. In LongCaption-Bench, our 8B parameter model achieved state-of-the-art performance, even surpassing larger proprietary models. We will release the dataset and code after publication.

[CV-22] The Role of Background Information in Reducing Object Hallucination in Vision-Language Models: Insights from Cutoff API Prompting

[Quick read]: This paper addresses the problem that vision-language models (VLMs) occasionally generate outputs that contradict the input image, which limits their reliability in real-world applications. It analyzes success and failure cases of attention-driven visual prompting for object hallucination and finds that preserving background context is crucial for mitigating it.

Link: https://arxiv.org/abs/2502.15389
Authors: Masayo Tomita,Katsuhiko Hayashi,Tomoyuki Kaneko
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Vision-Language Models (VLMs) occasionally generate outputs that contradict input images, constraining their reliability in real-world applications. While visual prompting is reported to suppress hallucinations by augmenting prompts with the relevant area inside an image, its effectiveness with respect to the chosen area remains uncertain. This study analyzes success and failure cases of attention-driven visual prompting in object hallucination, revealing that preserving background context is crucial for mitigating object hallucination.

[CV-23] MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

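[Quick read]: This paper aims to improve the performance of multimodal language models (MLMs) on domain-specific tasks. Existing approaches usually rely on a single pre-trained vision encoder and overlook the strengths that different encoders have in specific domains. The key solution is MOVE (Mixture of Vision Encoders), a simple yet effective method that uses multiple pre-trained encoders by automatically routing each input to the most appropriate one (candidates include Unichat, InternViT, and Texify), improving results on diverse benchmarks such as ChartQA, MMBench, and MMMU. The method achieves competitive accuracy without the added complexity of image slicing for high-resolution images.

Link: https://arxiv.org/abs/2502.15381
Authors: Matvey Skripkin,Elizaveta Goncharova,Dmitrii Tarasov,Andrey Kuznetsov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, 4 tables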

Abstract:Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a specific adapter. While existing approaches commonly rely on a single pre-trained vision encoder, there is a wide variety of specialized encoders that can boost a model’s performance in distinct domains. In this work, we propose MOVE (Mixture of Vision Encoders), a simple yet effective approach to leverage multiple pre-trained encoders for specialized multimodal tasks. MOVE automatically routes inputs to the most appropriate encoder among candidates such as Unichat, InternViT, and Texify, thereby enhancing performance across a diverse set of benchmarks, including ChartQA, MMBench, and MMMU. Experimental results demonstrate that MOVE achieves competitive accuracy without incurring the complexities of image slicing for high-resolution images.

[CV-24] Weakly Supervised Video Scene Graph Generation via Natural Language Supervision ICLR2025

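[Quick read]: This paper addresses two key problems that arise when video scene graph generation (VidSGG) adopts weak supervision: temporality within video captions and variability in action duration. It proposes a Natural Language-based Video Scene Graph Generation (NL-VSGG) framework with two key modules: a Temporality-aware Caption Segmentation (TCS) module and an Action Duration Variability-aware caption-frame alignment (ADV) module. With these two modules, NL-VSGG can train effectively on readily available video captions and significantly improves performance.

Link: https://arxiv.org/abs/2502.15370
Authors: Kibum Kim,Kanghoon Yoon,Yeonjun In,Jaehyeong Jeon,Jinyoung Moon,Donghyun Kim,Chanyoung Park
Affiliations: KAIST; ETRI; Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, ICLR 2025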

Abstract:Existing Video Scene Graph Generation (VidSGG) studies are trained in a fully supervised manner, which requires all frames in a video to be annotated, thereby incurring high annotation cost compared to Image Scene Graph Generation (ImgSGG). Although the annotation cost of VidSGG can be alleviated by adopting a weakly supervised approach commonly used for ImgSGG (WS-ImgSGG) that uses image captions, there are two key reasons that hinder such a naive adoption: 1) Temporality within video captions, i.e., unlike image captions, video captions include temporal markers (e.g., before, while, then, after) that indicate time related details, and 2) Variability in action duration, i.e., unlike human actions in image captions, human actions in video captions unfold over varying duration. To address these issues, we propose a Natural Language-based Video Scene Graph Generation (NL-VSGG) framework that only utilizes the readily available video captions for training a VidSGG model. NL-VSGG consists of two key modules: Temporality-aware Caption Segmentation (TCS) module and Action Duration Variability-aware caption-frame alignment (ADV) module. Specifically, TCS segments the video captions into multiple sentences in a temporal order based on a Large Language Model (LLM), and ADV aligns each segmented sentence with appropriate frames considering the variability in action duration. Our approach leads to a significant enhancement in performance compared to simply applying the WS-ImgSGG pipeline to VidSGG on the Action Genome dataset. As a further benefit of utilizing the video captions as weak supervision, we show that the VidSGG model trained by NL-VSGG is able to predict a broader range of action classes that are not included in the training data, which makes our framework practical in reality.

[CV-25] M2LADS Demo: A System for Generating Multimodal Learning Analytics Dashboards AAAI2025

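[Quick read]: This paper addresses how to effectively integrate, synchronize, visualize, and analyze multimodal data recorded during computer-based learning. The key solution is a web-based system called M2LADS (System for Generating Multimodal Learning Analytics Dashboards), which integrates multiple data sources, including electroencephalogram (EEG) data, heart rate metrics, eye-tracking data, webcam video recordings, and activity logs, and visualizes them on web dashboards, giving a comprehensive view of participants' experiences and making data relabeling and analysis easier.

Link: https://arxiv.org/abs/2502.15363
Authors: Alvaro Becerra,Roberto Daza,Ruth Cobos,Aythami Morales,Julian Fierrez
Affiliations: GHIA, School of Engineering, Universidad Autónoma de Madrid; BiDA-Lab, School of Engineering, Universidad Autónoma de Madrid
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in the Workshop on Innovation and Responsibility in AI-Supported Education (iRAISE25) at AAAI 2025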

Abstract:We present a demonstration of a web-based system called M2LADS (“System for Generating Multimodal Learning Analytics Dashboards”), designed to integrate, synchronize, visualize, and analyze multimodal data recorded during computer-based learning sessions with biosensors. This system presents a range of biometric and behavioral data on web-based dashboards, providing detailed insights into various physiological and activity-based metrics. The multimodal data visualized include electroencephalogram (EEG) data for assessing attention and brain activity, heart rate metrics, eye-tracking data to measure visual attention, webcam video recordings, and activity logs of the monitored tasks. M2LADS aims to assist data scientists in two key ways: (1) by providing a comprehensive view of participants’ experiences, displaying all data categorized by the activities in which participants are engaged, and (2) by synchronizing all biosignals and videos, facilitating easier data relabeling if any activity information contains errors.
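The synchronization step described in this abstract can be illustrated with a small sketch. The abstract does not say how M2LADS aligns its streams, so the nearest-timestamp resampling below is an assumption, and the stream names (`heart_rate`, `eeg_attention`) and the helper functions are hypothetical.

```python
from bisect import bisect_left


def nearest_sample(timestamps, values, t):
    """Return the value whose timestamp is closest to t (timestamps sorted ascending)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return values[0]
    if i == len(timestamps):
        return values[-1]
    before, after = timestamps[i - 1], timestamps[i]
    # Ties go to the earlier sample.
    return values[i] if (after - t) < (t - before) else values[i - 1]


def synchronize(reference_times, streams):
    """Resample every stream onto a shared reference clock (hypothetical helper)."""
    rows = []
    for t in reference_times:
        row = {"t": t}
        for name, (timestamps, values) in streams.items():
            row[name] = nearest_sample(timestamps, values, t)
        rows.append(row)
    return rows


# Two made-up streams sampled at different rates and offsets (seconds).
streams = {
    "heart_rate": ([0.0, 1.0, 2.0], [70, 72, 75]),
    "eeg_attention": ([0.1, 0.6, 1.1, 1.6], [0.40, 0.55, 0.50, 0.65]),
}
aligned = synchronize([0.0, 0.5, 1.0, 1.5], streams)
```

Once every stream shares one clock, a label attached to a time interval applies to all signals at once, which is why synchronization makes relabeling easier.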

[CV-26] PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments

[Quick read]: This paper addresses the significant limitations of existing perception models in detecting and predicting pedestrians in semi-structured environments, especially where pedestrian motion is diverse and occlusion is frequent. The key solution is a new Hybrid Multi-Scale Fusion Network (HMFN), which captures and fuses multi-scale features through a carefully designed hybrid framework that combines sparse and vanilla convolutions. To support this research, the authors also build a multimodal dataset, the Pedestrian-Focused Scene Dataset (PFSD), with rich annotations to facilitate pedestrian detection in complex scenes.

Link: https://arxiv.org/abs/2502.15342
Authors: Yueting Liu,Hanshi Wang,Yunfei Lei,Zhengjun Zha,Weiming Hu,Jin Gao
Affiliations: School of Information Science and Technology, University of Science and Technology of China; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing University of Aeronautics and Astronautics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in autonomous driving perception have revealed exceptional capabilities within structured environments dominated by vehicular traffic. However, current perception models exhibit significant limitations in semi-structured environments, where dynamic pedestrians with more diverse, irregular movement and occlusion prevail. We attribute this shortcoming to the scarcity of high-quality datasets in semi-structured scenes, particularly concerning pedestrian perception and prediction. In this work, we present the multi-modal Pedestrian-Focused Scene Dataset (PFSD), rigorously annotated in semi-structured scenes in the nuScenes format. PFSD provides comprehensive multi-modal data annotations with point cloud segmentation, detection, and object IDs for tracking. It encompasses over 130,000 pedestrian instances captured across various scenarios with varying densities, movement patterns, and occlusions. Furthermore, to demonstrate the importance of addressing the challenges posed by more diverse and complex semi-structured environments, we propose a novel Hybrid Multi-Scale Fusion Network (HMFN). Specifically, to detect pedestrians in densely populated and occluded scenarios, our method effectively captures and fuses multi-scale features using a meticulously designed hybrid framework that integrates sparse and vanilla convolutions. Extensive experiments on PFSD demonstrate that HMFN improves mean Average Precision (mAP) over existing methods, thereby underscoring its efficacy in addressing the challenges of 3D pedestrian detection in complex semi-structured environments. Code and benchmark are available.

[CV-27] SentiFormer: Metadata Enhanced Transformer for Image Sentiment Analysis

[Quick read]: This paper addresses the fact that, in image sentiment analysis, metadata beyond visual features (such as text descriptions and keyword tags) has not been sufficiently explored. The key solution is a new metadata-enhanced Transformer model, SentiFormer, which fuses multiple kinds of metadata with the corresponding image in a unified framework, uses an adaptive relevance learning module to dynamically adjust the importance of each kind of metadata, and adds a cross-modal fusion module for the final prediction.

Link: https://arxiv.org/abs/2502.15322
Authors: Bin Feng,Shulan Ruan,Mingzheng Yang,Dongxuan Han,Huijie Liu,Kai Zhang,Qi Liu
Affiliations: University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence; Shenzhen International Graduate School, Tsinghua University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:As more and more internet users post images online to express their daily emotions, image sentiment analysis has attracted increasing attention. Recently, researchers generally tend to design different neural networks to extract visual features from images for sentiment analysis. Despite the significant progress, metadata, the data (e.g., text descriptions and keyword tags) for describing the image, has not been sufficiently explored in this task. In this paper, we propose a novel Metadata Enhanced Transformer for sentiment analysis (SentiFormer) to fuse multiple metadata and the corresponding image into a unified framework. Specifically, we first obtain multiple metadata of the image and unify the representations of diverse data. To adaptively learn the appropriate weights for each metadata, we then design an adaptive relevance learning module to highlight more effective information while suppressing weaker ones. Moreover, we further develop a cross-modal fusion module to fuse the adaptively learned representations and make the final prediction. Extensive experiments on three publicly available datasets demonstrate the superiority and rationality of our proposed method.

[CV-28] Research advances on fish feeding behavior recognition and intensity quantification methods in aquaculture

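[Quick read]: This paper reviews methods for fish feeding behavior recognition and intensity quantification. It covers single-modality methods based on computer vision, acoustics, and sensors, along with emerging multimodal fusion techniques, with the goal of improving fish health monitoring, guidance for baiting work, and aquaculture efficiency.

Link: https://arxiv.org/abs/2502.15311
Authors: Shulong Zhang,Daoliang Li,Jiayin Zhao,Mingyuan Yao,Yingyi Chen,Yukang Huo,Xiao Liu,Haihua Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: 22 pages, 4 figures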

Abstract:As a key part of aquaculture management, fish feeding behavior recognition and intensity quantification has been a hot area of great concern to researchers, and it plays a crucial role in monitoring fish health, guiding baiting work and improving aquaculture efficiency. In order to better carry out the related work in the future, this paper firstly reviews the research advances of fish feeding behavior recognition and intensity quantification methods based on computer vision, acoustics and sensors in a single modality. Then the application of the current emerging multimodal fusion in fish feeding behavior recognition and intensity quantification methods is expounded. Finally, the advantages and disadvantages of various techniques are compared and analyzed, and the future research directions are envisioned.

[CV-29] Road Traffic Sign Recognition method using Siamese network Combining Efficient-CNN based Encoder

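[Quick read]: This paper addresses real-time, high-accuracy, robust traffic sign recognition (TSR) in complex environments, particularly under motion blur and occlusion. The key solution is a network called IECES-network, which consists of improved encoders and a Siamese architecture. The method extracts features with Efficient-CNN based encoders and uses a Siamese neural network with a contrastive loss function to make the model more robust to motion-blurred and occluded samples. The template branch can be disabled after training to speed up real-time recognition and reduce computational resources and parameter count. Finally, the feature codes are combined with a SoftMax classification layer to recognize the traffic sign category.

Link: https://arxiv.org/abs/2502.15307
Authors: Zhenghao Xi,Yuchao Shao,Yang Zheng,Xiang Liu,Yaqi Liu,Yitong Cai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: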

Abstract:Traffic sign recognition (TSR) plays an essential role in assisted driving and intelligent transportation systems. However, noise in complex environments can cause motion blur or occlusion, which makes accurate and robust real-time recognition challenging. In this article, we propose IECES-network, which combines improved encoders with a Siamese network. Our method has three stages: Efficient-CNN based encoders, a Siamese backbone, and fully-connected layers. We first use convolutional encoders to extract and encode the traffic sign features of augmented training samples and standard images. Then, we design the Siamese neural network with an Efficient-CNN based encoder and a contrastive loss function, which can be trained to improve the robustness of TSR on motion-blurred and occluded samples by computing the distance between inputs and templates. Additionally, the template branch of the proposed network can be disabled when executing recognition tasks after training, which raises the processing speed of our real-time model and reduces the computational resources and parameter scale. Finally, we combine the feature code with a fully-connected layer and a SoftMax function to classify the codes of samples and recognize the category of traffic signs. Experiments on the Tsinghua-Tencent 100K dataset and the German Traffic Sign Recognition Benchmark dataset demonstrate the performance of the proposed IECES-network. Compared with other state-of-the-art methods, under motion blur and occlusion the proposed method achieves competitive performance, with precision, recall, and accuracy averaging 88.1%, 86.43%, and 86.1%, respectively, at a lightweight scale of 2.9M. Moreover, the processing time of our model is 0.1s per frame, with speed increased by 1.5 times compared with existing methods.
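For readers unfamiliar with the contrastive loss mentioned in this abstract, here is a minimal sketch of the commonly used margin-based form. The abstract does not give the exact loss used in IECES-network, so the formula, the `margin` default, and the function names below are assumptions for illustration only.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(sample_embedding, template_embedding, same_class, margin=1.0):
    """Margin-based contrastive loss for one (sample, template) pair.

    Matching pairs are penalized by their squared distance, so they are pulled
    together. Non-matching pairs are penalized only while they are closer than
    `margin`, so they are pushed apart until the margin is reached.
    """
    d = euclidean_distance(sample_embedding, template_embedding)
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# A blurred sample far from its own template is penalized heavily.
loss_match = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=True)
# A sample already far from a different sign's template incurs no loss.
loss_far_mismatch = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False)
# A sample too close to a different sign's template is pushed away.
loss_near_mismatch = contrastive_loss([0.0], [0.5], same_class=False)
```

Because the loss depends only on distances between embeddings, the template embeddings are needed during training but not at inference, which is consistent with the abstract's remark that the template branch can be switched off afterwards.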

[CV-30] A Novel Riemannian Sparse Representation Learning Network for Polarimetric SAR Image Classification

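[Quick read]: This paper addresses two problems in PolSAR image classification: deep learning methods lack guidance from mathematical principles, and feature learning in Euclidean space distorts matrix structure. Existing methods process the complex covariance matrix in Euclidean space and cannot accurately measure the geometric distance between HPD matrices, which easily leads to misclassification. To address this, the paper proposes a novel Riemannian sparse representation learning network (SRSR CNN). The key is a superpixel-based Riemannian Sparse Representation (SRSR) model that learns sparse features with a Riemannian metric; its optimization procedure is unfolded into SRSRnet, which automatically learns the sparse coefficients and dictionary atoms. A CNN-enhanced module is added to learn contextual high-level features and further improve classification. The network takes the covariance matrix directly as input and uses a Riemannian metric to learn the geometric structure and sparse features of complex matrices in Riemannian space.

Link: https://arxiv.org/abs/2502.15302
Authors: Junfei Shi,Mengmeng Nie,Weisi Lin,Haiyan Jin,Junhuai Li,Rui Wang
Affiliations: Department of Computer Science and Technology, Shaanxi Key Laboratory for Network Computing and Security Technology, Xi’an University of Technology, Xi’an, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures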

Abstract:Deep learning is an effective end-to-end method for Polarimetric Synthetic Aperture Radar(PolSAR) image classification, but it lacks the guidance of related mathematical principle and is essentially a black-box model. In addition, existing deep models learn features in Euclidean space, where PolSAR complex matrix is commonly converted into a complex-valued vector as the network input, distorting matrix structure and channel relationship. However, the complex covariance matrix is Hermitian positive definite (HPD), and resides on a Riemannian manifold instead of a Euclidean one. Existing methods cannot measure the geometric distance of HPD matrices and easily cause some misclassifications due to inappropriate Euclidean measures. To address these issues, we propose a novel Riemannian Sparse Representation Learning Network (SRSR CNN) for PolSAR images. Firstly, a superpixel-based Riemannian Sparse Representation (SRSR) model is designed to learn the sparse features with Riemannian metric. Then, the optimization procedure of the SRSR model is inferred and further unfolded into an SRSRnet, which can automatically learn the sparse coefficients and dictionary atoms. Furthermore, to learn contextual high-level features, a CNN-enhanced module is added to improve classification performance. The proposed network is a Sparse Representation (SR) guided deep learning model, which can directly utilize the covariance matrix as the network input, and utilize Riemannian metric to learn geometric structure and sparse features of complex matrices in Riemannian space. Experiments on three real PolSAR datasets demonstrate that the proposed method surpasses state-of-the-art techniques in ensuring accurate edge details and correct region homogeneity for classification.
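To make the contrast between Euclidean and Riemannian measures concrete, the following compares the two kinds of distance for HPD matrices $A$ and $B$. The abstract does not state which Riemannian metric the paper uses; the affine-invariant metric shown here is one widely used choice and is an assumption.

```latex
% Euclidean (Frobenius) distance: treats the matrices as flat vectors
d_E(A, B) = \lVert A - B \rVert_F

% Affine-invariant Riemannian distance on the HPD manifold (assumed choice)
d_R(A, B) = \bigl\lVert \log\!\bigl( A^{-1/2} \, B \, A^{-1/2} \bigr) \bigr\rVert_F
```

Here $\log$ is the matrix logarithm and $A^{-1/2}$ is the inverse of the Hermitian square root of $A$. Unlike $d_E$, the distance $d_R$ is unchanged under congruence transformations $A \mapsto G A G^{H}$, $B \mapsto G B G^{H}$ for any invertible $G$.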

[CV-31] Soybean pod and seed counting in both outdoor fields and indoor laboratories using unions of deep neural networks

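[Quick read]: This paper addresses accurate counting of soybean pods and seeds in both outdoor fields and indoor laboratories. The key is developing efficient deep learning models. For outdoor fields, YOLO is strengthened by annotating both visible and occluded seeds and is combined with HQ-SAM (YOLO-SAM) and domain adaptation (YOLO-DA) to improve robustness and generalization. For indoor laboratories, Mask-RCNN supplemented with a Swin Transformer module (Mask-RCNN-Swin) is trained on synthetic images generated from a small set of labeled data, achieving near-perfect counting accuracy.

Link: https://arxiv.org/abs/2502.15286
Authors: Tianyou Jiang,Mingshun Shao,Tianyi Zhang,Xiaoyu Liu,Qun Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: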

Abstract:Automatically counting soybean pods and seeds in outdoor fields allows for rapid yield estimation before harvesting, while indoor laboratory counting offers greater accuracy. Both methods can significantly accelerate the breeding process. However, it remains challenging to accurately count pods and seeds in outdoor fields, and there are still no sufficiently accurate tools for counting pods and seeds in laboratories. In this study, we developed efficient deep learning models for counting soybean pods and seeds in both outdoor fields and indoor laboratories. For outdoor fields, annotating not only visible seeds but also occluded seeds enables YOLO to estimate the number of occluded soybean seeds. Moreover, we enhanced the YOLO architecture by integrating it with HQ-SAM (YOLO-SAM) and domain adaptation techniques (YOLO-DA) to improve model robustness and generalization across soybean images taken in outdoor fields. Testing on soybean images from the outdoor field, we achieved a mean absolute error (MAE) of 6.13 for pod counting and 10.05 for seed counting. For the indoor setting, we utilized Mask-RCNN supplemented with a Swin Transformer module (Mask-RCNN-Swin); models were trained exclusively on synthetic training images generated from a small set of labeled data. This approach resulted in near-perfect accuracy, with an MAE of 1.07 for pod counting and 1.33 for seed counting across actual laboratory images from two distinct studies.
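The counting results in this abstract are reported as mean absolute error (MAE), the average absolute difference between predicted and true counts per image. A short sketch of the metric follows; the function name and the example counts are made up for illustration and are not data from the paper.

```python
def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true per-image counts."""
    if len(predicted_counts) != len(true_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(true_counts)


# Made-up pod counts for three images: errors of 2, 2, and 1 pods.
example_mae = mean_absolute_error([48, 52, 61], [50, 50, 60])
```

An MAE of 1.07 for pods therefore means the predicted count is off by about one pod per image on average, regardless of whether the model over- or under-counts.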

[CV-32] CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models

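[Quick read]: This paper addresses how to assess substantial similarity between AI-generated images and copyrighted works in order to help resolve copyright disputes. The key solution is CopyJudge, an automated copyright infringement identification framework that uses large vision-language models (LVLMs) to simulate the substantial-similarity assessment of practical court processes. Specifically, CopyJudge uses an abstraction-filtration-comparison test framework with a multi-LVLM debate mechanism to assess the likelihood of infringement and provide detailed judgment rationales. Based on the judgments, it introduces a general LVLM-based mitigation strategy that automatically optimizes infringing prompts, avoiding sensitive expressions while preserving non-infringing content.

Link: https://arxiv.org/abs/2502.15278
Authors: Shunchang Liu,Zhuan Shi,Lingjuan Lyu,Yaochu Jin,Boi Faltings
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 8 figures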

Abstract:Assessing whether AI-generated images are substantially similar to copyrighted works is a crucial step in resolving copyright disputes. In this paper, we propose CopyJudge, an automated copyright infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and those generated by text-to-image diffusion models. Specifically, we employ an abstraction-filtration-comparison test framework with multi-LVLM debate to assess the likelihood of infringement and provide detailed judgment rationales. Based on the judgments, we further introduce a general LVLM-based mitigation strategy that automatically optimizes infringing prompts by avoiding sensitive expressions while preserving the non-infringing content. Besides, our approach can be enhanced by exploring non-infringing noise vectors within the diffusion latent space via reinforcement learning, even without modifying the original prompts. Experimental results show that our identification method achieves comparable state-of-the-art performance, while offering superior generalization and interpretability across various forms of infringement, and that our mitigation method could more effectively mitigate memorization and IP infringement without losing non-infringing expressions.

[CV-33] Omnidirectional Image Quality Captioning: A Large-scale Database and A New Model

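[Quick read]: This paper addresses the effectiveness of omnidirectional image quality assessment (OIQA) under heterogeneous distortion. Existing methods have mainly been developed and tested on homogeneously distorted omnidirectional images and are hard to apply directly to heterogeneously distorted ones. The key solution is a large-scale database, OIQ-10K, containing 10,000 omnidirectional images with both homogeneous and heterogeneous distortions, together with a new multitask-derived adaptive feature-tailoring OIQA model, IQCaption360, which generates a quality caption for an omnidirectional image in the form of a textual template. Experiments show that IQCaption360 significantly outperforms existing state-of-the-art methods on the proposed OIQ-10K database.

Link: https://arxiv.org/abs/2502.15271
Authors: Jiebin Yan,Ziwen Tan,Yuming Fang,Junjie Chen,Wenhui Jiang,Zhou Wang
Affiliations: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, Nanchang 330032, Jiangxi, China; Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: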

Abstract:The fast growing application of omnidirectional images calls for effective approaches for omnidirectional image quality assessment (OIQA). Existing OIQA methods have been developed and tested on homogeneously distorted omnidirectional images, but it is hard to transfer their success directly to the heterogeneously distorted omnidirectional images. In this paper, we conduct the largest study so far on OIQA, where we establish a large-scale database called OIQ-10K containing 10,000 omnidirectional images with both homogeneous and heterogeneous distortions. A comprehensive psychophysical study is elaborated to collect human opinions for each omnidirectional image, together with the spatial distributions (within local regions or globally) of distortions, and the head and eye movements of the subjects. Furthermore, we propose a novel multitask-derived adaptive feature-tailoring OIQA model named IQCaption360, which is capable of generating a quality caption for an omnidirectional image in a manner of textual template. Extensive experiments demonstrate the effectiveness of IQCaption360, which outperforms state-of-the-art methods by a significant margin on the proposed OIQ-10K database. The OIQ-10K database and the related source codes are available at this https URL.

[CV-34] SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training ICLR2025

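[Quick read]: This paper addresses the shortcomings of pre-training for 3D hand pose estimation on large-scale in-the-wild images. Existing methods do not fully exploit the diverse hand images available from in-the-wild videos. The paper proposes SimHand, a framework that pre-trains on a large pool of in-the-wild hand images with contrastive learning. The key is collecting more than 2 million hand images and designing a new contrastive learning method that focuses on non-identical samples with similar hand poses, embedding similar hand pairs closer together in feature space. The method also adaptively weights the contrastive loss to further improve performance. Experiments show that SimHand significantly outperforms existing methods on several datasets, including FreiHand, DexYCB, and AssemblyHands.

Link: https://arxiv.org/abs/2502.15251
Authors: Nie Lin,Takehiko Ohkawa,Yifei Huang,Mingfang Zhang,Minjie Cai,Ming Li,Ryosuke Furuta,Yoichi Sato
Affiliations: The University of Tokyo; Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025. arXiv admin note: text overlap with arXiv:2409.09714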

Abstract:We present a framework for pre-training of 3D hand pose estimation from in-the-wild hand images sharing similar hand characteristics, dubbed SimHand. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our pre-training method with contrastive learning. Specifically, we collect over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of non-identical samples with similar hand poses. We then propose a novel contrastive learning method that embeds similar hand pairs closer in the feature space. Our method not only learns from similar samples but also adaptively weights the contrastive learning loss based on inter-sample distance, leading to additional performance gains. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs solely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method (PeCLR) in various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands. Our code is available at this https URL.

[CV-35] An ocean front detection and tracking algorithm

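[Quick read]: This paper addresses discontinuity and inaccuracy in the detection and tracking of large-scale ocean fronts, and proposes an automatic front detection and tracking algorithm based on Bayesian decision and metric space. The key is introducing front merging, filling, and ring deletion to enhance continuity, and defining the distance between fronts on different days so that it is well-defined in a metric space for functional analysis. These techniques can be transferred to other areas of computer vision, such as edge detection and tracking.

Link: https://arxiv.org/abs/2502.15250
Authors: Yishuo Wang,Feng Zhou
Affiliations: School of Oceanography, Shanghai Jiao Tong University, China; State Key Laboratory of Satellite Ocean Environment Dynamics, Second Institute of Oceanography, MNR, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: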

Abstract:An ocean front is defined as the interface between different water masses and plays a vital role in the evolution of many physical phenomena. Previous detection methods are based on histograms, the Lyapunov exponent, gradients, and machine learning. These algorithms, however, introduce discontinuity and inaccuracy, use less information, or merely approach traditional results. Moreover, the automatic front tracking algorithms in preceding studies are not open source. This paper focuses on large-scale ocean fronts and proposes an automatic front detection and tracking algorithm based on Bayesian decision and metric space. In this algorithm, front merging, filling, and ring deletion are put forward to enhance continuity. The distance between fronts on different days is defined for the first time and is well-defined in a metric space for functional analysis. These techniques can be transferred to other areas of computer vision such as edge detection and tracking.

[CV-36] AutoMR: A Universal Time Series Motion Recognition Pipeline

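[Quick read]: This paper addresses two main problems: the variability of sensor data formats and parameters across multimodal datasets, which usually requires task-specific machine learning implementations, and the complexity and time cost of the hyperparameter tuning needed for optimal model performance. The key solution is an end-to-end automated motion recognition (AutoMR) pipeline that integrates data preprocessing, model training, hyperparameter tuning, and evaluation, uses QuartzNet as the core model, and provides comprehensive metrics tracking. It demonstrates its effectiveness on 10 diverse datasets, reaching state-of-the-art performance.

Link: https://arxiv.org/abs/2502.15228
Authors: Likun Zhang,Sicheng Yang,Zhuo Wang,Haining Liang,Junxiao Shen
Affiliations: X-Intelligence Labs; University of California, Berkeley; HKUST (Guangzhou); University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 figures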

Abstract:In this paper, we present an end-to-end automated motion recognition (AutoMR) pipeline designed for multimodal datasets. The proposed framework seamlessly integrates data preprocessing, model training, hyperparameter tuning, and evaluation, enabling robust performance across diverse scenarios. Our approach addresses two primary challenges: 1) variability in sensor data formats and parameters across datasets, which traditionally requires task-specific machine learning implementations, and 2) the complexity and time consumption of hyperparameter tuning for optimal model performance. Our library features an all-in-one solution incorporating QuartzNet as the core model, automated hyperparameter tuning, and comprehensive metrics tracking. Extensive experiments demonstrate its effectiveness on 10 diverse datasets, achieving state-of-the-art performance. This work lays a solid foundation for deploying motion-capture solutions across varied real-world applications.

[CV-37] FlipConcept: Tuning-Free Multi-Concept Personalization for Text-to-Image Generation

【速读】：该论文旨在解决在复杂场景中多对象文本到图像（Text-to-Image, T2I）生成方法因非个性化区域失真而导致性能下降的问题。解决方案的关键在于引入FlipConcept方法，通过指导外观注意力（guided appearance attention）准确模仿个性化概念的外观，采用掩码引导噪声混合（mask-guided noise mixing）保护非个性化区域，并应用背景稀释（background dilution）以最小化属性泄漏。这些技术共同确保了在无需额外调优的情况下，该方法能够优于现有模型，在单个及多个个性化概念推断中均表现出色。

Link: https://arxiv.org/abs/2502.15203
Authors: Young Beom Woo,Sun Eung Kim
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures

Abstract:Recently, methods that integrate multiple personalized concepts into a single image have garnered significant attention in the field of text-to-image (T2I) generation. However, existing methods experience performance degradation in complex scenes with multiple objects due to distortions in non-personalized regions. To address this issue, we propose FlipConcept, a novel approach that seamlessly integrates multiple personalized concepts into a single image without requiring additional tuning. We introduce guided appearance attention to accurately mimic the appearance of a personalized concept as intended. Additionally, we introduce mask-guided noise mixing to protect non-personalized regions during editing. Lastly, we apply background dilution to minimize attribute leakage, which is the undesired blending of personalized concept attributes with other objects in the image. In our experiments, we demonstrate that the proposed method, despite not requiring tuning, outperforms existing models in both single and multiple personalized concept inference.

[CV-38] UrbanSAM: Learning Invariance-Inspired Adapters for Segment Anything Models in Urban Construction

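[Quick read]: This paper addresses the challenge of extracting and segmenting objects of diverse shapes and varying scales in complex urban environments from remote sensing images. The key solution is UrbanSAM, a customized Segment Anything Model (SAM) designed specifically to analyze complex urban environments while handling scaling effects from remotely sensed observations. UrbanSAM introduces a novel learnable prompter equipped with a Uscaling-Adapter that adheres to the invariance criterion, capturing multiscale contextual information of objects and adapting to arbitrary scale variations. In addition, features from the Uscaling-Adapter and the trunk encoder are aligned through a masked cross-attention operation, so that the trunk encoder inherits the adapter's multiscale aggregation capability, which enhances segmentation performance.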

Link: https://arxiv.org/abs/2502.15199
Authors: Chenyu Li,Danfeng Hong,Bing Zhang,Yuxuan Li,Gustau Camps-Valls,Xiao Xiang Zhu,Jocelyn Chanussot
Affiliations: School of Mathematics and Statistics, Southeast University; Aerospace Information Research Institute, Chinese Academy of Sciences; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences; Image Processing Laboratory (IPL), Universitat de València; Data Science in Earth Observation, Technical University of Munich; Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object extraction and segmentation from remote sensing (RS) images is a critical yet challenging task in urban environment monitoring. Urban morphology is inherently complex, with irregular objects of diverse shapes and varying scales. These challenges are amplified by heterogeneity and scale disparities across RS data sources, including sensors, platforms, and modalities, making accurate object segmentation particularly demanding. While the Segment Anything Model (SAM) has shown significant potential in segmenting complex scenes, its performance in handling form-varying objects remains limited due to manual-interactive prompting. To this end, we propose UrbanSAM, a customized version of SAM specifically designed to analyze complex urban environments while tackling scaling effects from remotely sensed observations. Inspired by multi-resolution analysis (MRA) theory, UrbanSAM incorporates a novel learnable prompter equipped with a Uscaling-Adapter that adheres to the invariance criterion, enabling the model to capture multiscale contextual information of objects and adapt to arbitrary scale variations with theoretical guarantees. Furthermore, features from the Uscaling-Adapter and the trunk encoder are aligned through a masked cross-attention operation, allowing the trunk encoder to inherit the adapter’s multiscale aggregation capability. This synergy enhances the segmentation performance, resulting in more powerful and accurate outputs, supported by the learned adapter. Extensive experimental results demonstrate the flexibility and superior segmentation performance of the proposed UrbanSAM on a global-scale dataset, encompassing scale-varying urban objects such as buildings, roads, and water.

[CV-39] Image Translation-Based Unsupervised Cross-Modality Domain Adaptation for Medical Image Segmentation

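[Quick read]: This paper addresses the challenges that supervised learning faces in medical imaging because annotation requires expertise and is costly, as well as the modality difference (domain shift) caused by different imaging devices and protocols. To address these problems, it proposes an unsupervised cross-modality domain adaptation method based on image translation: annotated source-modality images are translated into the unannotated target modality, and their annotations are used for supervised learning on the target modality. The key is using self-training to overcome the subtle differences between translated pseudo images and real images, further improving the performance of the deep learning task.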

Link: https://arxiv.org/abs/2502.15193
Authors: Tao Yang,Lisheng Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figure. arXiv admin note: substantial text overlap with arXiv:2303.07674

Abstract:Supervised deep learning usually faces more challenges in medical images than in natural images, since annotations in medical images require the expertise of doctors and are more time-consuming and expensive. Thus, some researchers turn to unsupervised learning methods, which usually face inevitable performance drops. In addition, medical images may have been acquired at different medical centers with different scanners and under different image acquisition protocols, so the modalities of the medical images are often inconsistent. This modality difference (domain shift) also reduces the applicability of deep learning methods. In this regard, we propose an unsupervised cross-modality domain adaptation method based on image translation by transforming the source modality image with annotation into the unannotated target modality and using its annotation to achieve supervised learning of the target modality. In addition, the subtle differences between translated pseudo images and real images are overcome by self-training methods to further improve the task performance of deep learning. The proposed method showed mean Dice Similarity Coefficient (DSC) and Average Symmetric Surface Distance (ASSD) of 0.8351 ± 0.1152 and 1.6712 ± 2.1948 for vestibular schwannoma (VS), and 0.8098 ± 0.0233 and 0.2317 ± 0.1577 for cochlea, on the VS and cochlea segmentation task of the Cross-Modality Domain Adaptation (crossMoDA 2022) challenge validation phase leaderboard.
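
The Dice Similarity Coefficient reported above is a standard overlap measure, DSC = 2|A ∩ B| / (|A| + |B|). As a minimal sketch only (the function name, the flat 0/1 mask representation, and the empty-mask convention are assumptions for illustration, not the authors' evaluation code), it can be computed as follows:

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count of positions where both masks are 1.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical toy masks: 2 overlapping foreground voxels, 3 foreground voxels each.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(round(dice_coefficient(pred, target), 4))
```

Challenge leaderboards typically average this score per structure over all cases, which is where a mean and a spread such as those quoted above would come from.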

[CV-40] Hierarchical Context Transformer for Multi-level Semantic Scene Understanding

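[Quick read]: This paper addresses multi-level semantic scene understanding (MSSU) of surgical scenes, which covers phase recognition, step recognition, and action and instrument detection. To this end, it proposes a novel hierarchical context transformer (HCT) network, with a hierarchical relation aggregation module (HRAM) that concurrently relates multi-level interaction information and augments task-specific features. It also introduces inter-task contrastive learning (ICL), which guides the model to learn task-wise features by absorbing complementary information from other tasks. To reduce computational cost, the HCT+ variant integrates spatial and temporal adapters to achieve competitive performance with fewer tunable parameters. Together these methods improve representation learning across tasks and show superior performance on multiple datasets.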

Link: https://arxiv.org/abs/2502.15184
Authors: Luoying Hao,Yan Hu,Yang Yue,Li Wu,Huazhu Fu,Jinming Duan,Jiang Liu
Affiliations: Research Institute of Trustworthy Autonomous Systems and Dept. of Computer Science and Engineering, Southern University of Science and Technology, China; School of Computer Science, University of Birmingham, UK; Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, UK; MGI Tech Co., Ltd., China; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by the IEEE TCSVT

Abstract:A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide systematic analysis to enable hierarchical surgical scene understanding. In this work, we propose to represent the task set [phase recognition – step recognition – action and instrument detection] as multi-level semantic scene understanding (MSSU). For this target, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across the different level tasks. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features via absorbing complementary information from other tasks. Furthermore, considering the computational costs of the transformer, we propose HCT+, which integrates spatial and temporal adapters to achieve competitive performance with substantially fewer tunable parameters. Extensive experiments on our cataract dataset and a publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding the state-of-the-art methods by a large margin. The code is available at this https URL.

[CV-41] OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework ICLR2025

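[Quick read]: This paper addresses the high computational demands of occupancy forecasting in complex traffic environments, with the aim of improving its feasibility and efficiency on edge devices. The key is a new framework named OccProphet, which comprises three lightweight components: Observer, Forecaster, and Refiner. By introducing Efficient 4D Aggregation with Tripling-Attention Fusion, OccProphet reduces computational cost by 58% to 78% and achieves a 2.6× speedup compared with the state-of-the-art Cam4DOcc, while attaining 4% to 18% relatively higher forecasting accuracy on the nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets.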

Link: https://arxiv.org/abs/2502.15180
Authors: Junliang Chen,Huaiyuan Xu,Yi Wang,Lap-Pui Chau
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICLR2025

Abstract:Predicting variations in complex traffic environments is crucial for the safety of autonomous driving. Recent advancements in occupancy forecasting have enabled forecasting future 3D occupied status in driving environments by observing historical 2D images. However, high computational demands make occupancy forecasting less efficient during training and inference stages, hindering its feasibility for deployment on edge agents. In this paper, we propose a novel framework, i.e., OccProphet, to efficiently and effectively learn occupancy forecasting with significantly lower computational requirements while improving forecasting accuracy. OccProphet comprises three lightweight components: Observer, Forecaster, and Refiner. The Observer extracts spatio-temporal features from 3D multi-frame voxels using the proposed Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster and Refiner conditionally predict and refine future occupancy inferences. Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets demonstrate that OccProphet is both training- and inference-friendly. OccProphet reduces 58% to 78% of the computational cost with a 2.6× speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves 4% to 18% relatively higher forecasting accuracy. Code and models are publicly available at this https URL.

[CV-42] Nonlinear Dynamical Systems for Automatic Face Annotation in Head Tracking and Pose Estimation

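[Quick read]: This paper evaluates the performance of the Extended Kalman Filter (EKF) and the Unscented Kalman Filter (UKF) in tracking 3D facial motion in both deterministic and stochastic settings. The key lies in analyzing how the two filters differ under noise: in a noise-free environment, UKF achieves lower mean squared error (MSE) by capturing higher-order nonlinearities, whereas once stochastic noise is introduced, EKF is more robust and maintains a lower MSE. By comparing the two filtering techniques under different conditions, the paper offers practical guidance for choosing an appropriate filter in 3D facial tracking applications such as motion capture and facial recognition.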

Link: https://arxiv.org/abs/2502.15179
Authors: Thoa Thieu,Roderick Melnik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 10 figures

Abstract:Facial landmark tracking plays a vital role in applications such as facial recognition, expression analysis, and medical diagnostics. In this paper, we consider the performance of the Extended Kalman Filter (EKF) and Unscented Kalman Filter (UKF) in tracking 3D facial motion in both deterministic and stochastic settings. We first analyze a noise-free environment where the state transition is purely deterministic, demonstrating that UKF outperforms EKF by achieving lower mean squared error (MSE) due to its ability to capture higher-order nonlinearities. However, when stochastic noise is introduced, EKF exhibits superior robustness, maintaining lower mean square error (MSE) compared to UKF, which becomes more sensitive to measurement noise and occlusions. Our results highlight that UKF is preferable for high-precision applications in controlled environments, whereas EKF is better suited for real-world scenarios with unpredictable noise. These findings provide practical insights for selecting the appropriate filtering technique in 3D facial tracking applications, such as motion capture and facial recognition.
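
To make the EKF side of this comparison concrete, here is a minimal scalar sketch of one predict/update cycle. It is a toy under stated assumptions: the one-dimensional state, the function and parameter names, and the example dynamics are all hypothetical and are not the 3D facial motion model used in the paper.

```python
import math


def ekf_step(x, P, z, f, f_prime, h, h_prime, Q, R):
    """One scalar EKF cycle: linearize f and h via their derivatives, then update."""
    # Predict: propagate the state and its variance through the dynamics.
    x_pred = f(x)
    F = f_prime(x)
    P_pred = F * P * F + Q
    # Update: correct the prediction with measurement z.
    H = h_prime(x_pred)
    S = H * P_pred * H + R          # innovation variance
    K = P_pred * H / S              # Kalman gain
    x_new = x_pred + K * (z - h(x_pred))
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new


# Hypothetical mildly nonlinear dynamics with a direct (identity) measurement.
x, P = 0.5, 1.0
for z in [0.6, 0.7, 0.65]:
    x, P = ekf_step(
        x, P, z,
        f=lambda s: s + 0.1 * math.sin(s),
        f_prime=lambda s: 1.0 + 0.1 * math.cos(s),
        h=lambda s: s,
        h_prime=lambda s: 1.0,
        Q=0.01, R=0.1,
    )
print(x, P)
```

A UKF would replace the two derivative calls with sigma points propagated through `f` and `h`, which is how it can capture higher-order nonlinear effects that a first-order linearization misses.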

[CV-43] Methods and Trends in Detecting Generated Images: A Comprehensive Review

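[Quick read]: This paper addresses the insufficient coverage of recent advances in synthetic image detection by prior reviews, in particular the lack of a comprehensive survey of methods that leverage multimodal frameworks for improved forensic analysis. The key is a systematic review of state-of-the-art synthetic image detection methods that categorizes them into meaningful taxonomies, together with an overview of publicly available large-scale datasets that facilitate further research and benchmarking.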

Link: https://arxiv.org/abs/2502.15176
Authors: Arpan Mahara,Naphtali Rishe
Affiliations: Knight Foundation School of Computing and Information Sciences, Florida International University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages, 4 Figures, 10 Tables

Abstract:The proliferation of generative models, such as Generative Adversarial Networks (GANs), Diffusion Models, and Variational Autoencoders (VAEs), has enabled the synthesis of high-quality multimedia data. However, these advancements have also raised significant concerns regarding adversarial attacks, unethical usage, and societal harm. Recognizing these challenges, researchers have increasingly focused on developing methodologies to detect synthesized data effectively, aiming to mitigate potential risks. Prior reviews have primarily focused on deepfake detection and often lack coverage of recent advancements in synthetic image detection, particularly methods leveraging multimodal frameworks for improved forensic analysis. To address this gap, the present survey provides a comprehensive review of state-of-the-art methods for detecting and classifying synthetic images generated by advanced generative AI models. This review systematically examines core detection methodologies, identifies commonalities among approaches, and categorizes them into meaningful taxonomies. Furthermore, given the crucial role of large-scale datasets in this field, we present an overview of publicly available datasets that facilitate further research and benchmarking in synthetic data detection.

[CV-44] M3-AGIQA: Multimodal Multi-Round Multi-Aspect AI-Generated Image Quality Assessment

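[Quick read]: This paper addresses the multi-dimensional challenges of assessing the quality of AI-generated image (AGI) models, including perceptual quality, prompt correspondence, and authenticity. It proposes a comprehensive assessment framework named M3-AGIQA, whose key idea is to use Multimodal Large Language Models (MLLMs) as joint text and image encoders and to distill advanced captioning capabilities from online MLLMs into a local model via Low-Rank Adaptation (LoRA) fine-tuning. The framework also adopts a structured multi-round evaluation mechanism, combined with a predictor built from an xLSTM and a regression head that processes sequential logits and predicts Mean Opinion Scores (MOSs), effectively capturing nuanced aspects of AGI quality.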

Link: https://arxiv.org/abs/2502.15167
Authors: Chuan Cui,Kejiang Chen,Zhihua Wei,Wen Shen,Weiming Zhang,Nenghai Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures. This work has been submitted to the IEEE for possible publication

Abstract:The rapid advancement of AI-generated image (AGI) models has introduced significant challenges in evaluating their quality, which requires considering multiple dimensions such as perceptual quality, prompt correspondence, and authenticity. To address these challenges, we propose M3-AGIQA, a comprehensive framework for AGI quality assessment that is Multimodal, Multi-Round, and Multi-Aspect. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) as joint text and image encoders and distills advanced captioning capabilities from online MLLMs into a local model via Low-Rank Adaptation (LoRA) fine-tuning. The framework includes a structured multi-round evaluation mechanism, where intermediate image descriptions are generated to provide deeper insights into the quality, correspondence, and authenticity aspects. To align predictions with human perceptual judgments, a predictor constructed by an xLSTM and a regression head is incorporated to process sequential logits and predict Mean Opinion Scores (MOSs). Extensive experiments conducted on multiple benchmark datasets demonstrate that M3-AGIQA achieves state-of-the-art performance, effectively capturing nuanced aspects of AGI quality. Furthermore, cross-dataset validation confirms its strong generalizability. The code is available at this https URL.

[CV-45] HOpenCls: Training Hyperspectral Image Open-Set Classifiers in Their Living Environments

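[Quick read]: This paper addresses the challenge that hyperspectral image (HSI) open-set classification faces in real deployment environments, where a classifier must simultaneously classify known classes and reject unknown classes. Existing methods rely on auxiliary unknown-class data that is assumed to be completely separable from known classes and requires labor-intensive annotation, which limits them. To address this, the paper proposes a new framework, HOpenCls, that leverages unlabeled wild data, i.e., a mixture of known and unknown classes. The key innovation is reformulating open-set HSI classification as a positive-unlabeled (PU) learning problem, with a multi-label strategy bridging PU learning and open-set HSI classification. In addition, a gradient contraction and gradient expansion module, motivated by the observation of abnormal gradient weights associated with wild data, makes this PU learning problem tractable. Experimental results show that incorporating wild data has the potential to significantly enhance open-set HSI classification in complex real-world scenarios.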

Link: https://arxiv.org/abs/2502.15163
Authors: Hengwei Zhao,Xinyu Wang,Zhuo Zheng,Jingtao Li,Yanfei Zhong
Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, China; Department of Computer Science, Stanford University, United States; School of Remote Sensing and Information Engineering, Wuhan University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral image (HSI) open-set classification is critical for HSI classification models deployed in real-world environments, where classifiers must simultaneously classify known classes and reject unknown classes. Recent methods utilize auxiliary unknown-class data to improve classification performance. However, the auxiliary unknown-class data is strongly assumed to be completely separable from known classes and requires labor-intensive annotation. To address this limitation, this paper proposes a novel framework, HOpenCls, to leverage unlabeled wild data, that is, a mixture of known and unknown classes. Such wild data is abundant and can be collected freely while deploying classifiers in their living environments. The key insight is reformulating the open-set HSI classification with unlabeled wild data as a positive-unlabeled (PU) learning problem. Specifically, a multi-label strategy is introduced to bridge PU learning and open-set HSI classification, and a gradient contraction and gradient expansion module is then proposed to make this PU learning problem tractable, based on the observation of abnormal gradient weights associated with wild data. Extensive experimental results demonstrate that incorporating wild data has the potential to significantly enhance open-set HSI classification in complex real-world scenarios.

[CV-46] Confidence-Weighted Boundary-Aware Learning for Semi-Supervised Semantic Segmentation

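[Quick read]: This paper addresses several key challenges in semi-supervised semantic segmentation (SSSS): coupling, confirmation bias, and boundary blur. To address them, it proposes a new framework named CW-BASS. The key is to reduce coupling with a confidence-weighted loss function, mitigate confirmation bias with a dynamic thresholding mechanism, and resolve boundary blur with a boundary-aware module. In addition, a confidence decay strategy progressively refines pseudo-labels to reduce label noise. Together these methods improve semi-supervised semantic segmentation under limited labeled data.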

Link: https://arxiv.org/abs/2502.15152
Authors: Ebenezer Tarubinga,Jenifer Kalafatovich Espinoza
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures

Abstract:Semi-supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilising unlabeled data alongside limited labeled samples. Existing SSSS methods often face challenges such as coupling, where over-reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by insufficient boundary-awareness and ambiguous edge information. To address these issues, we propose CW-BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo-labels. Additionally, we leverage boundary-delineation techniques, which, despite being extensively explored in weakly-supervised semantic segmentation (WSSS) remain under-explored in SSSS. Specifically, our approach: (1) reduces coupling through a confidence-weighted loss function that adjusts the influence of pseudo-labels based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo-labels based on model performance, (3) resolves boundary blur with a boundary-aware module that enhances segmentation accuracy near object boundaries, and (4) reduces label noise with a confidence decay strategy that progressively refines pseudo-labels during training. Extensive experiments on the Pascal VOC 2012 and Cityscapes demonstrate that our method achieves state-of-the-art performance. Moreover, using only 1/8 or 12.5% of labeled data, our method achieves a mIoU of 75.81 on Pascal VOC 2012, highlighting its effectiveness in limited-label settings.
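
The confidence-weighted loss in point (1) can be illustrated with a small sketch. Everything here is an assumption for illustration: the teacher/student split, the function name, and the fixed `threshold` (standing in for the paper's learned dynamic threshold) are hypothetical, and this is not the CW-BASS implementation.

```python
import math


def confidence_weighted_ce(teacher_probs, student_probs, threshold=0.7):
    """Cross-entropy on pseudo-labels, weighted by the teacher's confidence.

    teacher_probs / student_probs: per-pixel lists of class probabilities.
    Pixels whose teacher confidence falls below `threshold` are skipped.
    """
    weighted_loss, weight_sum = 0.0, 0.0
    for t, s in zip(teacher_probs, student_probs):
        confidence = max(t)
        pseudo_label = t.index(confidence)   # argmax class as the pseudo-label
        if confidence < threshold:
            continue                         # filter out unreliable pseudo-labels
        weighted_loss += confidence * -math.log(s[pseudo_label])
        weight_sum += confidence
    # Normalize by total weight; return 0.0 when every pixel was filtered out.
    return weighted_loss / weight_sum if weight_sum > 0 else 0.0


# Hypothetical two-pixel, two-class example: only the first pixel passes the threshold.
teacher = [[0.9, 0.1], [0.6, 0.4]]
student = [[0.8, 0.2], [0.5, 0.5]]
print(confidence_weighted_ce(teacher, student))
```

Weighting by confidence means a pseudo-label the teacher is unsure about pulls on the student less than a confident one, which is the general idea behind down-weighting incorrect predictions.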

[CV-47] TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

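[Quick read]: This paper addresses the fact that training task-specific subquadratic architectures is resource-intensive and time-consuming. The key to the solution is TransMamba, a cross-architecture training method that uses a two-stage strategy to speed up the training of new Mamba models, and that introduces a Weight Subcloning and Adaptive Bidirectional distillation (WSAB) method and a cross-modal learning module to effectively transfer the knowledge of existing Transformer models to the Mamba architecture, improving its performance on tasks such as image classification, visual question answering, and text-video retrieval.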

Link: https://arxiv.org/abs/2502.15130
Authors: Xiuwei Chen,Sihao Lin,Xiao Dong,Zisheng Chen,Meng Cao,Jianhua Han,Hang Xu,Xiaodan Liang
Affiliations: Sun Yat-sen University; Royal Melbourne Institute of Technology; Mohamed bin Zayed University of Artificial Intelligence; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transformers have been favored in both uni-modal and multi-modal foundation models for their flexible scalability in attention modules. Consequently, a number of pre-trained Transformer models, e.g., LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced subquadratic architectures like Mamba, which enables global awareness with linear complexity. Nevertheless, training specialized subquadratic architectures from scratch for certain tasks is both resource-intensive and time-consuming. Motivated by this, we explore cross-architecture training to transfer the ready knowledge in existing Transformer models to the alternative Mamba architecture, an approach we term TransMamba. Our approach employs a two-stage strategy to expedite training new Mamba models, ensuring effectiveness across uni-modal and cross-modal tasks. To handle architecture disparities, we project the intermediate features into an aligned latent space before transferring knowledge. On top of that, a Weight Subcloning and Adaptive Bidirectional distillation method (WSAB) is introduced for knowledge transfer without limitations on varying layer counts. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba’s visual features, enhancing the cross-modal interaction capabilities of the Mamba architecture. Despite using less than 75% of the training data typically required for training from scratch, TransMamba boasts substantially stronger performance across various network architectures and downstream tasks, including image classification, visual question answering, and text-video retrieval. The code will be publicly available.

[CV-48] DAM-Seg: Anatomically accurate cardiac segmentation using Dense Associative Networks

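[Quick read]: This paper addresses anatomically incorrect predictions in deep learning-based cardiac segmentation. Conventional methods introduce auxiliary modules that post-process segmentation outputs or enforce consistency between specific points, but these approaches often increase network complexity, require additional training, and may lack robustness when visibility is poor. The key solution is a transformer-based architecture that uses dense associative networks to learn and retain patterns inherent to cardiac inputs. The method restricts the network to memorizing a limited set of patterns and uses a weighted sum of them during forward propagation to enforce anatomical correctness in the output, which improves robustness in low-visibility cases.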

Link: https://arxiv.org/abs/2502.15128
Authors: Zahid Ullah,Jihie Kim
Affiliations: Dongguk University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures, 5 tables

Abstract:Deep learning-based cardiac segmentation has seen significant advancements over the years. Many studies have tackled the challenge of anatomically incorrect segmentation predictions by introducing auxiliary modules. These modules either post-process segmentation outputs or enforce consistency between specific points to ensure anatomical correctness. However, such approaches often increase network complexity, require separate training for these modules, and may lack robustness in scenarios with poor visibility. To address these limitations, we propose a novel transformer-based architecture that leverages dense associative networks to learn and retain specific patterns inherent to cardiac inputs. Unlike traditional methods, our approach restricts the network to memorize a limited set of patterns. During forward propagation, a weighted sum of these patterns is used to enforce anatomical correctness in the output. Since these patterns are input-independent, the model demonstrates enhanced robustness, even in cases with poor visibility. The proposed pipeline was evaluated on two publicly available datasets, CAMUS and CardiacNet. Experimental results indicate that our model consistently outperforms baseline approaches across all metrics, highlighting its effectiveness and reliability for cardiac segmentation tasks.
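
The idea of producing an output as a weighted sum of a small set of stored patterns can be sketched as below. This is a generic illustration that assumes the softmax retrieval rule commonly used with dense associative memories; the abstract does not specify the weighting, and the names `retrieve` and `beta` are hypothetical rather than taken from DAM-Seg.

```python
import math


def retrieve(query, patterns, beta=1.0):
    """Return a softmax-weighted sum of stored patterns, plus the weights used."""
    # Similarity of the query to each stored pattern (dot product, scaled by beta).
    scores = [beta * sum(q * x for q, x in zip(query, p)) for p in patterns]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # The output is always a convex combination of the stored patterns.
    dim = len(patterns[0])
    output = [sum(w * p[i] for w, p in zip(weights, patterns)) for i in range(dim)]
    return output, weights


# Hypothetical memory of two patterns; a sharp beta snaps to the closest one.
memory = [[1.0, 0.0], [0.0, 1.0]]
out, w = retrieve([1.0, 0.0], memory, beta=50.0)
print(out, w)
```

Because the output is constrained to combinations of stored patterns, a noisy or partially occluded query still maps to something built from known shapes, which matches the robustness argument made in the abstract.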

[CV-49] CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models

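[Quick read]: This paper addresses the challenge of ensuring safety when autonomous driving systems handle rare but potentially catastrophic safety-critical scenarios. Existing research has explored generating such scenarios for testing, but there is limited work on effectively incorporating them into policy learning to enhance safety. Training curricula that adapt to an autonomous vehicle's (AV's) behavioral patterns and performance bottlenecks also remain largely unexplored.

The key solution is the CurricuVLM framework, which leverages Vision-Language Models (VLMs) to enable personalized curriculum learning. The approach uses VLMs' multimodal understanding capabilities to analyze agent behavior, identify performance weaknesses, and dynamically generate tailored training scenarios for curriculum adaptation. Through comprehensive analysis of unsafe driving situations with narrative descriptions, CurricuVLM performs in-depth reasoning to evaluate the AV's capabilities and identify critical behavioral patterns. The framework then synthesizes targeted training scenarios that address these identified limitations, enabling effective and personalized curriculum learning.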

Link: https://arxiv.org/abs/2502.15119
Authors: Zihao Sheng,Zilin Huang,Yansong Qu,Yue Leng,Sruthi Bhavanam,Sikai Chen
Affiliations: University of Wisconsin-Madison; Purdue University; Google
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Ensuring safety in autonomous driving systems remains a critical challenge, particularly in handling rare but potentially catastrophic safety-critical scenarios. While existing research has explored generating safety-critical scenarios for autonomous vehicle (AV) testing, there is limited work on effectively incorporating these scenarios into policy learning to enhance safety. Furthermore, developing training curricula that adapt to an AV’s evolving behavioral patterns and performance bottlenecks remains largely unexplored. To address these challenges, we propose CurricuVLM, a novel framework that leverages Vision-Language Models (VLMs) to enable personalized curriculum learning for autonomous driving agents. Our approach uniquely exploits VLMs’ multimodal understanding capabilities to analyze agent behavior, identify performance weaknesses, and dynamically generate tailored training scenarios for curriculum adaptation. Through comprehensive analysis of unsafe driving situations with narrative descriptions, CurricuVLM performs in-depth reasoning to evaluate the AV’s capabilities and identify critical behavioral patterns. The framework then synthesizes customized training scenarios targeting these identified limitations, enabling effective and personalized curriculum learning. Extensive experiments on the Waymo Open Motion Dataset show that CurricuVLM outperforms state-of-the-art baselines across both regular and safety-critical scenarios, achieving superior performance in terms of navigation success, driving efficiency, and safety metrics. Further analysis reveals that CurricuVLM serves as a general approach that can be integrated with various RL algorithms to enhance autonomous driving systems. The code and demo video are available at: this https URL.

[CV-50] Assessing a Single Student's Concentration on Learning Platforms: A Machine Learning-Enhanced EEG-Based Framework

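[Quick read]: This paper addresses the classification of an individual student's concentration state in online learning environments. The key to the solution is a specialized data processing pipeline that includes training a custom-tailored machine learning model. Specifically, the method covers EEG data acquisition and preprocessing, statistical feature extraction from the alpha, beta, theta, delta, and gamma bands, feature selection, and hyperparameter fine-tuning to improve classification accuracy. Using EEG signals captured with a Muse headband (Gen 2) and a random forest model customized to the student's data, it achieves test accuracies of 97.6% in the computer-based learning setting and 98% in the virtual reality setting.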

Link: https://arxiv.org/abs/2502.15107
Authors: Zewen Zhuo,Mohamad Najafi,Hazem Zein,Amine Nait-Ali
Affiliations: University Paris-Est Créteil; LISSI Laboratory
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study introduces a specialized pipeline designed to classify the concentration state of an individual student during online learning sessions by training a custom-tailored machine learning model. Detailed protocols for acquiring and preprocessing EEG data are outlined, along with the extraction of fifty statistical features from five EEG signal bands: alpha, beta, theta, delta, and gamma. Following feature extraction, a thorough feature selection process was conducted to optimize the data inputs for a personalized analysis. The study also explores the benefits of hyperparameter fine-tuning to enhance the classification accuracy of the student’s concentration state. EEG signals were captured from the student using a Muse headband (Gen 2), equipped with five electrodes (TP9, AF7, AF8, TP10, and a reference electrode NZ), during engagement with educational content on computer-based e-learning platforms. Employing a random forest model customized to the student’s data, we achieved remarkable classification performance, with test accuracies of 97.6% in the computer-based learning setting and 98% in the virtual reality setting. These results underscore the effectiveness of our approach in delivering personalized insights into student concentration during online educational activities.

[CV-51] Hardware-Friendly Static Quantization Method for Video Diffusion Transformers

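[Quick read]: This paper addresses the efficient deployment of video diffusion transformer models on resource-constrained devices, especially where dynamic quantization cannot be used. The key solution is a novel post-training static quantization method for the OpenSora model that achieves video quality comparable to FP16 and to the dynamically quantized ViDiT-Q method, as measured by CLIP and VQA metrics. Specifically, the method uses per-step calibration data, applies channel-wise quantization to weights and tensor-wise quantization to activations, and further applies the smooth-quantization technique to obtain high-quality video outputs.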

Link: https://arxiv.org/abs/2502.15077
Authors: Sanghyun Yi,Qingfeng Liu,Mostafa El-Khamy
Affiliations: California Institute of Technology; Samsung Semiconductor Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion Transformers for video generation have gained significant research interest since the impressive performance of SORA. Efficient deployment of such generative-AI models on GPUs has been demonstrated with dynamic quantization. However, resource-constrained devices cannot support dynamic quantization, and need static quantization of the models for their efficient deployment on AI processors. In this paper, we propose a novel method for the post-training quantization of OpenSora, a Video Diffusion Transformer, without relying on dynamic quantization techniques. Our approach employs static quantization, achieving video quality comparable to FP16 and dynamically quantized ViDiT-Q methods, as measured by CLIP and VQA metrics. In particular, we utilize per-step calibration data to adequately provide a post-training statically quantized model for each time step, incorporating channel-wise quantization for weights and tensor-wise quantization for activations. By further applying the smooth-quantization technique, we can obtain high-quality video outputs with the statically quantized models. Extensive experimental results demonstrate that static quantization can be a viable alternative to dynamic quantization for video diffusion transformers, offering a more efficient approach without sacrificing performance.
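
The distinction between channel-wise weight scales and a single tensor-wise activation scale fixed ahead of time from calibration data can be sketched as follows. This is an illustrative toy assuming symmetric int8 quantization with max-abs calibration; the helper names are hypothetical, and the paper's per-step calibration and smooth-quantization details are not reproduced.

```python
def qmax_for(bits):
    """Largest magnitude representable in symmetric signed quantization."""
    return 2 ** (bits - 1) - 1


def quantize_symmetric(values, scale, bits=8):
    """Round values / scale to integers and clip to the symmetric range."""
    qmax = qmax_for(bits)
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def channelwise_weight_scales(weight, bits=8):
    """One scale per output channel (row), from that row's max absolute value."""
    qmax = qmax_for(bits)
    scales = []
    for row in weight:
        peak = max(abs(v) for v in row)
        scales.append(peak / qmax if peak > 0 else 1.0)
    return scales


def static_activation_scale(calibration_batches, bits=8):
    """One scale for the whole tensor, fixed offline from calibration data."""
    peak = max(abs(v) for batch in calibration_batches for v in batch)
    return peak / qmax_for(bits) if peak > 0 else 1.0


# Hypothetical weights and calibration activations.
weight = [[127.0, -64.0], [2.0, 1.0]]
w_scales = channelwise_weight_scales(weight)
q_weight = [quantize_symmetric(row, s) for row, s in zip(weight, w_scales)]

# A per-step variant would keep one such scale for each diffusion time step.
act_scale = static_activation_scale([[0.5, -2.54], [1.0, 2.0]])
print(q_weight[0], quantize_symmetric([10.0], act_scale))
```

With static quantization the activation scale never changes at inference time, so values outside the calibrated range are clipped. Dynamic quantization avoids that clipping by recomputing the scale per input, at a runtime cost that the abstract says resource-constrained devices cannot support.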

[CV-52] Synth It Like KITTI: Synthetic Data Generation for Object Detection in Driving Scenarios

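[Quick Read]: This paper addresses the poor transferability between the virtual and real worlds for 3D object detection on LiDAR point clouds. The key to the solution is a data generation pipeline based on the CARLA simulator: using domain randomization strategies and careful modeling, an object detector trained on the synthetic data generalizes strongly to the KITTI dataset. The paper also compares different virtual sensor variants to study which sensor attributes may be responsible for the domain gap. Finally, fine-tuning with a small portion of real data almost matches the baseline, while fine-tuning with the full training set slightly surpasses it.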

Link: https://arxiv.org/abs/2502.15076
Authors: Richard Marcus, Christian Vogel, Inga Jatzkowski, Niklas Knoop, Marc Stamminger
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint, to appear in ROBOVIS 2025

Abstract:An important factor in advancing autonomous driving systems is simulation. Yet progress on transferability between the virtual and real worlds has been rather limited. We revisit this problem for 3D object detection on LiDAR point clouds and propose a dataset generation pipeline based on the CARLA simulator. Utilizing domain randomization strategies and careful modeling, we are able to train an object detector on the synthetic data and demonstrate strong generalization capabilities to the KITTI dataset. Furthermore, we compare different virtual sensor variants to gain insight into which sensor attributes can be responsible for the prevalent domain gap. Finally, fine-tuning with a small portion of real data almost matches the baseline, and fine-tuning with the full training set slightly surpasses it.

[CV-53] Fostering Inclusion: A Virtual Reality Experience to Raise Awareness of Dyslexia-Related Barriers in University Settings

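[Quick Read]: This paper addresses how to improve the understanding and awareness of dyslexia (developmental dyslexia) among non-dyslexic individuals, so as to better promote the inclusion of dyslexic students in university settings. The key to the solution is the design and implementation of a virtual reality (VR) experience that lets participants experience firsthand the challenges dyslexic students face, including difficulty reading, incorrect directional signs, and a lack of assistance, thereby raising their awareness of and empathy for dyslexia-related difficulties.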

Link: https://arxiv.org/abs/2502.15039
Authors: José Manuel Alcalde-Llergo, Pilar Aparicio-Martínez, Andrea Zingoni, Sara Pinzi, Enrique Yeguas-Bolívar
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 9 figures

Abstract:This work introduces the design, implementation, and validation of a virtual reality (VR) experience aimed at promoting the inclusion of individuals with dyslexia in university settings. Unlike traditional awareness methods, this immersive approach offers a novel way to foster empathy by allowing participants to experience firsthand the challenges faced by students with dyslexia. Specifically, the experience raises awareness by exposing non-dyslexic individuals to the difficulties commonly encountered by dyslexic students. In the virtual environment, participants explore a virtual campus with multiple buildings, navigating between them while completing tasks and simultaneously encountering barriers that simulate some of the challenges faced by individuals with dyslexia. These barriers include reading signs with shifting letters, following directional arrows that may point incorrectly, and dealing with a lack of assistance. The campus is a comprehensive model featuring both indoor and outdoor spaces and supporting various modes of locomotion. To validate the experience, more than 30 non-dyslexic participants from the university environment, mainly professors and students, evaluated it through ad hoc satisfaction surveys. The results indicated heightened awareness of the barriers encountered by students with dyslexia, with participants deeming the experience a valuable tool for increasing visibility and fostering understanding of dyslexic students.

[CV-54] Simpler Fast Vision Transformers with a Jumbo CLS Token

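[Quick Read]: This paper aims to improve accuracy while maintaining throughput by enhancing the global processing of Vision Transformers (ViTs). The key to the solution is a method called Jumbo, which creates a wider class (CLS) token, splits it to match the patch token width before attention, processes it with self-attention, and reassembles it. After attention, Jumbo applies a dedicated, wider feed-forward network (FFN) to this token. The approach significantly improves ImageNet-1K results, by 3.2% for ViT-tiny and 13.5% for ViT-nano, while preserving the advantages of the plain ViT architecture.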

Link: https://arxiv.org/abs/2502.15021
Authors: Anthony Fuller, Yousef Yassin, Daniel G. Kyrollos, Evan Shelhamer, James R. Green
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce a simple enhancement to the global processing of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Jumbo significantly improves over ViT+Registers on ImageNet-1K at high speeds (by 3.2% for ViT-tiny and 13.5% for ViT-nano); these Jumbo models even outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs. Although Jumbo sees no gains for ViT-small on ImageNet-1K, it gains 3.4% on ImageNet-21K over ViT+Registers. Both findings indicate that Jumbo is most helpful when the ViT is otherwise too narrow for the task. Finally, we show that Jumbo can be easily adapted to excel on data beyond images, e.g., time series.
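
The split, attend, and reassemble flow described above can be sketched in a few lines of NumPy. This is only an illustration of the data movement, under several assumptions that are not from the paper: the jumbo token is `J` times the patch width, attention is single-head with identity projections, and the dedicated FFN is a residual ReLU MLP with random weights.

```python
import numpy as np

def split_jumbo(cls_jumbo, width):
    """Split a (batch, J*width) jumbo CLS token into (batch, J, width) tokens."""
    batch, total = cls_jumbo.shape
    assert total % width == 0, "jumbo width must be a multiple of patch width"
    return cls_jumbo.reshape(batch, total // width, width)

def merge_jumbo(cls_tokens):
    """Reassemble (batch, J, width) tokens into one (batch, J*width) token."""
    batch, j, width = cls_tokens.shape
    return cls_tokens.reshape(batch, j * width)

def self_attention(x):
    """Single-head self-attention with identity projections (illustrative)."""
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
batch, num_patches, width, J = 2, 16, 32, 4
patches = rng.normal(size=(batch, num_patches, width))
cls_jumbo = rng.normal(size=(batch, J * width))

# Before attention: split the wide token so it matches the patch width.
cls_split = split_jumbo(cls_jumbo, width)
tokens = np.concatenate([cls_split, patches], axis=1)
attended = self_attention(tokens)

# After attention: reassemble, then apply a dedicated wider FFN to it.
cls_out = merge_jumbo(attended[:, :J])
patches_out = attended[:, J:]
hidden = 4 * J * width
w1 = rng.normal(scale=0.02, size=(J * width, hidden))
w2 = rng.normal(scale=0.02, size=(hidden, J * width))
cls_out = cls_out + np.maximum(cls_out @ w1, 0.0) @ w2
```

Because the split tokens have the same width as patch tokens, the attention layer itself needs no change; only the FFN applied to the reassembled token is wider.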

[CV-55] CrossOver: 3D Scene Cross-Modal Alignment

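[Quick Read]: This paper addresses the common assumptions in multi-modal 3D scene understanding of complete data availability and rigid alignment across modalities. The key to the solution is the CrossOver framework, which uses flexible, scene-level modality alignment to learn a unified, modality-agnostic embedding space that supports robust scene retrieval and object localization even when modalities are missing. The method relies on dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors to align modalities without explicit object semantics.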

Link: https://arxiv.org/abs/2502.15011
Authors: Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni
Affiliations: Stanford University; ETH Zurich; Microsoft Spatial AI Lab, Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this http URL

Abstract:Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting adaptability for real-world applications in 3D scene understanding.

[CV-56] Digital implementations of deep feature extractors are intrinsically informative

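[Quick Read]: This paper studies rapid information (energy) propagation in deep feature extractors, which is needed to balance computational complexity against the expressiveness of the input representation. The key contribution is a proof of an upper bound on the speed of energy propagation within a unified framework covering different neural network models over both Euclidean and non-Euclidean domains. In addition, extra structural information about the signal domain can be used to explicitly determine or improve the rate of energy decay. The paper shows global exponential energy decay for feature extractors with discrete-domain input signals and for convolutional neural networks (CNNs) via scattering over locally compact abelian (LCA) groups.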

Link: https://arxiv.org/abs/2502.15004
Authors: Max Getter
Affiliations: RWTH Aachen University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Functional Analysis (math.FA)
Comments: 6 pages

Abstract:Rapid information (energy) propagation in deep feature extractors is crucial to balance computational complexity versus expressiveness as a representation of the input. We prove an upper bound for the speed of energy propagation in a unified framework that covers different neural network models, both over Euclidean and non-Euclidean domains. Additional structural information about the signal domain can be used to explicitly determine or improve the rate of decay. To illustrate this, we show global exponential energy decay for a range of 1) feature extractors with discrete-domain input signals, and 2) convolutional neural networks (CNNs) via scattering over locally compact abelian (LCA) groups.

[CV-57] A Rapid Test for Accuracy and Bias of Face Recognition Technology WACV2025

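[Quick Read]: This paper addresses the time and cost of evaluating the accuracy of face recognition (FR) systems. The key to the solution is a novel method for 1:1 face verification that benchmarks FR systems quickly and without manual annotation. It starts from approximate labels (for example, from web search results) and leverages the embedding representation of the models being evaluated, achieving high accuracy on smaller test datasets. The approach significantly reduces the time and cost of manual labeling, and the paper presents the first public benchmark of five FR cloud services, revealing lower accuracy for particular groups such as Asian women.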

Link: https://arxiv.org/abs/2502.14996
Authors: Manuel Knott, Ignacio Serna, Ethan Mann, Pietro Perona
Affiliations: California Institute of Technology; Max Planck Institute for Human Development
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted as a conference paper for WACV 2025. Manuel Knott, Ignacio Serna, and Ethan Mann contributed equally

Abstract:Measuring the accuracy of face recognition (FR) systems is essential for improving performance and ensuring responsible use. Accuracy is typically estimated using large annotated datasets, which are costly and difficult to obtain. We propose a novel method for 1:1 face verification that benchmarks FR systems quickly and without manual annotation, starting from approximate labels (e.g., from web search results). Unlike previous methods for training set label cleaning, ours leverages the embedding representation of the models being evaluated, achieving high accuracy in smaller-sized test datasets. Our approach reliably estimates FR accuracy and ranking, significantly reducing the time and cost of manual labeling. We also introduce the first public benchmark of five FR cloud services, revealing demographic biases, particularly lower accuracy for Asian women. Our rapid test method can democratize FR testing, promoting scrutiny and responsible use of the technology. Our method is provided as a publicly accessible tool at this https URL
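
For readers unfamiliar with the 1:1 verification setting, the sketch below shows how a pair of face embeddings is typically scored and how accuracy is computed against pair labels. It is generic background only, not the paper's label-cleaning method: the cosine-similarity score, the fixed threshold, and the toy 2-D embeddings are all assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, d) embedding arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def verify_pairs(emb_a, emb_b, threshold):
    """Declare a pair 'same identity' when its similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

def verification_accuracy(emb_a, emb_b, same_identity, threshold):
    """Fraction of pairs whose decision agrees with the (possibly noisy) label."""
    pred = verify_pairs(emb_a, emb_b, threshold)
    return float(np.mean(pred == same_identity))

# Toy embeddings standing in for the output of the FR model under test.
emb_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
emb_b = np.array([[2.0, 0.1], [1.0, 0.0], [1.0, 0.9]])
labels = np.array([True, False, True])

acc = verification_accuracy(emb_a, emb_b, labels, threshold=0.5)
print(acc)  # 1.0 for these toy pairs
```

With approximate labels, some entries of `labels` would be wrong, which is the problem the paper's method is designed to handle before accuracy is estimated.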

[CV-58] LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection

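[Quick Read]: This paper addresses the detection of AI-generated video content. Existing work concentrates on the image domain (for example, deepfake detection), while detection in the video domain has not been fully explored. The key to the solution is LAVID, a novel AI-generated video detection method based on Large Vision Language Models (LVLMs) that improves detection through explicit knowledge enhancement. The method trains no additional detector; instead, the LVLM calls external tools to extract information and refines its reasoning through a structured prompt adjusted by self-rewriting. This overcomes limitations of traditional deep learning methods in transparency and in recognizing new artifacts.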

Link: https://arxiv.org/abs/2502.14994
Authors: Qingyuan Liu, Yun-Yun Tsai, Ruijian Zha, Victoria Li, Pengyuan Shi, Chengzhi Mao, Junfeng Yang
Affiliations: Columbia University; Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. AI-generated content detection has been widely studied in the image field (e.g., deepfake), yet the video field remains unexplored. Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection thanks to their strong reasoning and multimodal capabilities. They overcome limitations of traditional deep-learning-based methods, such as a lack of transparency and an inability to recognize new artifacts. Motivated by this, we propose LAVID, a novel LVLM-based AI-generated video detection method with explicit knowledge enhancement. Our insights are as follows: (1) The leading LVLMs can call external tools to extract useful information to facilitate their own video detection task; (2) Structuring the prompt can affect an LVLM’s ability to reason about and interpret information in video content. Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structured prompt by self-rewriting. Different from prior SOTA that trains additional detectors, our method is fully training-free and only requires inference of the LVLM for detection. To facilitate our research, we also create a new benchmark \vidfor with high-quality videos generated from multiple sources of video generation tools. Evaluation results show that LAVID improves F1 scores by 6.2 to 30.2% over the top baselines on our datasets across four SOTA LVLMs.

[CV-59] Ultra-High-Frequency Harmony: mmWave Radar and Event Camera Orchestrate Accurate Drone Landing

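[Quick Read]: This paper addresses real-time, accurate localization and guidance of drones by ground platforms for precise, efficient, and safe landing. The mismatch in sampling frequency between traditional frame cameras and mmWave radar limits system throughput. The key to the solution is to introduce an event camera whose sampling frequency matches that of mmWave radar, and to propose mmE-Loc, a high-precision, low-latency ground localization system. To fully exploit the temporal consistency and spatial complementarity between these modalities, the paper proposes two modules, consistency-instructed collaborative tracking and graph-informed adaptive joint optimization, for accurate drone measurement extraction and efficient sensor fusion.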

Link: https://arxiv.org/abs/2502.14992
Authors: Haoyang Wang, Jingao Xu, Xinyu Luo, Xuecheng Chen, Ting Zhang, Ruiyang Duan, Yunhao Liu, Xinlei Chen
Affiliations: Shenzhen International Graduate School, Tsinghua University; Carnegie Mellon University; Meituan Academy of Robotics Shenzhen; School of Software, Tsinghua University; Pengcheng Laboratory, Shenzhen; RISC-V International Open Source Laboratory, Shenzhen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by ACM SenSys 2025

Abstract:For precise, efficient, and safe drone landings, ground platforms should accurately locate descending drones in real time and guide them to designated spots. While mmWave sensing combined with cameras improves localization accuracy, the lower sampling frequency of traditional frame cameras compared to mmWave radar creates bottlenecks in system throughput. In this work, we replace the traditional frame camera with an event camera, a novel sensor that harmonizes in sampling frequency with mmWave radar within the ground platform setup, and introduce mmE-Loc, a high-precision, low-latency ground localization system designed for drone landings. To fully leverage the temporal consistency and spatial complementarity between these modalities, we propose two innovative modules, consistency-instructed collaborative tracking and graph-informed adaptive joint optimization, for accurate drone measurement extraction and efficient sensor fusion. Extensive real-world experiments in landing scenarios from a leading drone delivery company demonstrate that mmE-Loc outperforms state-of-the-art methods in both localization accuracy and latency.

[CV-60] Few-shot Species Range Estimation

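[Quick Read]: This paper addresses the challenge of estimating the spatial range of a species, particularly when only a limited number of observation records are available. The key to the solution is a new few-shot species range estimation approach: at inference time, the model takes a set of spatial locations as input, along with optional metadata such as text or an image, and outputs a species encoding that is used to predict the range of a previously unseen species. The method is validated on two challenging benchmarks, where it achieves state-of-the-art range estimation performance in a fraction of the compute time.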

Link: https://arxiv.org/abs/2502.14977
Authors: Christian Lange, Max Hamilton, Elijah Cole, Alexander Shepard, Samuel Heinrich, Angela Zhu, Subhransu Maji, Grant Van Horn, Oisin Mac Aodha
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Knowing where a particular species can or cannot be found on Earth is crucial for ecological research and conservation efforts. By mapping the spatial ranges of all species, we would obtain deeper insights into how global biodiversity is affected by climate change and habitat loss. However, accurate range estimates are only available for a relatively small proportion of all known species. For the majority of the remaining species, we often only have a small number of records denoting the spatial locations where they have previously been observed. We outline a new approach for few-shot species range estimation to address the challenge of accurately estimating the range of a species from limited data. During inference, our model takes a set of spatial locations as input, along with optional metadata such as text or an image, and outputs a species encoding that can be used to predict the range of a previously unseen species in a feed-forward manner. We validate our method on two challenging benchmarks, where we obtain state-of-the-art range estimation performance, in a fraction of the compute time, compared to recent alternative approaches.

[CV-61] EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models

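[Quick Read]: This paper addresses adversarial attacks on Vision-Language Models (VLMs), whose vulnerabilities are inherited from Large Language Models (LLMs) and exacerbated by their multimodal nature. Existing defenses such as adversarial training, input transformations, and heuristic detection are computationally expensive, architecture-dependent, and fragile against adaptive attacks. The key to the solution is EigenShield, an inference-time defense based on Random Matrix Theory. EigenShield identifies adversarial disruptions by detecting structured spectral deviations, and uses a Robustness-based Nonconformity Score (RbNS) with quantile-based thresholding to separate causal eigenvectors, which encode semantic information, from correlational eigenvectors that are susceptible to adversarial perturbations. By projecting embeddings onto the causal subspace, EigenShield filters adversarial noise without modifying model parameters or requiring adversarial training. The approach is architecture-independent and attack-agnostic, and it significantly reduces the attack success rate.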

Link: https://arxiv.org/abs/2502.14976
Authors: Nastaran Darabi, Devashri Naik, Sina Tayebati, Dinithi Jayasuriya, Ranganath Krishnan, Amit Ranjan Trivedi
Affiliations: Department of Electrical and Computer Engineering, University of Illinois at Chicago, IL, USA; Intel Labs, Hillsboro, OR, USA; Department of Mathematics, University of Michigan
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) inherit adversarial vulnerabilities of Large Language Models (LLMs), which are further exacerbated by their multimodal nature. Existing defenses, including adversarial training, input transformations, and heuristic detection, are computationally expensive, architecture-dependent, and fragile against adaptive attacks. We introduce EigenShield, an inference-time defense leveraging Random Matrix Theory to quantify adversarial disruptions in high-dimensional VLM representations. Unlike prior methods that rely on empirical heuristics, EigenShield employs the spiked covariance model to detect structured spectral deviations. Using a Robustness-based Nonconformity Score (RbNS) and quantile-based thresholding, it separates causal eigenvectors, which encode semantic information, from correlational eigenvectors that are susceptible to adversarial artifacts. By projecting embeddings onto the causal subspace, EigenShield filters adversarial noise without modifying model parameters or requiring adversarial training. This architecture-independent, attack-agnostic approach significantly reduces the attack success rate, establishing spectral analysis as a principled alternative to conventional defenses. Our results demonstrate that EigenShield consistently outperforms all existing defenses, including adversarial training, UNIGUARD, and CIDER.
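
The final filtering step, projecting embeddings onto a subspace spanned by selected eigenvectors, can be sketched as below. Note the substitution: the abstract selects "causal" eigenvectors with RbNS and quantile thresholding, whose details are not given here, so this sketch simply keeps the `k` leading eigenvectors of the sample covariance. The synthetic spiked data, `k`, and the function names are likewise assumptions.

```python
import numpy as np

def top_k_subspace(embeddings, k):
    """Return the k leading covariance eigenvectors (as columns) and eigenvalues."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order], eigvals[order]

def project_onto_subspace(x, basis):
    """Orthogonal projection of the rows of x onto span(basis)."""
    return (x @ basis) @ basis.T

rng = np.random.default_rng(0)
dim, n, k = 32, 500, 3

# Synthetic "spiked" data: k strong directions plus weak isotropic noise.
signal_dirs = np.linalg.qr(rng.normal(size=(dim, k)))[0]
latent = rng.normal(scale=5.0, size=(n, k))
clean = latent @ signal_dirs.T + rng.normal(scale=0.1, size=(n, dim))

basis, spikes = top_k_subspace(clean, k)

# Stand-in for an adversarial perturbation: broadband noise on a few samples.
perturbed = clean[:5] + rng.normal(scale=0.5, size=(5, dim))
filtered = project_onto_subspace(perturbed, basis)
```

Since the projection only multiplies embeddings by a fixed matrix, it leaves model parameters untouched, which matches the inference-time nature of the defense described above.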

[CV-62] Design of a Visual Pose Estimation Algorithm for Moon Landing

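[Quick Read]: This paper addresses the navigational drift caused by inertial sensors during pinpoint landing of a lunar spacecraft. To this end, it proposes a terrain absolute navigation method for estimating the spacecraft's position and attitude. The key to the algorithm is to use the positions of the craters below the spacecraft for estimation: craters seen by the onboard camera are identified against a crater database known beforehand, which allows the navigation solution to be corrected.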

Link: https://arxiv.org/abs/2502.14942
Authors: Atakan Süslü, Betül Rana Kuran, Halil Ersin Söken
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 6 pages, 8 figures, Presented in 11th Nano-Satellite Symposium

Abstract:In order to make a pinpoint landing on the Moon, the spacecraft’s navigation system must be accurate. To achieve the desired accuracy, navigational drift caused by the inertial sensors must be corrected. One way to correct this drift is to use absolute navigation solutions. In this study, a terrain absolute navigation method to estimate the spacecraft’s position and attitude is proposed. This algorithm uses the position of the craters below the spacecraft for estimation. Craters seen by the camera onboard the spacecraft are detected and identified using a crater database known beforehand. In order to focus on estimation algorithms, image processing and crater matching steps are skipped. The accuracy of the algorithm and the effect of the crater number used for estimation are inspected by performing simulations.

[CV-63] FacaDiffy: Inpainting Unseen Facade Parts Using Diffusion Models

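[Quick Read]: This paper addresses the problem that 2D conflict maps, which are used when creating high-detail semantic 3D building models, are often incomplete because of obstacles encountered during laser scanning. The key to the solution is FacaDiffy, which derives 2D conflict maps from existing 3D building models and the corresponding laser scanning point clouds, and then inpaints unseen facade parts with a personalized Stable Diffusion model to complete the conflict maps. To compensate for the scarcity of real-world training data, the paper also develops a scalable pipeline for generating synthetic conflict maps.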

Link: https://arxiv.org/abs/2502.14940
Authors: Thomas Froech, Olaf Wysocki, Yan Xia, Junyu Xie, Benedikt Schwab, Daniel Cremers, Thomas H. Kolbe
Affiliations: Technical University of Munich (TUM); University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for GeoSpatial Week 2025, ISPRS Annals

Abstract:High-detail semantic 3D building models are frequently utilized in robotics, geoinformatics, and computer vision. One key aspect of creating such models is employing 2D conflict maps that detect openings’ locations in building facades. Yet, in reality, these maps are often incomplete due to obstacles encountered during laser scanning. To address this challenge, we introduce FacaDiffy, a novel method for inpainting unseen facade parts by completing conflict maps with a personalized Stable Diffusion model. Specifically, we first propose a deterministic ray analysis approach to derive 2D conflict maps from existing 3D building models and corresponding laser scanning point clouds. Furthermore, we facilitate the inpainting of unseen facade objects into these 2D conflict maps by leveraging the potential of personalizing a Stable Diffusion model. To complement the scarcity of real-world training data, we also develop a scalable pipeline to produce synthetic conflict maps using random city model generators and annotated facade images. Extensive experiments demonstrate that FacaDiffy achieves state-of-the-art performance in conflict map completion compared to various inpainting baselines and increases the detection rate by 22% when applying the completed conflict maps for high-definition 3D semantic building reconstruction. The code is publicly available in the corresponding GitHub repository: this https URL

[CV-64] Online hand gesture recognition using Continual Graph Transformers

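[Quick Read]: This paper addresses the challenges of online continuous action recognition in real-time scenarios, in particular that existing methods focus on segment-based recognition and cannot meet the needs of real-time continuous recognition. The key to the solution is a hybrid architecture that combines Spatial Graph Convolutional Networks (S-GCN) with a Transformer-based Graph Encoder (TGE) to recognize real-time skeleton sequence streams. In addition, a continual learning mechanism improves the model's ability to adapt to evolving data distributions, ensuring robust recognition in dynamic environments.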

Link: https://arxiv.org/abs/2502.14939
Authors: Rim Slama, Wael Rabah, Hazem Wannous
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Online continuous action recognition has emerged as a critical research area due to its practical implications in real-world applications, such as human-computer interaction, healthcare, and robotics. Among various modalities, skeleton-based approaches have gained significant popularity, demonstrating their effectiveness in capturing 3D temporal data while ensuring robustness to environmental variations. However, most existing works focus on segment-based recognition, making them unsuitable for real-time, continuous recognition scenarios. In this paper, we propose a novel online recognition system designed for real-time skeleton sequence streaming. Our approach leverages a hybrid architecture combining Spatial Graph Convolutional Networks (S-GCN) for spatial feature extraction and a Transformer-based Graph Encoder (TGE) for capturing temporal dependencies across frames. Additionally, we introduce a continual learning mechanism to enhance model adaptability to evolving data distributions, ensuring robust recognition in dynamic environments. We evaluate our method on the SHREC’21 benchmark dataset, demonstrating its superior performance in online hand gesture recognition. Our approach not only achieves state-of-the-art accuracy but also significantly reduces false positive rates, making it a compelling solution for real-time applications. The proposed system can be seamlessly integrated into various domains, including human-robot collaboration and assistive technologies, where natural and intuitive interaction is crucial.

[CV-65] GS-Cache: A GS-Cache Inference Framework for Large-scale Gaussian Splatting Models

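[Quick Read]: This paper addresses the challenge of rendering large-scale 3D Gaussian Splatting (3DGS) models in real time with high fidelity on consumer-grade devices. The key to the solution is GS-Cache, an end-to-end framework that integrates the 3DGS representation with a highly optimized rendering system. GS-Cache introduces a cache-centric pipeline to eliminate redundant computation, an efficiency-aware scheduler for elastic multi-GPU rendering, and optimized CUDA kernels to overcome computational bottlenecks. Together these yield up to a 5.35x performance improvement, a 35% latency reduction, and 42% lower GPU memory usage, supporting 2K binocular rendering at over 120 FPS with high visual quality.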

Link: https://arxiv.org/abs/2502.14938
Authors: Miao Tao, Yuanzhen Zhou, Haoran Xu, Zeyu He, Zhenyu Yang, Yuchang Zhang, Zhongling Su, Linning Xu, Zhenxiang Ma, Rong Fu, Hengjie Li, Xingcheng Zhang, Jidong Zhai
Affiliations: Shanghai Artificial Intelligence Laboratory, China; Tsinghua University, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Rendering large-scale 3D Gaussian Splatting (3DGS) models faces significant challenges in achieving real-time, high-fidelity performance on consumer-grade devices. Fully realizing the potential of 3DGS in applications such as virtual reality (VR) requires addressing critical system-level challenges to support real-time, immersive experiences. We propose GS-Cache, an end-to-end framework that seamlessly integrates 3DGS’s advanced representation with a highly optimized rendering system. GS-Cache introduces a cache-centric pipeline to eliminate redundant computations, an efficiency-aware scheduler for elastic multi-GPU rendering, and optimized CUDA kernels to overcome computational bottlenecks. This synergy between 3DGS and system design enables GS-Cache to achieve up to 5.35x performance improvement, 35% latency reduction, and 42% lower GPU memory usage, supporting 2K binocular rendering at over 120 FPS with high visual quality. By bridging the gap between 3DGS’s representation power and the demands of VR systems, GS-Cache establishes a scalable and efficient framework for real-time neural rendering in immersive environments.

[CV-66] RAPTOR: Refined Approach for Product Table Object Recognition WACV

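[Quick Read]: This paper addresses the challenges of extracting tables from documents, in particular handling diverse table formats and common detection errors such as incorrect area detection and overlapping columns. The key to the solution is RAPTOR, a modular post-processing system that enhances existing state-of-the-art models for table extraction, especially for product tables. A Genetic Algorithm optimizes the parameters of RAPTOR's modules, and the system addresses recurrent issues in Table Detection (TD) and Table Structure Recognition (TSR), improving both precision and structural predictions.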

Link: https://arxiv.org/abs/2502.14918
Authors: Eliott Thomas, Mickael Coustaty, Aurelie Joseph, Elodie Carel, Vincent Poulain D’Andecy, Jean-Marc Ogier
Affiliations: Yooz; La Rochelle Université
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted for WACVW 2025 (VisionDocs)

Abstract:Extracting tables from documents is a critical task across various industries, especially on business documents like invoices and reports. Existing systems based on DEtection TRansformer (DETR) such as TAble TRansformer (TATR), offer solutions for Table Detection (TD) and Table Structure Recognition (TSR) but face challenges with diverse table formats and common errors like incorrect area detection and overlapping columns. This research introduces RAPTOR, a modular post-processing system designed to enhance state-of-the-art models for improved table extraction, particularly for product tables. RAPTOR addresses recurrent TD and TSR issues, improving both precision and structural predictions. For TD, we use DETR (trained on ICDAR 2019) and TATR (trained on PubTables-1M and FinTabNet), while TSR only relies on TATR. A Genetic Algorithm is incorporated to optimize RAPTOR’s module parameters, using a private dataset of product tables to align with industrial needs. We evaluate our method on two private datasets of product tables, the public DOCILE dataset (which contains tables similar to our target product tables), and the ICDAR 2013 and ICDAR 2019 datasets. The results demonstrate that while our approach excels at product tables, it also maintains reasonable performance across diverse table formats. An ablation study further validates the contribution of each module in our system.

[CV-67] Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning

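[Quick Read]: This paper addresses the challenge of effectively translating high-level semantic understanding of traffic scenes into low-level motion control commands, while achieving consensus and generalization in cross-scene driving. The key to the solution is the Sce2DriveX framework, which uses multimodal joint learning from local scene videos and global BEV maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its comprehensive perception and reasoning in 3D dynamic and static scenes. Sce2DriveX also reconstructs the implicit cognitive chain of human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, and motion planning and control, further bridging the gap between autonomous driving and human thought processes. To improve model performance, the authors also developed the first extensive Visual Question Answering (VQA) driving instruction dataset tailored for 3D spatial understanding and long-axis task reasoning.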

Link: https://arxiv.org/abs/2502.14917
Authors: Rui Zhao, Qirui Yuan, Jinyu Li, Haofeng Hu, Yun Li, Chengyuan Zheng, Fei Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is an important part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate these conceptual semantics understandings into low-level motion control commands and achieve generalization and consensus in cross-scene driving. We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning MLLM framework. Sce2DriveX utilizes multimodal joint learning from local scene videos and global BEV maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its comprehensive perception and reasoning capabilities in 3D dynamic/static scenes and achieving driving generalization across scenes. Building on this, it reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning and control, thereby further bridging the gap between autonomous driving and human thought processes. To elevate model performance, we have developed the first extensive Visual Question Answering (VQA) driving instruction dataset tailored for 3D spatial understanding and long-axis task reasoning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.

[CV-68] PTB-Image: A Scanned Paper ECG Dataset for Digitization and Image-based Diagnosis

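[Quick Read]: This paper addresses the challenges that paper electrocardiograms (ECGs) pose for automated analysis and digital storage in clinical practice. The key to the solution is the PTB-Image dataset together with the VinDigitizer baseline, which converts paper ECGs into digital time-series signals by detecting signal rows, extracting waveforms from the background, and reconstructing numerical values from the digitized traces.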

Link: https://arxiv.org/abs/2502.14909
Authors: Cuong V. Nguyen, Hieu X. Nguyen, Dung D. Pham Minh, Cuong D. Do
Affiliations: VinUniversity (VinUni), Hanoi, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Electrocardiograms (ECGs) recorded on paper remain prevalent in clinical practice, yet their use presents challenges for automated analysis and digital storage. To address this issue, we introduce PTB-Image, a dataset comprising scanned paper ECGs with corresponding digital signals, enabling research on ECG digitization. We also provide VinDigitizer, a digitization baseline to convert paper-based ECGs into digital time-series signals. The method involves detecting signal rows, extracting waveforms from the background, and reconstructing numerical values from the digitized traces. We applied VinDigitizer to 549 scanned ECGs and evaluated its performance against the original PTB dataset (modified to match the printed signals). The results achieved a mean signal-to-noise ratio (SNR) of 0.01 dB, highlighting both the feasibility and challenges of ECG digitization, particularly in mitigating distortions from printing and scanning processes. By providing PTB-Image and baseline digitization methods, this work aims to facilitate advancements in ECG digitization, enhancing access to historical ECG data and supporting applications in telemedicine and automated cardiac diagnostics.
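
To put the reported SNR figure in context, the sketch below computes SNR in decibels between a reference signal and a reconstructed one. The definition used, signal power over error power, is a common convention and an assumption here; the abstract does not state the exact formula, and the sine wave is only a stand-in for an ECG lead.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB: 10 * log10(sum(reference^2) / sum((reference - estimate)^2))."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    noise_power = np.sum((reference - estimate) ** 2)
    signal_power = np.sum(reference ** 2)
    if noise_power == 0:
        return float("inf")  # perfect reconstruction
    return float(10.0 * np.log10(signal_power / noise_power))

t = np.linspace(0.0, 1.0, 500)
reference = np.sin(2 * np.pi * 5 * t)        # stand-in for a digital ECG lead

small_offset = snr_db(reference, reference + 0.01)       # tiny baseline error
all_zero = snr_db(reference, np.zeros_like(reference))   # error power == signal power
print(all_zero)  # 0.0
```

Under this definition, a value near 0 dB means the reconstruction error carries about as much power as the signal itself, which is consistent with the abstract's remark that digitization remains challenging.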

[CV-69] UPCMR: A Universal Prompt-guided Model for Random Sampling Cardiac MRI Reconstruction

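[Quick Read]: This paper addresses the long scan times of cardiac magnetic resonance imaging (CMR). To this end, it introduces UPCMR, a universal unrolled model for CMR reconstruction. The key is to combine two kinds of learnable prompts, an undersampling-specific prompt and a spatial-specific prompt, and integrate them with a UNet structure in each block, improving reconstructed image quality across different sampling modes and undersampling factors so that scans can be fast while image quality remains high.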

Link: https://arxiv.org/abs/2502.14899
Authors: Donghang Lyu, Chinmay Rao, Marius Staring, Matthias J.P. van Osch, Mariya Doneva, Hildo J. Lamb, Nicola Pezzotti
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted paper for STACOM 2024

Abstract:Cardiac magnetic resonance imaging (CMR) is vital for diagnosing heart diseases, but long scan time remains a major drawback. To address this, accelerated imaging techniques have been introduced by undersampling k-space, which reduces the quality of the resulting images. Recent deep learning advancements aim to speed up scanning while preserving quality, but adapting to various sampling modes and undersampling factors remains challenging. Therefore, building a universal model is a promising direction. In this work, we introduce UPCMR, a universal unrolled model designed for CMR reconstruction. This model incorporates two kinds of learnable prompts, undersampling-specific prompt and spatial-specific prompt, and integrates them with a UNet structure in each block. Overall, by using the CMRxRecon2024 challenge dataset for training and validation, the UPCMR model highly enhances reconstructed image quality across all random sampling scenarios through an effective training strategy compared to some traditional methods, demonstrating strong adaptability potential for this task.
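
The acceleration idea mentioned above, acquiring only part of k-space and paying for it in image quality, can be illustrated with a zero-filled reconstruction. This is background for the problem setting, not the UPCMR model: the random line mask with a fully sampled center, the acceleration factor, and the rectangular phantom are all assumptions.

```python
import numpy as np

def random_line_mask(shape, acceleration, center_lines, rng):
    """Keep a fully sampled centre band plus random rows, about 1/acceleration overall."""
    rows, cols = shape
    mask = np.zeros(rows, dtype=bool)
    start = rows // 2 - center_lines // 2
    mask[start:start + center_lines] = True
    n_target = max(rows // acceleration, center_lines)
    extra = n_target - center_lines
    if extra > 0:
        remaining = np.flatnonzero(~mask)
        mask[rng.choice(remaining, size=extra, replace=False)] = True
    return np.broadcast_to(mask[:, None], shape)

def undersample_and_zero_fill(image, mask):
    """Simulate undersampled k-space and reconstruct by inverse FFT with zeros filled in."""
    kspace = np.fft.fftshift(np.fft.fft2(image))
    kspace_us = kspace * mask
    recon = np.fft.ifft2(np.fft.ifftshift(kspace_us))
    return kspace_us, np.abs(recon)

rng = np.random.default_rng(0)
image = np.zeros((64, 64))
image[16:48, 24:40] = 1.0                      # simple rectangular phantom

mask = random_line_mask(image.shape, acceleration=4, center_lines=8, rng=rng)
kspace_us, zero_filled = undersample_and_zero_fill(image, mask)

# Reconstruction error that a learned model would aim to reduce.
error = float(np.sqrt(np.mean((zero_filled - image) ** 2)))
```

A universal reconstruction model would be expected to cope with many such masks and acceleration factors rather than one fixed pattern.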

[CV-70] A Comprehensive Survey on Concept Erasure in Text-to-Image Diffusion Models

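[Quick read]: This paper addresses the risk that text-to-image (T2I) models, while generating high-quality visual content, can reproduce copyrighted styles and produce sensitive or harmful content. The solution it surveys is concept erasure, which modifies T2I models to prevent the generation of undesired content. The survey groups concept erasure methods into three categories: fine-tuning for parameter updates, closed-form solutions for efficient edits, and inference-time interventions that restrict content without modifying weights.

Link: https://arxiv.org/abs/2502.14896
Authors: Changhoon Kim, Yanjun Qi
Affiliations: Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: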

Abstract:Text-to-Image (T2I) models have made remarkable progress in generating high-quality, diverse visual content from natural language prompts. However, their ability to reproduce copyrighted styles, sensitive imagery, and harmful content raises significant ethical and legal concerns. Concept erasure offers a proactive alternative to external filtering by modifying T2I models to prevent the generation of undesired content. In this survey, we provide a structured overview of concept erasure, categorizing existing methods based on their optimization strategies and the architectural components they modify. We categorize concept erasure methods into fine-tuning for parameter updates, closed-form solutions for efficient edits, and inference-time interventions for content restriction without weight modification. Additionally, we explore adversarial attacks that bypass erasure techniques and discuss emerging defenses. To support further research, we consolidate key datasets, evaluation metrics, and benchmarks for assessing erasure effectiveness and model robustness. This survey serves as a comprehensive resource, offering insights into the evolving landscape of concept erasure, its challenges, and future directions.

[CV-71] High-Dynamic Radar Sequence Prediction for Weather Nowcasting Using Spatiotemporal Coherent Gaussian Representation ICLR2025

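[Quick read]: This paper addresses 3D radar sequence prediction for weather nowcasting, where existing methods are mostly limited to 2D spatial prediction and constrained by training and storage efficiency. The key solution is a new framework that uses SpatioTemporal Coherent Gaussian Splatting (STC-GS) to represent dynamic radar scenes efficiently and GauMamba to forecast them accurately. STC-GS optimizes the 3D scene at each frame with a group of Gaussians while capturing their movements across consecutive frames, so that each Gaussian is tracked consistently over time. GauMamba integrates a memory mechanism into the Mamba framework, which lets it handle a large volume of Gaussian tokens while learning the temporal evolution of Gaussian groups, improving forecast accuracy without sacrificing efficiency.

Link: https://arxiv.org/abs/2502.14895
Authors: Ziye Wang, Yiran Qin, Lin Zeng, Ruimao Zhang
Affiliations: Sun Yat-sen University; The Chinese University of Hong Kong, Shenzhen; Guangzhou Meteorological Observatory
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: Accepted as an Oral paper at ICLR 2025. Project page: this https URL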

Abstract:Weather nowcasting is an essential task that involves predicting future radar echo sequences based on current observations, offering significant benefits for disaster management, transportation, and urban planning. Current prediction methods are limited by training and storage efficiency, mainly focusing on 2D spatial predictions at specific altitudes. Meanwhile, 3D volumetric predictions at each timestamp remain largely unexplored. To address such a challenge, we introduce a comprehensive framework for 3D radar sequence prediction in weather nowcasting, using the newly proposed SpatioTemporal Coherent Gaussian Splatting (STC-GS) for dynamic radar representation and GauMamba for efficient and accurate forecasting. Specifically, rather than relying on a 4D Gaussian for dynamic scene reconstruction, STC-GS optimizes 3D scenes at each frame by employing a group of Gaussians while effectively capturing their movements across consecutive frames. It ensures consistent tracking of each Gaussian over time, making it particularly effective for prediction tasks. With the temporally correlated Gaussian groups established, we utilize them to train GauMamba, which integrates a memory mechanism into the Mamba framework. This allows the model to learn the temporal evolution of Gaussian groups while efficiently handling a large volume of Gaussian tokens. As a result, it achieves both efficiency and accuracy in forecasting a wide range of dynamic meteorological radar signals. The experimental results demonstrate that our STC-GS can efficiently represent 3D radar sequences with over 16× higher spatial resolution compared with the existing 3D representation methods, while GauMamba outperforms state-of-the-art methods in forecasting a broad spectrum of high-dynamic weather conditions.

[CV-72] FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction

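[Quick read]: This paper addresses the problem of accurately predicting per- and polyfluoroalkyl substance (PFAS) contamination in surface water over large regions. The key solution is FOCUS, a geospatial deep learning framework that integrates hydrological flow data, land cover information, and proximity to known PFAS sources, and that uses a loss function aware of label noise to improve prediction accuracy.

Link: https://arxiv.org/abs/2502.14894
Authors: Jowaria Khan, Alexa Friedman, Sydney Evans, Runzi Wang, Kaley Beins, David Andrews, Elizabeth Bondi-Kelly
Affiliations: University of Michigan; Environmental Working Group; University of California, Davis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: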

Abstract:Per and polyfluoroalkyl substances (PFAS), chemicals found in products like non-stick cookware, are unfortunately persistent environmental pollutants with severe health risks. Accurately mapping PFAS contamination is crucial for guiding targeted remediation efforts and protecting public and environmental health, yet detection across large regions remains challenging due to the cost of testing and the difficulty of simulating their spread. In this work, we introduce FOCUS, a geospatial deep learning framework with a label noise-aware loss function, to predict PFAS contamination in surface water over large regions. By integrating hydrological flow data, land cover information, and proximity to known PFAS sources, our approach leverages both spatial and environmental context to improve prediction accuracy. We evaluate the performance of our approach through extensive ablation studies and comparative analyses against baselines like sparse segmentation, as well as existing scientific methods, including Kriging and pollutant transport simulations. Results highlight our framework’s potential for scalable PFAS monitoring.

[CV-73] NOTA: Multimodal Music Notation Understanding for Visual Large Language Model

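[Quick read]: This paper addresses the weakness of existing large language models in multimodal music notation understanding. Current research focuses mainly on unimodal symbol-sequence text, and existing general-domain visual language models still lack the ability to understand music notation. To close this gap, the paper proposes NOTA, a large-scale comprehensive multimodal music notation dataset with 1,019,237 records from 3 regions of the world, covering 3 tasks. On this dataset the authors train NotaGPT, a music notation visual large language model. The key step is a pre-alignment training phase that aligns the notes depicted in music score images with their textual representation in ABC notation. The resulting model shows a significant improvement in music understanding, which supports the effectiveness of NOTA and the training pipeline.

Link: https://arxiv.org/abs/2502.14893
Authors: Mingni Tang, Jiajia Li, Lu Yang, Zhiqiang Zhang, Jinghao Tian, Zuchao Li, Lefei Zhang, Ping Wang
Affiliations: School of Computer Science, Wuhan University, Wuhan, China; Key Laboratory of Archival Intelligent Development and Service, NAAC; School of Information Management, Wuhan University, Wuhan, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: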

Abstract:Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has primarily focused on unimodal symbol sequence text. Existing general-domain visual language models still lack the ability of music notation understanding. Recognizing this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records, from 3 regions of the world, and contains 3 tasks. Based on the dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we involve a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent training phases focus on foundational music information extraction, followed by training on music notation analysis. Experimental results demonstrate that our NotaGPT-7B achieves significant improvement on music understanding, showcasing the effectiveness of NOTA and the training pipeline. Our datasets are open-sourced at this https URL.

[CV-74] EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild NAACL2025

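[Quick read]: This paper addresses the fundamental challenge of predicting, in real time and in real-world environments, when a conversational agent should start speaking. The key solution is the EgoSpeak framework, which bridges the gap between simplified experimental setups and complex natural conversations by integrating four capabilities: first-person perspective modeling, RGB processing, online processing, and untrimmed video processing.

Link: https://arxiv.org/abs/2502.14892
Authors: Junhyeok Kim, Min Soo Kim, Jiwan Chung, Jungbin Cho, Jisoo Kim, Sungwoong Kim, Gyeongbo Sim, Youngjae Yu
Affiliations: Yonsei University; Multimodal AI Lab., NC Research, NCSOFT Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: NAACL 2025 Findings. Project page at this https URL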

Abstract:Predicting when to initiate speech in real-world environments remains a fundamental challenge for conversational agents. We introduce EgoSpeak, a novel framework for real-time speech initiation prediction in egocentric streaming video. By modeling the conversation from the speaker’s first-person viewpoint, EgoSpeak is tailored for human-like interactions in which a conversational agent must continuously observe its environment and dynamically decide when to talk. Our approach bridges the gap between simplified experimental setups and complex natural conversations by integrating four key capabilities: (1) first-person perspective, (2) RGB processing, (3) online processing, and (4) untrimmed video processing. We also present YT-Conversation, a diverse collection of in-the-wild conversational videos from YouTube, as a resource for large-scale pretraining. Experiments on EasyCom and Ego4D demonstrate that EgoSpeak outperforms random and silence-based baselines in real time. Our results also highlight the importance of multimodal input and context length in effectively deciding when to speak.

[CV-75] CoDiff: Conditional Diffusion Model for Collaborative 3D Object Detection

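[Quick read]: This paper addresses detection errors in collaborative 3D object detection that arise when pose estimation errors and time delays introduce spatial and temporal noise into the features fused across agents. The key solution is the CoDiff framework, which exploits the denoising ability of diffusion models: high-dimensional feature maps are projected into a latent space, each agent's information serves as a condition that guides the diffusion model's sampling, and the fused features are progressively denoised and refined into more comprehensive and clearer feature representations.

Link: https://arxiv.org/abs/2502.14891
Authors: Zhe Huang, Shuo Wang, Yongcai Wang, Lei Wang
Affiliations: Renmin University of China; University of Wollongong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: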

Abstract:Collaborative 3D object detection holds significant importance in the field of autonomous driving, as it greatly enhances the perception capabilities of each individual agent by facilitating information exchange among multiple agents. However, in practice, due to pose estimation errors and time delays, the fusion of information across agents often results in feature representations with spatial and temporal noise, leading to detection errors. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to explore the use of diffusion models to address the noise problem between multi-agent systems. In this work, we propose CoDiff, a novel robust collaborative perception framework that leverages the potential of diffusion models to generate more comprehensive and clearer feature representations. To the best of our knowledge, this is the first work to apply diffusion models to multi-agent collaborative perception. Specifically, we project high-dimensional feature map into the latent space of a powerful pre-trained autoencoder. Within this space, individual agent information serves as a condition to guide the diffusion model’s sampling. This process denoises coarse feature maps and progressively refines the fused features. Experimental study on both simulated and real-world datasets demonstrates that the proposed framework CoDiff consistently outperforms existing relevant methods in terms of the collaborative object detection performance, and exhibits highly desired robustness when the pose and delay information of agents is with high-level noise.

[CV-76] WeedVision: Multi-Stage Growth and Classification of Weeds using DETR and RetinaNet for Precision Agriculture ICMLA2024

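[Quick read]: This paper addresses weed identification in crop management, in particular the challenge of accurately detecting multiple economically relevant weed species at different growth stages. The study trains and evaluates two object detection models, the Detection Transformer (DETR) and RetinaNet, to identify and classify 16 weed species of economic concern across 174 classes. RetinaNet achieves higher mean Average Precision (mAP) and recall and also runs inference faster, which makes it better suited to real-time applications.

Link: https://arxiv.org/abs/2502.14890
Authors: Taminul Islam, Toqi Tahamid Sarker, Khaled R Ahmed, Cristiana Bernardi Rankrape, Karla Gage
Affiliations: Southern Illinois University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and Presented to ICMLA, 2024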

Abstract:Weed management remains a critical challenge in agriculture, where weeds compete with crops for essential resources, leading to significant yield losses. Accurate detection of weeds at various growth stages is crucial for effective management yet challenging for farmers, as it requires identifying different species at multiple growth phases. This research addresses these challenges by utilizing advanced object detection models, specifically, the Detection Transformer (DETR) with a ResNet50 backbone and RetinaNet with a ResNeXt101 backbone, to identify and classify 16 weed species of economic concern across 174 classes, spanning their 11 weeks growth stages from seedling to maturity. A robust dataset comprising 203,567 images was developed, meticulously labeled by species and growth stage. The models were rigorously trained and evaluated, with RetinaNet demonstrating superior performance, achieving a mean Average Precision (mAP) of 0.907 on the training set and 0.904 on the test set, compared to DETR’s mAP of 0.854 and 0.840, respectively. RetinaNet also outperformed DETR in recall and inference speed of 7.28 FPS, making it more suitable for real time applications. Both models showed improved accuracy as plants matured. This research provides crucial insights for developing precise, sustainable, and automated weed management strategies, paving the way for real time species specific detection systems and advancing AI-assisted agriculture through continued innovation in model development and early detection accuracy.

[CV-77] Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability ICLR2025

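[Quick read]: This paper addresses the interpretability of multimodal image-text representation models, which matters for safe deployment in real-world applications such as healthcare. Bottleneck methods have been applied to improve the interpretability of CLIP, but they are hindered by strong assumptions or intrinsic randomness. The key solution is the Narrowing Information Bottleneck Theory, a new framework that redefines the traditional bottleneck approach and is designed to satisfy contemporary attribution axioms, giving a more robust and reliable way to improve the interpretability of multimodal models. In the experiments, compared with state-of-the-art methods, the approach improves image interpretability by an average of 9% and text interpretability by an average of 58.83%, and speeds up processing by 63.95%.

Link: https://arxiv.org/abs/2502.14889
Authors: Zhiyu Zhu, Zhibo Jin, Jiayu Zhang, Nan Yang, Jiahao Huang, Jianlong Zhou, Fang Chen
Affiliations: University of Technology Sydney; SuZhouYierqi; University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2025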

Abstract:The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP’s interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the Narrowing Information Bottleneck Theory, a novel framework that fundamentally redefines the traditional bottleneck approach. This theory is specifically designed to satisfy contemporary attribution axioms, providing a more robust and reliable solution for improving the interpretability of multimodal models. In our experiments, compared to state-of-the-art methods, our approach enhances image interpretability by an average of 9%, text interpretability by an average of 58.83%, and accelerates processing speed by 63.95%. Our code is publicly accessible at this https URL.

[CV-78] The Multi-Faceted Monosemanticity in Multimodal Representations

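[Quick read]: This paper addresses how to extract interpretable features from large-scale multimodal models such as CLIP in order to understand the gaps between modalities. The key solution builds on recent advances in feature monosemanticity: it introduces the Modality Dominance Score (MDS) to attribute the interpretability of each feature to its modality, and transforms CLIP features into a more interpretable space so they can be categorized as vision features (single-modal), language features (single-modal), and visual-language features (cross-modal). The categorization aligns closely with human cognitive understanding of the modalities, and the modality-specific features prove useful for detecting gender bias, defending against adversarial attacks, and editing text-to-image models.

Link: https://arxiv.org/abs/2502.14888
Authors: Hanqi Yan, Xiangxiang Cui, Lu Yin, Paul Pu Liang, Yulan He, Yifei Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: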

Abstract:In this paper, we leverage recent advancements in feature monosemanticity to extract interpretable features from deep multimodal models, offering a data-driven understanding of modality gaps. Specifically, we investigate CLIP (Contrastive Language-Image Pretraining), a prominent visual-language representation model trained on extensive image-text pairs. Building upon interpretability tools developed for single-modal models, we extend these methodologies to assess multi-modal interpretability of CLIP features. Additionally, we introduce the Modality Dominance Score (MDS) to attribute the interpretability of each feature to its respective modality. Next, we transform CLIP features into a more interpretable space, enabling us to categorize them into three distinct classes: vision features (single-modal), language features (single-modal), and visual-language features (cross-modal). Our findings reveal that this categorization aligns closely with human cognitive understandings of different modalities. We also demonstrate significant use cases of this modality-specific features including detecting gender bias, adversarial attack defense and text-to-image model editing. These results indicate that large-scale multimodal models, equipped with task-agnostic interpretability tools, offer valuable insights into key connections and distinctions between different modalities.

[CV-79] Vision-Enhanced Time Series Forecasting via Latent Diffusion Models

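[Quick read]: This paper addresses the challenges of cross-modal modeling in time series forecasting and of transforming visual information effectively enough to capture temporal patterns. The key solution is the LDM4TS framework, which uses complementary transformation techniques to convert time series into multi-view visual representations and a pre-trained vision encoder to extract rich features from them. These representations are then reconstructed by a latent diffusion model with a cross-modal conditioning mechanism and a fusion module, and the resulting model is used for time series forecasting.

Link: https://arxiv.org/abs/2502.14887
Authors: Weilin Ruan, Siru Zhong, Haomin Wen, Yuxuan Liang
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: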

Abstract:Diffusion models have recently emerged as powerful frameworks for generating high-quality images. While recent studies have explored their application to time series forecasting, these approaches face significant challenges in cross-modal modeling and transforming visual information effectively to capture temporal patterns. In this paper, we propose LDM4TS, a novel framework that leverages the powerful image reconstruction capabilities of latent diffusion models for vision-enhanced time series forecasting. Instead of introducing external visual data, we are the first to use complementary transformation techniques to convert time series into multi-view visual representations, allowing the model to exploit the rich feature extraction capabilities of the pre-trained vision encoder. Subsequently, these representations are reconstructed using a latent diffusion model with a cross-modal conditioning mechanism as well as a fusion module. Experimental results demonstrate that LDM4TS outperforms various specialized forecasting models for time series forecasting tasks.
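As an illustration of what converting a time series into a visual representation can look like, the sketch below computes a Gramian Angular (Summation) Field, one well-known series-to-image encoding. The abstract does not name the transformations LDM4TS uses, so the choice of this encoding and the function name `gramian_angular_field` are assumptions made purely for illustration.

```python
import numpy as np

def gramian_angular_field(series):
    """Encode a 1D series as a 2D image via the Gramian Angular Summation Field."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()                  # assumes the series is not constant
    scaled = 2.0 * (x - lo) / (hi - lo) - 1.0  # rescale values into [-1, 1]
    scaled = np.clip(scaled, -1.0, 1.0)        # guard against rounding drift
    phi = np.arccos(scaled)                    # polar-angle encoding of each value
    # Pixel (i, j) holds cos(phi_i + phi_j), a pairwise temporal correlation.
    return np.cos(phi[:, None] + phi[None, :])

# Hypothetical toy series.
image = gramian_angular_field([0.0, 1.0, 2.0])
print(image.shape)
```

The output is a symmetric square matrix with one row and column per time step, so an image encoder can treat temporal relationships as spatial texture.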

[CV-80] Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review

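[Quick read]: This paper surveys how machine learning (ML) and deep learning (DL) can improve surgical scene understanding in minimally invasive surgery (MIS). The focus is on integrating Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and foundation models (FMs) such as the Segment Anything Model (SAM) into surgical workflows to improve segmentation accuracy, instrument tracking, and phase recognition in surgical endoscopic video analysis. The paper also examines the challenges these technologies face, such as data variability and computational demands, and discusses ethical considerations and integration hurdles in clinical settings.

Link: https://arxiv.org/abs/2502.14886
Authors: Ufaq Khan, Umair Nawaz, Adnan Qayyum, Shazad Ashraf, Muhammad Bilal, Junaid Qadir
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence (Abu Dhabi, UAE); HBKU (Education City, Qatar); University Hospitals Birmingham (Birmingham, UK); Birmingham City University (UK); Qatar University (Qatar)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: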

Abstract:Recent advancements in machine learning (ML) and deep learning (DL), particularly through the introduction of foundational models (FMs), have significantly enhanced surgical scene understanding within minimally invasive surgery (MIS). This paper surveys the integration of state-of-the-art ML and DL technologies, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and foundational models like the Segment Anything Model (SAM), into surgical workflows. These technologies improve segmentation accuracy, instrument tracking, and phase recognition in surgical endoscopic video analysis. The paper explores the challenges these technologies face, such as data variability and computational demands, and discusses ethical considerations and integration hurdles in clinical settings. Highlighting the roles of FMs, we bridge the technological capabilities with clinical needs and outline future research directions to enhance the adaptability, efficiency, and ethical alignment of AI applications in surgery. Our findings suggest that substantial progress has been made; however, more focused efforts are required to achieve seamless integration of these technologies into clinical workflows, ensuring they complement surgical practice by enhancing precision, reducing risks, and optimizing patient outcomes.

[CV-81] SEM-CLIP: Precise Few-Shot Learning for Nanoscale Defect Detection in Scanning Electron Microscope Image

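[Quick read]: This paper addresses the detection and classification of nanoscale wafer defects in integrated circuit manufacturing, where the complex background patterns in scanning electron microscope (SEM) images and the diverse textures of defects make the task hard. Traditional methods are limited by insufficient data, scarce labels, and poor transferability. The key solution is SEM-CLIP, a new few-shot learning approach that customizes the Contrastive Language-Image Pretraining (CLIP) model to focus on defect areas and reduce background distractions, which improves segmentation accuracy. The method also uses text prompts enriched with domain knowledge as prior information and combines feature engineering with textual guidance to classify defects more effectively. SEM-CLIP needs little annotated data, which substantially reduces labor demands in the semiconductor industry.

Link: https://arxiv.org/abs/2502.14884
Authors: Qian Jin, Yuqi Jiang, Xudong Lu, Yumeng Liu, Yining Chen, Dawei Gao, Qi Sun, Cheng Zhuo
Affiliations: Zhejiang University; Zhejiang University, HIC-ZJU
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published in ACM/IEEE International Conference on Computer-Aided Design (ICCAD), 2024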

Abstract:In the field of integrated circuit manufacturing, the detection and classification of nanoscale wafer defects are critical for subsequent root cause analysis and yield enhancement. The complex background patterns observed in scanning electron microscope (SEM) images and the diverse textures of the defects pose significant challenges. Traditional methods usually suffer from insufficient data, labels, and poor transferability. In this paper, we propose a novel few-shot learning approach, SEM-CLIP, for accurate defect classification and segmentation. SEM-CLIP customizes the Contrastive Language-Image Pretraining (CLIP) model to better focus on defect areas and minimize background distractions, thereby enhancing segmentation accuracy. We employ text prompts enriched with domain knowledge as prior information to assist in precise analysis. Additionally, our approach incorporates feature engineering with textual guidance to categorize defects more effectively. SEM-CLIP requires little annotated data, substantially reducing labor demands in the semiconductor industry. Extensive experimental validation demonstrates that our model achieves impressive classification and segmentation results under few-shot learning scenarios.

[CV-82] Can LVLMs and Automatic Metrics Capture Underlying Preferences of Blind and Low-Vision Individuals for Navigational Aid?

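[Quick read]: This paper studies which response styles Blind and Low-Vision (BLV) users prefer when Large Vision-Language Models (LVLMs) are used as semantic-based navigational aids, a question that had not been examined systematically before. The key solution is to construct the Eye4B dataset and run an in-depth user study in which eight BLV users rate six LVLMs along the dimensions Afraidness, Nonactionability, Sufficiency, and Conciseness. The paper also introduces the Eye4B benchmark to evaluate how well widely used model-based image-text metrics align with the collected BLV preferences.

Link: https://arxiv.org/abs/2502.14883
Authors: Na Min An, Eunki Kim, Wan Ju Kang, Sangryul Kim, Hyunjung Shim, James Thorne
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 26 pages, 12 figures, 14 tables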

Abstract:Vision is a primary means by which humans perceive the environment, but Blind and Low-Vision (BLV) people need assistance understanding their surroundings, especially in unfamiliar environments. The emergence of semantic-based systems as assistance tools for BLV users has motivated many researchers to explore responses from Large Vision-Language Models (LVLMs). However, the preferences of BLV users on diverse types/styles of responses from LVLMs, specifically for navigational aid, have yet to be studied. To fill this gap, we first construct the Eye4B dataset, consisting of human-validated 1.1k curated outdoor/indoor scenes with 5-10 relevant requests per scene. Then, we conduct an in-depth user study with eight BLV users to evaluate their preferences on six LVLMs from five perspectives: Afraidness, Nonactionability, Sufficiency, and Conciseness. Finally, we introduce the Eye4B benchmark for evaluating alignment between widely used model-based image-text metrics and our collected BLV preferences. Our work can serve as a guideline for developing BLV-aware LVLMs towards a Barrier-Free AI system.

[CV-83] From 16-Bit to 1-Bit: Visual KV Cache Quantization for Memory-Efficient Multimodal Large Language Models

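[Quick read]: This paper addresses the memory consumed by large key-value (KV) caches when deploying multimodal large language models (MLLMs). The key solution is a visual quantization strategy that reaches 1-bit quantization through group-specific quantization and quantile-based quantization. It keeps all visual tokens rather than dropping any, which significantly reduces memory use while maintaining computational efficiency and multimodal performance. The method is plug-and-play and can be integrated into existing MLLM architectures without structural changes.

Link: https://arxiv.org/abs/2502.14882
Authors: Zeliang Zhang, Yifan Zhu, Susan Liang, Zhiyuan Wang, Jiani Liu, Haiting Lin, Mingjie Zhao, Chenliang Xu, Kun Wan, Wentian Zhao
Affiliations: University of Rochester; UCSB; Adobe Inc.; Juniper Networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: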

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable success across various applications, yet their computational overhead during deployment remains a critical challenge. While Key-Value (KV) caching improves inference efficiency by trading memory for computation, the growing memory footprint from storing extensive KV caches reduces throughput and limits long-term execution on devices with constrained GPU memory. Existing approaches primarily focus on dropping unimportant tokens to reduce the KV cache size, mitigating memory constraints at the cost of potential information loss. In contrast, we propose a simple yet effective visual quantization strategy that preserves all visual tokens while significantly reducing memory consumption. To achieve an extreme quantization ratio, i.e., 1-bit quantization, we propose group-specific quantization and quantile-based quantization approaches, motivated by the inherent patterns of the KV cache. Our method is plug-and-play, enabling seamless integration into various MLLMs to improve memory efficiency without architectural modifications. Extensive experiments demonstrate that our approach effectively reduces memory overhead while maintaining computational efficiency and preserving multimodal performance.
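A rough sketch of what group-wise, quantile-based 1-bit quantization could look like is given below. It is an assumption-laden illustration, not the paper's algorithm: the grouping along the flattened array, the median threshold, and the 25th/75th percentile reconstruction levels are all choices made here for clarity, and the names `quantize_1bit` and `dequantize_1bit` are hypothetical.

```python
import numpy as np

def quantize_1bit(x, group_size):
    """Store one bit per value plus two reconstruction levels per group.

    Assumes x.size is divisible by group_size.
    """
    x = np.asarray(x, dtype=np.float32)
    groups = x.reshape(-1, group_size)
    median = np.quantile(groups, 0.5, axis=1, keepdims=True)   # per-group threshold
    low = np.quantile(groups, 0.25, axis=1, keepdims=True)     # level for bit 0
    high = np.quantile(groups, 0.75, axis=1, keepdims=True)    # level for bit 1
    bits = groups > median
    return bits, low, high

def dequantize_1bit(bits, low, high, shape):
    """Map each bit back to its group's low or high level."""
    return np.where(bits, high, low).reshape(shape)

# Hypothetical cache slice: two groups with very different value ranges.
cache = np.array([[0.0, 1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0]])
bits, low, high = quantize_1bit(cache, group_size=4)
restored = dequantize_1bit(bits, low, high, cache.shape)
print(restored)
```

Keeping separate levels per group is what lets a single bit stay informative when different parts of the cache have different scales; a single global threshold would collapse the first group in this example to one value.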

[CV-84] A Survey of Safety on Large Vision-Language Models: Attacks Defenses and Evaluations

[Quick read]: This paper addresses the safety of Large Vision-Language Models (LVLMs). The key contribution is a unified framework that integrates the interrelated components of attacks, defenses, and evaluation methods, giving a holistic view of LVLM vulnerabilities and the corresponding mitigation strategies. It also introduces a classification framework based on the LVLM lifecycle that distinguishes the inference and training phases, with further subcategories for deeper insight.

Link: https://arxiv.org/abs/2502.14881
Authors: Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, Dacheng Tao
Affiliations: School of Computer Science, Wuhan University, Wuhan, China; School of Cyber Science and Technology, University of Science and Technology of China, Hefei, China; College of Computing and Data Science, Nanyang Technological University, Singapore
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 2 figures

Abstract:With the rapid advancement of Large Vision-Language Models (LVLMs), ensuring their safety has emerged as a crucial area of research. This survey provides a comprehensive analysis of LVLM safety, covering key aspects such as attacks, defenses, and evaluation methods. We introduce a unified framework that integrates these interrelated components, offering a holistic perspective on the vulnerabilities of LVLMs and the corresponding mitigation strategies. Through an analysis of the LVLM lifecycle, we introduce a classification framework that distinguishes between inference and training phases, with further subcategories to provide deeper insights. Furthermore, we highlight limitations in existing research and outline future directions aimed at strengthening the robustness of LVLMs. As part of our research, we conduct a set of safety evaluations on the latest LVLM, Deepseek Janus-Pro, and provide a theoretical analysis of the results. Our findings provide strategic recommendations for advancing LVLM safety and ensuring their secure and reliable deployment in high-stakes, real-world applications. This survey aims to serve as a cornerstone for future research, facilitating the development of models that not only push the boundaries of multimodal intelligence but also adhere to the highest standards of security and ethical integrity. Furthermore, to aid the growing research in this field, we have created a public repository to continuously compile and update the latest work on LVLM safety: this https URL .

[CV-85] KKA: Improving Vision Anomaly Detection through Anomaly-related Knowledge from Large Language Models

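[Quick read]: This paper addresses the difficulty of separating normal samples from anomalies in unsupervised vision anomaly detection, which stems from the wide variability of anomalies. Anomalies generated by existing methods often lack realism and give limited support for building an effective decision boundary. To address this, the paper proposes Key Knowledge Augmentation (KKA), which extracts anomaly-related knowledge from large language models (LLMs), generates meaningful anomalies based on normal samples, and classifies them as easy or hard anomalies. The key to KKA is gradually increasing the proportion of hard anomalies so that the detector learns a more effective boundary. Experiments show that the method significantly improves the performance of various vision anomaly detectors while keeping generation costs low.

Link: https://arxiv.org/abs/2502.14880
Authors: Dong Chen, Zhengqing Hu, Peiguang Fan, Yueting Zhuang, Yafei Li, Qidong Liu, Xiaoheng Jiang, Mingliang Xu
Affiliations: School of Computer and Artificial Intelligence of Zhengzhou University, Zhengzhou, China; Engineering Research Center of Intelligent Swarm Systems, Ministry of Education, Zhengzhou, China; National Supercomputing Center In Zhengzhou, Zhengzhou, China; Zhejiang University, Hangzhou, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: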

Abstract:Vision anomaly detection, particularly in unsupervised settings, often struggles to distinguish between normal samples and anomalies due to the wide variability in anomalies. Recently, an increasing number of studies have focused on generating anomalies to help detectors learn more effective boundaries between normal samples and anomalies. However, as the generated anomalies are often derived from random factors, they frequently lack realism. Additionally, randomly generated anomalies typically offer limited support in constructing effective boundaries, as most differ substantially from normal samples and lie far from the boundary. To address these challenges, we propose Key Knowledge Augmentation (KKA), a method that extracts anomaly-related knowledge from large language models (LLMs). More specifically, KKA leverages the extensive prior knowledge of LLMs to generate meaningful anomalies based on normal samples. Then, KKA classifies the generated anomalies as easy anomalies and hard anomalies according to their similarity to normal samples. Easy anomalies exhibit significant differences from normal samples, whereas hard anomalies closely resemble normal samples. KKA iteratively updates the generated anomalies, gradually increasing the proportion of hard anomalies to enable the detector to learn a more effective boundary. Experimental results show that the proposed method significantly improves the performance of various vision anomaly detectors while maintaining low generation costs. The code for CMG can be found at this https URL.
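The easy/hard split and the gradually increasing share of hard anomalies can be sketched as a simple curriculum. Everything concrete here is an assumption for illustration: the similarity threshold, the linear schedule, and the function names are hypothetical and are not taken from the KKA paper.

```python
def split_by_similarity(similarities, threshold=0.5):
    """Indices of easy (dissimilar to normal) and hard (similar to normal) anomalies."""
    easy = [i for i, s in enumerate(similarities) if s < threshold]
    hard = [i for i, s in enumerate(similarities) if s >= threshold]
    return easy, hard

def hard_anomaly_ratio(round_index, total_rounds, start=0.1, end=0.9):
    """Linearly raise the share of hard anomalies from `start` to `end`."""
    if total_rounds <= 1:
        return end
    progress = round_index / (total_rounds - 1)
    return start + (end - start) * progress

def build_batch(easy, hard, ratio, batch_size):
    """Pick hard anomalies first, then fill the rest of the batch with easy ones."""
    n_hard = min(len(hard), round(ratio * batch_size))
    n_easy = min(len(easy), batch_size - n_hard)
    return hard[:n_hard] + easy[:n_easy]

# Hypothetical similarity scores between generated anomalies and normal samples.
easy, hard = split_by_similarity([0.1, 0.9, 0.2, 0.8, 0.7, 0.3])
for round_index in range(5):
    ratio = hard_anomaly_ratio(round_index, total_rounds=5)
    print(round_index, build_batch(easy, hard, ratio, batch_size=4))
```

Early rounds are dominated by easy anomalies that lie far from the normal data, and later rounds shift toward hard ones near the boundary, which is the intuition the abstract gives for learning a tighter boundary.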

[CV-86] Bridging Text and Vision: A Multi-View Text-Vision Registration Approach for Cross-Modal Place Recognition

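Quick read: This paper addresses mobile robots' need for natural language understanding in order to identify locations accurately and carry out tasks such as package delivery. Traditional visual place recognition (VPR) methods rely solely on single-view visual information and cannot interpret human language descriptions. To overcome this, the paper proposes Text4VPR, a multi-view (360° views of the surroundings) text-vision registration approach and the first method to use textual descriptions exclusively to match a database of images. The key ideas are to extract global text embeddings with a frozen T5 language model and to assign local tokens to their clusters using the Sinkhorn algorithm with a temperature coefficient, thereby aggregating visual descriptors from images. At inference, Text4VPR applies Cascaded Cross-Attention Cosine Alignment (CCCA) to resolve the internal mismatch between text and image groups. On the test set of the Street360Loc dataset, the method achieves a leading top-1 accuracy of 57% and a top-10 accuracy of 92% within a 5-meter radius, indicating that localization from textual descriptions to images is feasible and has room for further development.

Link: https://arxiv.org/abs/2502.14195
Authors: Tianyi Shang,Zhenyu Li,Pengjie Xu,Jinwei Qiao,Gang Chen,Zihan Ruan,Weijun Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 4 figures, conference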

Abstract:Mobile robots necessitate advanced natural language understanding capabilities to accurately identify locations and perform tasks such as package delivery. However, traditional visual place recognition (VPR) methods rely solely on single-view visual information and cannot interpret human language descriptions. To overcome this challenge, we bridge text and vision by proposing a multiview (360° views of the surroundings) text-vision registration approach called Text4VPR for the place recognition task, which is the first method that exclusively utilizes textual descriptions to match a database of images. Text4VPR employs the frozen T5 language model to extract global textual embeddings. Additionally, it utilizes the Sinkhorn algorithm with a temperature coefficient to assign local tokens to their respective clusters, thereby aggregating visual descriptors from images. During the training stage, Text4VPR emphasizes the alignment between individual text-image pairs for precise textual description. In the inference stage, Text4VPR uses the Cascaded Cross-Attention Cosine Alignment (CCCA) to address the internal mismatch between text and image groups. Subsequently, Text4VPR performs precise place matching based on the descriptions of text-image groups. On Street360Loc, the first text-to-image VPR dataset, which we created, Text4VPR builds a robust baseline, achieving a leading top-1 accuracy of 57% and a leading top-10 accuracy of 92% within a 5-meter radius on the test set, which indicates that localization from textual descriptions to images is not only feasible but also holds significant potential for further advancement, as shown in Figure 1.
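
The abstract mentions assigning local tokens to clusters with the Sinkhorn algorithm and a temperature coefficient. The sketch below shows the generic Sinkhorn normalization on a token-by-cluster score matrix. It is an illustration under assumptions, not the Text4VPR code: the temperature of 0.1, the iteration count, the equal-mass-per-cluster constraint, and the toy scores are all chosen here for demonstration.

```python
import math

def sinkhorn(scores, temperature=0.1, n_iters=50):
    """Soft-assign N tokens to K clusters with Sinkhorn normalization.

    scores[i][j] is the affinity of token i to cluster j. The result has
    rows summing to 1 (each token is fully assigned) and columns summing
    to approximately N / K (clusters receive equal mass).
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global max before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    p = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster gets total mass N / K.
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(k)]
        p = [[p[i][j] * (n / k) / col_sums[j] for j in range(k)] for i in range(n)]
        # Row step: every token distributes a total mass of 1.
        p = [[v / sum(row) for v in row] for row in p]
    return p

toy_scores = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.6, 0.5]]  # 4 tokens, 2 clusters
assignment = sinkhorn(toy_scores)
for row in assignment:
    print([round(v, 3) for v in row])
```

A lower temperature sharpens the assignment toward a hard one-cluster-per-token choice, while a higher temperature spreads each token across clusters.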

[CV-87] d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining ICPR

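Quick read: This paper addresses the problem of generating high-quality, realistic images from rough, user-specified hand-drawn sketches. The key idea is to exploit the feature generalization capability of a large-scale diffusion model (Denoising Diffusion Probabilistic Models, DDPMs): a learnable lightweight mapping network translates latent features from the source domain to the target domain, with no need to retrain the large model. The method outperforms existing techniques on both qualitative and quantitative benchmarks and enables high-resolution realistic image synthesis from rough hand-drawn sketches.

Link: https://arxiv.org/abs/2502.14007
Authors: Prasun Roy,Saumik Bhattacharya,Subhankar Ghosh,Umapada Pal,Michael Blumenstein
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: Accepted in The International Conference on Pattern Recognition (ICPR) 2024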

Abstract:Structural guidance in an image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases of content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation contributes to additional complexity in the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation by exploiting the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from source to target domain. Experimental results demonstrate that the proposed method outperforms the existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.

[CV-88] MambaPlace: Text-to-Point-Cloud Cross-Modal Place Recognition with Attention Mamba Mechanisms

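Quick read: This paper addresses the difficulty that traditional neural architectures have in capturing dynamic cross-modal interactions for Vision Language Place Recognition (VLVPR) in complex scenes. The key to the solution is MambaPlace, an end-to-end connected, coarse-to-fine cross-modal place recognition framework. The framework first encodes the text description with a pretrained T5 model and the 3D point cloud with an instance encoder; in the coarse localization stage, Text Attention Mamba (TAM) and Point Clouds Mamba (PCM) perform data enhancement and alignment. In the fine localization stage, cascaded Cross Attention Mamba (CCAM) fuses and further enhances the text and 3D point cloud features across modalities, and the positional offset is finally predicted from the fused text-point-cloud features to achieve high-precision localization.

Link: https://arxiv.org/abs/2408.15740
Authors: Tianyi Shang,Zhenyu Li,Pengjie Xu,Jinwei Qiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages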

Abstract:Vision Language Place Recognition (VLVPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLVPR directs robot place matching, overcoming the constraint of solely depending on vision. The essence of multimodal fusion lies in mining the complementary information between different modalities. However, general fusion methods rely on traditional neural architectures and are not well equipped to capture the dynamics of cross-modal interactions, especially in the presence of complex intra-modal and inter-modal correlations. To this end, this paper proposes a novel coarse-to-fine, end-to-end connected cross-modal place recognition framework called MambaPlace. In the coarse localization stage, the text description and 3D point cloud are encoded by the pretrained T5 and instance encoder, respectively. They are then processed using Text Attention Mamba (TAM) and Point Clouds Mamba (PCM) for data enhancement and alignment. In the subsequent fine localization stage, the features of the text description and 3D point cloud are cross-modally fused and further enhanced through cascaded Cross Attention Mamba (CCAM). Finally, we predict the positional offset from the fused text-point-cloud features, achieving the most accurate localization. Extensive experiments show that MambaPlace achieves improved localization accuracy on the KITTI360Pose dataset compared to state-of-the-art methods.

[CV-89] High Quality Segmentation for Ultra High-resolution Images

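Quick read: This paper addresses the trade-off between computation cost and accuracy when segmenting ultra high-resolution (4K or 6K) images. Existing strategies such as down-sampling, patch cropping, and cascade models do not resolve this tension well. The paper proposes the Continuous Refinement Model (CRM), which continuously aligns feature maps and aggregates features to reconstruct image details, refining the result step by step from coarse to fine. A key property of CRM is its strong generalization ability, which bridges the resolution gap between low-resolution training images and ultra high-resolution test images. The method is fast and effective on the image segmentation refinement task.

Link: https://arxiv.org/abs/2111.14482
Authors: Tiancheng Shen,Yuechen Zhang,Lu Qi,Jason Kuen,Xingyu Xie,Jianlong Wu,Zhe Lin,Jiaya Jia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: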

Abstract:Segmenting 4K or 6K ultra high-resolution images requires extra attention to computation cost. Common strategies, such as down-sampling, patch cropping, and cascade models, do not balance accuracy and computation cost well. Motivated by the fact that humans distinguish among objects continuously from coarse to precise levels, we propose the Continuous Refinement Model (CRM) for the ultra high-resolution segmentation refinement task. CRM continuously aligns the feature map with the refinement target and aggregates features to reconstruct these images' details. Besides, our CRM shows significant generalization ability in bridging the resolution gap between low-resolution training images and ultra high-resolution testing ones. We present quantitative performance evaluation and visualization to show that our proposed method is fast and effective on image segmentation refinement. Code will be released at this https URL.

[CV-90] Anatomy-Informed Deep Learning and Radiomics for Automated Neurofibroma Segmentation in Whole-Body MRI

Quick read: This paper addresses accurate, automated segmentation of neurofibromas (NFs) in whole-body magnetic resonance imaging (WB-MRI) of patients with Neurofibromatosis Type 1 (NF1). The key to the solution is a fully automated three-stage pipeline: anatomy segmentation, NF segmentation, and tumor candidate classification. The MRSegmentator model generates an anatomy segmentation mask, which is combined with the input image as anatomical context to improve NF segmentation accuracy. An ensemble of 3D anisotropic anatomy-informed U-Nets then produces an NF segmentation confidence mask; tumor candidates are extracted from this mask and classified using radiomic features, separating tumors from non-tumor regions and reducing false positives. With anatomy information integrated, experiments in high tumor burden cases show a 68% improvement in per-scan Dice Similarity Coefficient (DSC), a 21% increase in per-tumor DSC, and a two-fold improvement in F1 score for tumor detection.

Link: https://arxiv.org/abs/2502.15424
Authors: Georgii Kolokolnikov,Marie-Lena Schmalhofer,Lennart Well,Said Farschtschi,Victor-Felix Mautner,Inka Ristow,Rene Werner
Affiliations: University Medical Center Hamburg-Eppendorf; Institute for Applied Medical Informatics; Institute of Computational Neuroscience; Center for Biomedical Artificial Intelligence (bAIome); Department of Diagnostic and Interventional Radiology and Nuclear Medicine; Department of Neurology
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neurofibromatosis Type 1 is a genetic disorder characterized by the development of neurofibromas (NFs), which exhibit significant variability in size, morphology, and anatomical location. Accurate and automated segmentation of these tumors in whole-body magnetic resonance imaging (WB-MRI) is crucial to assess tumor burden and monitor disease progression. In this study, we present and analyze a fully automated pipeline for NF segmentation in fat-suppressed T2-weighted WB-MRI, consisting of three stages: anatomy segmentation, NF segmentation, and tumor candidate classification. In the first stage, we use the MRSegmentator model to generate an anatomy segmentation mask, extended with a high-risk zone for NFs. This mask is concatenated with the input image as anatomical context information for NF segmentation. The second stage employs an ensemble of 3D anisotropic anatomy-informed U-Nets to produce an NF segmentation confidence mask. In the final stage, tumor candidates are extracted from the confidence mask and classified based on radiomic features, distinguishing tumors from non-tumor regions and reducing false positives. We evaluate the proposed pipeline on three test sets representing different conditions: in-domain data (test set 1), varying imaging protocols and field strength (test set 2), and low tumor burden cases (test set 3). Experimental results show a 68% improvement in per-scan Dice Similarity Coefficient (DSC), a 21% increase in per-tumor DSC, and a two-fold improvement in F1 score for tumor detection in high tumor burden cases by integrating anatomy information. The method is integrated into the 3D Slicer platform for practical clinical use, with the code publicly accessible.
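
The two headline metrics in this abstract, the Dice Similarity Coefficient and the detection F1 score, have standard definitions that can be written in a few lines. The sketch below is illustrative only: masks are represented as sets of voxel coordinates, the empty-mask convention is an assumption, and the toy masks and counts are made up for demonstration rather than taken from the paper.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two sets of voxel indices."""
    if not pred and not truth:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))

def f1_score(tp, fp, fn):
    """Detection F1 from counts of true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy 3-D masks given as (z, y, x) voxel coordinates.
truth_mask = {(0, 0, 1), (0, 0, 2), (0, 1, 1), (0, 1, 2)}
pred_mask = {(0, 0, 2), (0, 1, 1), (0, 1, 2), (0, 2, 2)}
print("DSC:", dice(pred_mask, truth_mask))

# Removing false-positive candidates (the goal of a classification stage) raises F1.
print("F1 before filtering:", f1_score(tp=8, fp=6, fn=2))
print("F1 after filtering: ", f1_score(tp=8, fp=2, fn=2))
```

Per-scan DSC applies `dice` to the whole mask of a scan, whereas per-tumor DSC would apply it to each connected tumor region separately.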

[CV-91] Quantum autoencoders for image classification

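Quick read: This paper addresses the difficulty classical machine learning has with complex, high-dimensional data and proposes a new image-classification method based on quantum autoencoders (QAEs). The key is that data compression and reconstruction are carried out entirely in quantum circuits, with classical optimization used only to tune parameters, which yields purely quantum feature extraction. Experiments show that specific parameterized quantum circuit (ansatz) structures markedly improve classification accuracy, and that the method performs comparably to conventional machine-learning methods while substantially reducing the number of parameters to optimize. This suggests that QAEs can serve as efficient classification models and highlights the potential of quantum circuits for end-to-end learning, in contrast to hybrid methods such as the quantum convolutional neural network (QCNN).

Link: https://arxiv.org/abs/2502.15254
Authors: Hinako Asaoka,Kazue Kudo
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: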

Abstract:Classical machine learning often struggles with complex, high-dimensional data. Quantum machine learning offers a potential solution, promising more efficient processing. While the quantum convolutional neural network (QCNN), a hybrid quantum-classical algorithm, is suitable for current noisy intermediate-scale quantum-era hardware, its learning process relies heavily on classical computation. Future large-scale, gate-based quantum computers could unlock the full potential of quantum effects in machine learning. In contrast to QCNNs, quantum autoencoders (QAEs) leverage classical optimization solely for parameter tuning. Data compression and reconstruction are handled entirely within quantum circuits, enabling purely quantum-based feature extraction. This study introduces a novel image-classification approach using QAEs, achieving classification without requiring additional qubits compared with conventional QAE implementations. The quantum circuit structure significantly impacts classification accuracy. Unlike hybrid methods such as QCNN, QAE-based classification emphasizes quantum computation. Our experiments demonstrate high accuracy in a four-class classification task, evaluating various quantum-gate configurations to understand the impact of different parameterized quantum circuit (ansatz) structures on classification performance. Our results reveal that specific ansatz structures achieve superior accuracy, and we provide an analysis of their effectiveness. Moreover, the proposed approach achieves performance comparable to that of conventional machine-learning methods while significantly reducing the number of parameters requiring optimization. These findings indicate that QAEs can serve as efficient classification models with fewer parameters and highlight the potential of utilizing quantum circuits for complete end-to-end learning, a departure from hybrid approaches such as QCNN.

[CV-92] Lung-DDPM: Semantic Layout-guided Diffusion Models for Thoracic CT Image Synthesis

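Quick read: This paper addresses data scarcity in lung cancer screening: the costly annotation process and privacy concerns limit the construction of large-scale medical datasets, which hampers further application of AI in healthcare. The key solution is Lung-DDPM, a method based on semantic layout-guided denoising diffusion probabilistic models (DDPM) that generates high-fidelity 3D synthetic CT images. These synthetic images prove helpful in the downstream lung nodule segmentation task, and the method outperforms other state-of-the-art generative models in both image quality evaluation and segmentation performance.

Link: https://arxiv.org/abs/2502.15204
Authors: Yifan Jiang,Yannick Lemaréchal,Josée Bafaro,Jessica Abi-Rjeile,Philippe Joubert,Philippe Després,Venkata Manem
Affiliations: Centre de recherche du CHU de Québec-Université Laval; Quebec Heart & Lung Institute Research Center, Québec, Canada
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code and pretrained models are available at this https URL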

Abstract:With the rapid development of artificial intelligence (AI), AI-assisted medical imaging analysis demonstrates remarkable performance in early lung cancer screening. However, the costly annotation process and privacy concerns limit the construction of large-scale medical datasets, hampering the further application of AI in healthcare. To address the data scarcity in lung cancer screening, we propose Lung-DDPM, a thoracic CT image synthesis approach that effectively generates high-fidelity 3D synthetic CT images, which prove helpful in downstream lung nodule segmentation tasks. Our method is based on semantic layout-guided denoising diffusion probabilistic models (DDPM), enabling anatomically reasonable, seamless, and consistent sample generation even from incomplete semantic layouts. Our results suggest that the proposed method outperforms other state-of-the-art (SOTA) generative models in image quality evaluation and downstream lung nodule segmentation tasks. Specifically, Lung-DDPM achieved superior performance on our large validation cohort, with a Fréchet inception distance (FID) of 0.0047, maximum mean discrepancy (MMD) of 0.0070, and mean squared error (MSE) of 0.0024. These results were 7.4×, 3.1×, and 29.5× better than the second-best competitors, respectively. Furthermore, the lung nodule segmentation model, trained on a dataset combining real and Lung-DDPM-generated synthetic samples, attained a Dice coefficient (Dice) of 0.3914 and sensitivity of 0.4393. This represents 8.8% and 18.6% improvements in Dice and sensitivity compared to the model trained solely on real samples. The experimental results highlight Lung-DDPM’s potential for a broader range of medical imaging applications, such as general tumor segmentation, cancer survival estimation, and risk prediction.

[CV-93] Interleaved Block-based Learned Image Compression with Feature Enhancement and Quantization Error Compensation

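Quick read: This paper addresses two key challenges in learned image compression (LIC): obtaining a more compact latent representation and reducing the impact of quantization errors. To this end, it proposes a feature extraction module, a feature refinement module, and a feature enhancement module. Together these modules shuffle pixels, split the image into sub-images, extract coarse features, stack the features, and exploit correlations across channels and within and across sub-images to learn more compact latent features and to reduce information loss after quantization. A quantization error compensation module is also proposed to mitigate the quantization mismatch between training and testing.

Link: https://arxiv.org/abs/2502.15188
Authors: Shiqi Jiang,Hui Yuan,Shuai Li,Raouf Hamzaoui,Xu Wang,Junyan Huo
Affiliations: School of Software, Shandong University; School of Control Science and Engineering, Shandong University; School of Engineering and Sustainable Development, De Montfort University; College of Computer Science and Software Engineering, Shenzhen University; School of Telecommunications Engineering, Xidian University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: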

Abstract:In recent years, learned image compression (LIC) methods have achieved significant performance improvements. However, obtaining a more compact latent representation and reducing the impact of quantization errors remain key challenges in the field of LIC. To address these challenges, we propose a feature extraction module, a feature refinement module, and a feature enhancement module. Our feature extraction module shuffles the pixels in the image, splits the resulting image into sub-images, and extracts coarse features from the sub-images. Our feature refinement module stacks the coarse features and uses an attention refinement block composed of concatenated three-dimensional convolution residual blocks to learn more compact latent features by exploiting correlations across channels, within sub-images (intra-sub-image correlations), and across sub-images (inter-sub-image correlations). Our feature enhancement module reduces information loss in the decoded features following quantization. We also propose a quantization error compensation module that mitigates the quantization mismatch between training and testing. Our four modules can be readily integrated into state-of-the-art LIC methods. Experiments show that combining our modules with Tiny-LIC outperforms existing LIC methods and image compression standards in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM) on the Kodak dataset and the CLIC dataset.
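
The abstract describes shuffling pixels and splitting the result into sub-images without giving the exact rearrangement. One plausible reading, assumed here purely for illustration, is an interleaved (space-to-depth) split in which each sub-image takes every second pixel at a fixed offset. The function names and the factor of 2 are hypothetical and not taken from the paper.

```python
def split_interleaved(image, factor=2):
    """Split a 2-D image (list of rows) into factor*factor interleaved sub-images.

    Sub-image (r, c) collects the pixels at rows r, r+factor, ... and
    columns c, c+factor, ..., so each sub-image is a shifted, downsampled
    view of the input (a space-to-depth rearrangement).
    """
    h, w = len(image), len(image[0])
    if h % factor or w % factor:
        raise ValueError("image height and width must be divisible by factor")
    return {
        (r, c): [row[c::factor] for row in image[r::factor]]
        for r in range(factor)
        for c in range(factor)
    }

def merge_interleaved(subs, factor=2):
    """Inverse of split_interleaved: put every pixel back in its original place."""
    sh, sw = len(subs[(0, 0)]), len(subs[(0, 0)][0])
    image = [[None] * (sw * factor) for _ in range(sh * factor)]
    for (r, c), sub in subs.items():
        for i, row in enumerate(sub):
            for j, value in enumerate(row):
                image[i * factor + r][j * factor + c] = value
    return image

toy = [[4 * i + j for j in range(4)] for i in range(4)]  # 4x4 image, values 0..15
subs = split_interleaved(toy)
print(subs[(0, 0)], subs[(1, 1)])
```

Because neighbouring pixels land in different sub-images, the sub-images are strongly correlated with one another, which is the kind of inter-sub-image redundancy a refinement stage could exploit.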

[CV-94] LUMINA-Net: Low-light Upgrade through Multi-stage Illumination and Noise Adaptation Network for Image Enhancement

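Quick read: This paper addresses common challenges in low-light image enhancement (LLIE), such as noise, over-exposure, and color distortion, which typically cause marked degradation in image quality. The key solution is LUMINA-Net, a deep learning framework that integrates multi-stage illumination and reflectance modules. First, the illumination module adjusts brightness and contrast while preserving fine texture details; second, the reflectance module introduces a noise reduction mechanism based on spatial attention and channel-wise feature refinement to mitigate noise contamination. In experiments on the LOL and SICE datasets, LUMINA-Net surpasses existing methods on peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS), demonstrating its effectiveness for low-light image enhancement.

Link: https://arxiv.org/abs/2502.15186
Authors: Namrah Siddiqua,Kim Suneung
Affiliations: Korea University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures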

Abstract:Low-light image enhancement (LLIE) is a crucial task in computer vision that aims to enhance the visual fidelity of images captured under low-illumination conditions. Conventional methods frequently struggle to mitigate pervasive shortcomings such as noise, over-exposure, and color distortion, which leads to a pronounced degradation in image quality. To address these challenges, we propose LUMINA-Net, an advanced deep learning framework that integrates multi-stage illumination and reflectance modules. First, the illumination module intelligently adjusts brightness and contrast levels while meticulously preserving intricate textural details. Second, the reflectance module incorporates a noise reduction mechanism that leverages spatial attention and channel-wise feature refinement to mitigate noise contamination. In experiments on the LOL and SICE datasets using the PSNR, SSIM, and LPIPS metrics, LUMINA-Net surpasses state-of-the-art methods, showcasing its efficacy in low-light image enhancement.

[CV-95] FD-LSCIC: Frequency Decomposition-based Learned Screen Content Image Compression

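Quick read: This paper addresses three key challenges in screen content (SC) image compression: learning compact latent features, adapting quantization step sizes, and the lack of large-scale SC image datasets. It proposes a novel compression method built on a multi-frequency two-stage octave residual block (MToRB) for feature extraction, a cascaded triple-scale feature fusion residual block (CTSFRB) for multi-scale feature integration, and a multi-frequency context interaction module (MFCIM) that reduces inter-frequency correlations. An adaptive quantization module learns scaled uniform noise for each frequency component, allowing flexible control over quantization granularity. The authors also construct a large-scale dataset (SDU-SCICD10K) of more than 10,000 SC images of different types, and the resulting method significantly improves SC image compression performance.

Link: https://arxiv.org/abs/2502.15174
Authors: Shiqi Jiang,Hui Yuan,Shuai Li,Huanqiang Zeng,Sam Kwong
Affiliations: School of Control Science and Engineering, Shandong University, Ji’nan, 250100, China; Shandong Inspur Artificial Intelligence Research Institute Co., Ltd., Ji’nan, China; School of Information Science and Engineering, Huaqiao University, Xiamen 361021, China; School of Data Science, Lingnan University, Hong Kong
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: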

Abstract:The learned image compression (LIC) methods have already surpassed traditional techniques in compressing natural scene (NS) images. However, directly applying these methods to screen content (SC) images, which possess distinct characteristics such as sharp edges, repetitive patterns, embedded text and graphics, yields suboptimal results. This paper addresses three key challenges in SC image compression: learning compact latent features, adapting quantization step sizes, and the lack of large SC datasets. To overcome these challenges, we propose a novel compression method that employs a multi-frequency two-stage octave residual block (MToRB) for feature extraction, a cascaded triple-scale feature fusion residual block (CTSFRB) for multi-scale feature integration and a multi-frequency context interaction module (MFCIM) to reduce inter-frequency correlations. Additionally, we introduce an adaptive quantization module that learns scaled uniform noise for each frequency component, enabling flexible control over quantization granularity. Furthermore, we construct a large SC image compression dataset (SDU-SCICD10K), which includes over 10,000 images spanning basic SC images, computer-rendered images, and mixed NS and SC images from both PC and mobile platforms. Experimental results demonstrate that our approach significantly improves SC image compression performance, outperforming traditional standards and state-of-the-art learning-based methods in terms of peak signal-to-noise ratio (PSNR) and multi-scale structural similarity (MS-SSIM).
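
The adaptive quantization idea in this abstract builds on a common device in learned compression: additive uniform noise stands in for rounding during training, and the width of that noise sets the quantization step. The sketch below illustrates this with one step size per frequency component. It is a simplified illustration under assumptions: in the paper the scaling is learned, whereas the step sizes, component names, and function names here are fixed and hypothetical.

```python
import random

def quantize_train(latent, step, rng):
    """Training-time proxy: add uniform noise in [-step/2, step/2)."""
    return [y + (rng.random() - 0.5) * step for y in latent]

def quantize_test(latent, step):
    """Test-time hard quantization to the nearest multiple of step."""
    return [round(y / step) * step for y in latent]

# Hypothetical per-frequency step sizes: finer steps for the low-frequency part.
steps = {"low": 0.5, "high": 2.0}
latent = [0.9, -1.2, 3.4]
rng = random.Random(0)
for name, step in steps.items():
    noisy = quantize_train(latent, step, rng)
    hard = quantize_test(latent, step)
    print(name, [round(v, 2) for v in noisy], hard)
```

A larger step gives coarser values and fewer bits at the cost of more distortion, so choosing the step per frequency component controls quantization granularity where it matters most.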

[CV-96] Optimized Pap Smear Image Enhancement: Hybrid PMD Filter-CLAHE Using Spider Monkey Optimization

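Quick read: This paper addresses the effect of Pap smear image quality on detection accuracy in cervical cancer screening. The key to the solution is an optimized hybrid method that combines the Perona-Malik Diffusion (PMD) filter with contrast-limited adaptive histogram equalization (CLAHE), with parameters tuned by the spider monkey optimization (SMO) algorithm. The PMD filter reduces image noise and CLAHE improves image contrast, which together significantly improve Pap smear image quality.

Link: https://arxiv.org/abs/2502.15156
Authors: Ach Khozaimi,Isnani Darti,Syaiful Anam,Wuryansari Muharini Kusumawinahyu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: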

Abstract:Pap smear image quality is crucial for cervical cancer detection. This study introduces an optimized hybrid approach that combines the Perona-Malik Diffusion (PMD) filter with contrast-limited adaptive histogram equalization (CLAHE) to enhance Pap smear image quality. The PMD filter reduces the image noise, whereas CLAHE improves the image contrast. The hybrid method was optimized using spider monkey optimization (SMO PMD-CLAHE). BRISQUE and CEIQ are the new objective functions for the PMD filter and CLAHE optimization, respectively. The simulations were conducted using the SIPaKMeD dataset. The results indicate that SMO outperforms state-of-the-art methods in optimizing the PMD filter and CLAHE. The proposed method achieved an average effective measure of enhancement (EME) of 5.45, root mean square (RMS) contrast of 60.45, Michelson’s contrast (MC) of 0.995, and entropy of 6.80. This approach offers a new perspective for improving Pap smear image quality.
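
The Perona-Malik diffusion filter named in this abstract smooths an image while slowing diffusion across strong edges. The sketch below implements the textbook explicit scheme with the exponential edge-stopping function on a small grayscale image stored as nested lists. The parameter values (`kappa`, `lam`, `n_iters`) are placeholders chosen for illustration; in the paper such parameters are tuned by spider monkey optimization, which is not shown here.

```python
import math

def perona_malik(image, n_iters=10, kappa=15.0, lam=0.2):
    """Perona-Malik anisotropic diffusion on a 2-D grayscale image (list of rows).

    kappa sets which gradients count as edges (larger gradients diffuse
    less); lam is the step size and should stay at or below 0.25 for stability.
    """
    h, w = len(image), len(image[0])
    img = [[float(v) for v in row] for row in image]

    def g(d):
        # Exponential edge-stopping function: near 1 for small d, near 0 for large d.
        return math.exp(-((d / kappa) ** 2))

    for _ in range(n_iters):
        new = [row[:] for row in img]
        for i in range(h):
            for j in range(w):
                flux = 0.0
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:  # out-of-range neighbours add no flux
                        d = img[ni][nj] - img[i][j]
                        flux += g(d) * d
                new[i][j] = img[i][j] + lam * flux
        img = new
    return img

# Toy image: small ripples on both sides of a strong vertical edge.
noisy = [[10, 14, 10, 110, 114, 110] for _ in range(4)]
smooth = perona_malik(noisy)
print([round(v, 1) for v in smooth[0]])
```

On the toy image the ripples on each side of the edge are flattened while the jump of about 100 gray levels between the third and fourth columns is kept.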

[CV-97] Compact Latent Representation for Image Compression (CLRIC)

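Quick read: This paper addresses the need of existing image compression models to train a separate model for each quality level, which is resource-intensive in both training and storage. The key solution is to use latent variables from pre-trained models (such as the Stable Diffusion variational autoencoder) for perceptual image compression. Overfitted learnable functions applied to the target model's latent representation compress it at any desired quality level, so no separate models are needed for different quality levels. The approach keeps computational complexity low, uses resources efficiently during training and decoding, and achieves perceptual quality comparable to state-of-the-art learned image compression models while being model-agnostic and resolution-agnostic.

Link: https://arxiv.org/abs/2502.14937
Authors: Ayman A. Ameen,Thomas Richter,André Kaup
Affiliations: Fraunhofer Institute for Integrated Circuits IIS, Erlangen, Germany; Department of Physics, Faculty of Science, Sohag University, Egypt; Friedrich-Alexander University at Erlangen-Nürnberg, Erlangen, Germany
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: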

Abstract:Current image compression models often require separate models for each quality level, making them resource-intensive in terms of both training and storage. To address these limitations, we propose an innovative approach that utilizes latent variables from pre-existing trained models (such as the Stable Diffusion Variational Autoencoder) for perceptual image compression. Our method eliminates the need for distinct models dedicated to different quality levels. We employ overfitted learnable functions to compress the latent representation from the target model at any desired quality level. These overfitted functions operate in the latent space, ensuring low computational complexity, around 25.5 MAC/pixel for a forward pass on images of 1363 × 2048 pixels. This approach efficiently utilizes resources during both training and decoding. Our method achieves comparable perceptual quality to state-of-the-art learned image compression models while being both model-agnostic and resolution-agnostic. This opens up new possibilities for the development of innovative image compression methods.

[CV-98] Reducing false positives in strong lens detection through effective augmentation and ensemble learning

Quick read: This paper studies how high-quality training data affects convolutional neural networks (CNNs) for detecting strong gravitational lenses. It stresses the importance of data diversity and representativeness and shows how changes in the sample population influence CNN performance. The key to the solution is the use of techniques such as data augmentation and ensemble learning, which lowered the false positive rate (FP rate) to 10^{-4} while keeping model completeness at an acceptable level, with only a 2.3% decrease in the number of true positive samples. These methods improve the robustness of gravitational lens detection models, and the results were validated on the Kilo Degree Survey (KiDS) dataset.

Link: https://arxiv.org/abs/2502.14936
Authors: Samira Rezaei,Amirmohammad Chegeni,Bharath Chowdhary Nagam,J. P. McKean,Mitra Baratchi,Koen Kuijken,Léon V. E. Koopmans
Affiliations: Leiden Observatory, Leiden University; Leiden Institute of Advanced Computer Science (LIACS), Leiden University; Dipartimento di Fisica e Astronomia "G. Galilei", Università di Padova; INFN-Padova; Kapteyn Astronomical Institute, University of Groningen; South African Radio Astronomy Observatory (SARAO); Department of Physics, University of Pretoria
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 14 figures, 7 tables, Accepted for publication in MNRAS

Abstract:This research studies the impact of high-quality training datasets on the performance of Convolutional Neural Networks (CNNs) in detecting strong gravitational lenses. We stress the importance of data diversity and representativeness, demonstrating how variations in sample populations influence CNN performance. In addition to the quality of training data, our results highlight the effectiveness of various techniques, such as data augmentation and ensemble learning, in reducing false positives while maintaining model completeness at an acceptable level. This enhances the robustness of gravitational lens detection models and advances capabilities in this field. Our experiments, employing variations of DenseNet and EfficientNet, achieved a best false positive rate (FP rate) of 10^{-4}, while successfully identifying over 88 per cent of genuine gravitational lenses in the test dataset. This represents an 11-fold reduction in the FP rate compared to the original training dataset. Notably, this substantial enhancement in the FP rate is accompanied by only a 2.3 per cent decrease in the number of true positive samples. Validated on the KiDS dataset, our findings offer insights applicable to ongoing missions, like Euclid.
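
The trade-off reported in this abstract, a lower false positive rate at nearly unchanged completeness, can be made concrete with the two metric definitions and a simple score-averaging ensemble. The sketch below is illustrative only: averaging is one basic form of ensemble learning and may differ from the paper's scheme, and the scores, labels, and threshold are toy values invented here.

```python
def ensemble_scores(per_model_scores):
    """Average the lens probabilities predicted by several models for each object."""
    n_models = len(per_model_scores)
    return [sum(scores) / n_models for scores in zip(*per_model_scores)]

def fp_rate_and_completeness(scores, labels, threshold=0.5):
    """Return (false positive rate, completeness) at a score threshold.

    labels: 1 for a genuine lens, 0 for a non-lens.
    FP rate = FP / (FP + TN); completeness (recall) = TP / (TP + FN).
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos

labels = [1, 1, 0, 0, 0, 0]
model_a = [0.9, 0.6, 0.7, 0.1, 0.2, 0.1]  # wrongly confident on the third object
model_b = [0.8, 0.7, 0.2, 0.6, 0.1, 0.2]  # wrongly confident on the fourth object
print("model A: ", fp_rate_and_completeness(model_a, labels))
print("model B: ", fp_rate_and_completeness(model_b, labels))
print("ensemble:", fp_rate_and_completeness(ensemble_scores([model_a, model_b]), labels))
```

Each model alone flags one non-lens, but the two models disagree on which one, so the averaged score for both falls below the threshold while the genuine lenses stay above it.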

[CV-99] Denoising segmentation and volumetric rendering of optical coherence tomography angiography (OCTA) image using deep learning techniques: a review

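Quick read: This paper addresses the noise and artifacts in optical coherence tomography angiography (OCTA) data, which impair diagnostic accuracy and repeatability. The key solution reviewed is the use of deep learning (DL) models to automatically detect and remove noise and artifacts and to improve image quality. These DL models also perform well at segmenting and identifying normal and pathological structures, which substantially improves the interpretation of OCTA images and the accuracy of measurements and predictions.

Link: https://arxiv.org/abs/2502.14935
Authors: Kejie Chen,Xiaochun Yang,Jing Na,Wenbo Wang
Affiliations: Department of Mechanical and Electronic Engineering, Kunming University of Technology; Eye Clinics, Yunnan First People’s Hospital
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: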

Abstract:Optical coherence tomography angiography (OCTA) is a non-invasive imaging technique widely used to study vascular structures and micro-circulation dynamics in the retina and choroid. OCTA has been widely used in clinics for diagnosing ocular disease and monitoring its progression, because OCTA is safer and faster than dye-based angiography while retaining the ability to characterize micro-scale structures. However, OCTA data contain considerable inherent noise from the devices and acquisition protocols and suffer from various types of artifacts, which impair diagnostic accuracy and repeatability. Deep learning (DL) based image analysis models can automatically detect and remove artifacts and noise and enhance the quality of image data. They are also powerful tools for segmentation and identification of normal and pathological structures in the images. Thus, the value of OCTA imaging can be significantly enhanced by DL-based approaches for interpreting OCTA data and performing measurements and predictions on them. In this study, we reviewed literature on DL models for OCTA images from the last five years. In particular, we focused on discussing the current problems in OCTA data and the corresponding design principles of the DL models. We also reviewed the state-of-the-art DL models for 3D volumetric reconstruction of the vascular networks and pathological structures such as edema and distorted optic disc. In addition, the publicly available datasets of OCTA images are summarized at the end of this review. Overall, this review can provide valuable insights for engineers to develop novel DL models by utilizing the characteristics of OCTA signals and images. The pros and cons of each DL method and its applications discussed in this review can help technicians and clinicians choose proper DL models for fundamental research and disease screening.

[CV-100] Distributed U-net model and Image Segmentation for Lung Cancer Detection

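[Quick read]: This paper aims to address the early detection and accurate diagnosis of lung diseases such as lung cancer and chronic obstructive pulmonary disease (COPD) in the wake of the COVID-19 pandemic, in order to improve treatment effectiveness and patient outcomes. The key to the solution is the use of computer-aided design (CAD) systems, in particular a detailed study of the U-Net model and its enhancement through combination with the VGG16 algorithm, to achieve accurate segmentation of lung CT images. Experimental results show that the U-Net model performs well across multiple hardware configurations, especially when four GPUs are used for distributed learning, highlighting the strong potential of U-Net-based CAD systems for lung disease detection and diagnosis.

Link: https://arxiv.org/abs/2502.14928
Authors: Tianzuo Hu
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Methodology (stat.ME)
Notes: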

Abstract:In the wake of the COVID-19 pandemic that began in 2019, lung diseases, especially lung cancer and chronic obstructive pulmonary disease (COPD), have become an urgent global health issue. Early detection and accurate diagnosis of these conditions are critical for effective treatment and improved patient outcomes. To further research and reduce the error rate of hospital diagnoses, this study explores the potential of computer-aided design (CAD) systems built on advanced deep learning models such as U-Net. Compared with prior literature, this study examines the capabilities of U-Net in detail and enhances the CAD system with the VGG16 algorithm. An extensive dataset of lung CT images and corresponding segmentation masks, curated collaboratively by multiple academic institutions, serves as the basis for empirical validation. The efficiency of the U-Net model is evaluated under multiple hardware configurations (single CPU, single GPU, distributed GPU, and federated learning), and the effectiveness of the method in lung disease segmentation is demonstrated. Empirical results affirm the robust performance of the U-Net model, which is most effective when four GPUs are used for distributed learning, and highlight the potential of U-Net-based CAD systems for accurate and timely lung disease detection and diagnosis.

[CV-101] Display Field-Of-View Agnostic Robust CT Kernel Synthesis Using Model-Based Deep Learning

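[Quick read]: This paper aims to address the increased image processing time and storage requirements that arise in X-ray computed tomography (CT) imaging from the choice of Display Field-of-View (DFOV) and reconstruction kernel. The key contribution is a model-based deep learning method that efficiently performs DFOV-agnostic, image-based kernel synthesis. The method explicitly integrates CT reconstruction kernel and DFOV characteristics into the forward model. Experiments on clinical data, together with a quantitative analysis of the modulation transfer function estimated from wire phantom data, demonstrate its utility for real-time use, and it is more robust to DFOV variations than a direct learning network that lacks forward model information.

Link: https://arxiv.org/abs/2502.14920
Authors: Hemant Kumar Aggarwal, Antony Jerald, Phaneendra K. Yalavarthy, Rajesh Langoju, Bipul Das
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at IEEE ISBI 2025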

Abstract:In X-ray computed tomography (CT) imaging, the choice of reconstruction kernel is crucial as it significantly impacts the quality of clinical images. Different kernels influence spatial resolution, image noise, and contrast in various ways. Clinical applications involving lung imaging often require images reconstructed with both soft and sharp kernels. Reconstructing images with different kernels requires raw sinogram data, and storing images for all kernels increases processing time and storage requirements. The Display Field-of-View (DFOV) adds complexity to kernel synthesis, as data acquired at different DFOVs exhibit varying levels of sharpness and detail. This work introduces an efficient, DFOV-agnostic solution for image-based kernel synthesis using model-based deep learning. The proposed method explicitly integrates CT kernel and DFOV characteristics into the forward model. Experimental results on clinical data, along with quantitative analysis of the estimated modulation transfer function using wire phantom data, clearly demonstrate the utility of the proposed method in real time. Additionally, a comparative study with a direct learning network that lacks forward model information shows that the proposed method is more robust to DFOV variations.

[CV-102] Pulmonary Tuberculosis Edge Diagnosis System Based on MindSpore Framework: Low-cost and High-precision Implementation with Ascend 310 Chip

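[Quick read]: This paper aims to address the major challenges facing pulmonary tuberculosis (PTB) diagnosis worldwide, particularly in regions with scarce medical resources. The key to the solution is an auxiliary diagnosis system built on Huawei's MindSpore framework and the Ascend310 edge computing chip. The system uses the MobileNetV3 architecture with a Softmax cross-entropy loss function and a momentum optimizer, and runs with FP16 mixed precision on the Orange Pie AIPro (Atlas 200 DK) edge device. It reaches a model accuracy of 99.1% (AUC = 0.99) while keeping equipment cost within 150 US dollars, providing an affordable AI-assisted diagnosis option for primary care.

Link: https://arxiv.org/abs/2502.14885
Authors: HaoYu Li
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: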

Abstract:Pulmonary Tuberculosis (PTB) remains a major challenge for global health, especially in areas with poor medical resources, where access to specialized medical knowledge and diagnostic tools is limited. This paper presents an auxiliary diagnosis system for pulmonary tuberculosis based on the Huawei MindSpore framework and the Ascend310 edge computing chip. The system uses the MobileNetV3 architecture and a Softmax cross-entropy loss function with a momentum optimizer. It runs with FP16 mixed precision on the Orange pie AIPro (Atlas 200 DK) edge device and performs well. On a test set containing 4148 chest images, the model accuracy reached 99.1% (AUC = 0.99), and the equipment cost was kept within $150, providing an affordable AI-assisted diagnosis scheme for primary care.
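
As a rough illustration of the two training ingredients the abstract names, the sketch below computes a softmax cross-entropy loss and performs one momentum update in plain Python. It is not the paper's MindSpore code: the function names, the learning rate, the momentum coefficient, and the particular momentum formulation (velocity accumulates raw gradients) are all assumptions chosen for illustration.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a single integer class label."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    # -log softmax(logits)[target] = logsumexp(logits) - logits[target]
    return log_sum_exp - logits[target_index]


def momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update; returns new (params, velocity) lists.

    Assumed formulation: v <- momentum * v + g, then p <- p - lr * v.
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Hypothetical two-class example (e.g. PTB vs. normal) with made-up logits.
loss = softmax_cross_entropy([2.0, -1.0], target_index=0)
params, velocity = momentum_step([1.0], [0.5], [0.0], lr=0.1, momentum=0.9)
```

Frameworks differ in how they scale the velocity term, so the exact update in any given library should be checked against its documentation.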

[CV-103] A Baseline Method for Removing Invisible Image Watermarks using Deep Image Prior

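[Quick read]: This paper aims to address the removal of invisible image watermarks with a black-box method. The key to the solution is to regress a single watermarked image with a deep image prior (DIP): from the intermediate steps of DIP, one can reliably find an adversarial example that removes the invisible watermark while preserving high image quality.

Link: https://arxiv.org/abs/2502.13998
Authors: Hengyue Liang, Taihui Li, Ju Sun
Affiliation: University of Minnesota, Twin Cities
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Notes: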

Abstract:Image watermarks have been considered a promising technique to help detect AI-generated content, which can be used to protect copyright or prevent fake image abuse. In this work, we present a black-box method for removing invisible image watermarks, without the need for any dataset of watermarked images or any knowledge about the watermark system. Our approach is simple to implement: given a single watermarked image, we regress it by deep image prior (DIP). We show that from the intermediate steps of DIP one can reliably find an evasion image that can remove invisible watermarks while preserving high image quality. Due to its unique working mechanism and practical effectiveness, we advocate including DIP as a baseline evasion method for benchmarking the robustness of watermarking systems. Finally, by showing the limited ability of DIP and other existing black-box methods in evading training-based visible watermarks, we discuss the positive implications for the practical use of training-based visible watermarks to prevent misinformation abuse.

Artificial Intelligence

[AI-0] Multi-Agent Architecture in Distributed Environment Control Systems: vision, challenges, and opportunities

Link: https://arxiv.org/abs/2502.15663
Authors: Natasha Astudillo, Fernando Koch
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Notes: 6 pages, 1 figure, 1 table

Abstract:The increasing demand for energy-efficient solutions in large-scale infrastructure, particularly data centers, requires advanced control strategies to optimize environmental management systems. We propose a multi-agent architecture for distributed control of air-cooled chiller systems in data centers. Our vision employs autonomous agents to monitor and regulate local operational parameters and optimize system-wide efficiency. We demonstrate how this approach improves the responsiveness, operational robustness, and energy efficiency of the system, contributing to the broader goal of sustainable infrastructure management.

[AI-1] Automating Curriculum Learning for Reinforcement Learning using a Skill-Based Bayesian Network

Link: https://arxiv.org/abs/2502.15662
Authors: Vincent Hsiao, Mark Roberts, Laura M. Hiatt, George Konidaris, Dana Nau
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:A major challenge for reinforcement learning is automatically generating curricula to reduce training time or improve performance in some target task. We introduce SEBNs (Skill-Environment Bayesian Networks) which model a probabilistic relationship between a set of skills, a set of goals that relate to the reward structure, and a set of environment features to predict policy performance on (possibly unseen) tasks. We develop an algorithm that uses the inferred estimates of agent success from SEBN to weigh the possible next tasks by expected improvement. We evaluate the benefit of the resulting curriculum on three environments: a discrete gridworld, continuous control, and simulated robotics. The results show that curricula constructed using SEBN frequently outperform other baselines.

[AI-2] Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

Link: https://arxiv.org/abs/2502.15657
Authors: Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:The leading AI companies are increasingly focused on building generalist AI agents – systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.

[AI-3] AutoTandemML: Active Learning Enhanced Tandem Neural Networks for Inverse Design Problems

Link: https://arxiv.org/abs/2502.15643
Authors: Luka Grbcic, Juliane Müller, Wibe Albert de Jong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
Notes:

Abstract:Inverse design in science and engineering involves determining optimal design parameters that achieve desired performance outcomes, a process often hindered by the complexity and high dimensionality of design spaces, leading to significant computational costs. To tackle this challenge, we propose a novel hybrid approach that combines active learning with Tandem Neural Networks to enhance the efficiency and effectiveness of solving inverse design problems. Active learning allows selective sampling of the most informative data points, reducing the required dataset size without compromising accuracy. We investigate this approach using three benchmark problems: airfoil inverse design, photonic surface inverse design, and scalar boundary condition reconstruction in diffusion partial differential equations. We demonstrate that integrating active learning with Tandem Neural Networks outperforms standard approaches across the benchmark suite, achieving better accuracy with fewer training samples.

[AI-4] Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification

Link: https://arxiv.org/abs/2502.15637
Authors: Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, Ievgen Redko
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes:

Abstract:In recent years, there has been increasing interest in developing foundation models for time series data that can generalize across diverse downstream tasks. While numerous forecasting-oriented foundation models have been introduced, there is a notable scarcity of models tailored for time series classification. To address this gap, we present Mantis, a new open-source foundation model for time series classification based on the Vision Transformer (ViT) architecture that has been pre-trained using a contrastive learning approach. Our experimental results show that Mantis outperforms existing foundation models both when the backbone is frozen and when fine-tuned, while achieving the lowest calibration error. In addition, we propose several adapters to handle the multivariate setting, reducing memory requirements and modeling channel interdependence.

[AI-5] The Relationship Between Reasoning and Performance in Large Language Models – o3 (mini) Thinks Harder, Not Longer

Link: https://arxiv.org/abs/2502.15631
Authors: Marthe Ballon, Andres Algaba, Vincent Ginis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 19 pages, 11 figures

Abstract:Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.

[AI-6] Dynamic Knowledge Selector and Evaluator for recommendation with Knowledge Graph

Link: https://arxiv.org/abs/2502.15623
Authors: Feng Xia, Zhifei Hu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Notes:

Abstract:In recent years, recommendation systems have typically combined the edge information provided by knowledge graphs with the high-order connectivity of graph networks. However, this approach is limited by label sparsity and cannot learn the graph structure well, and the large number of noisy entities in the knowledge graph degrades the accuracy of the recommendation results. To alleviate these problems, we propose a dynamic knowledge-selecting and evaluating method guided by collaborative signals to distill information in the knowledge graph. Specifically, we use a Chain Route Evaluator to evaluate the contributions of different neighborhoods to the recommendation task and employ a Knowledge Selector strategy to filter out less informative knowledge before evaluation. We conduct baseline comparisons and ablation experiments on three public datasets. The experiments demonstrate that our proposed model outperforms current state-of-the-art baseline models, and the ablation experiments demonstrate the effectiveness of each module.

[AI-7] Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture

Link: https://arxiv.org/abs/2502.15620
Authors: John Burden, Marko Tešić, Lorenzo Pacchiardi, José Hernández-Orallo
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Research in AI evaluation has grown increasingly complex and multidisciplinary, attracting researchers with diverse backgrounds and objectives. As a result, divergent evaluation paradigms have emerged, often developing in isolation, adopting conflicting terminologies, and overlooking each other’s contributions. This fragmentation has led to insular research trajectories and communication barriers both among different paradigms and with the general public, contributing to unmet expectations for deployed AI systems. To help bridge this insularity, in this paper we survey recent work in the AI evaluation landscape and identify six main paradigms. We characterise major recent contributions within each paradigm across key dimensions related to their goals, methodologies and research cultures. By clarifying the unique combination of questions and approaches associated with each paradigm, we aim to increase awareness of the breadth of current evaluation approaches and foster cross-pollination between different paradigms. We also identify potential gaps in the field to inspire future research directions.

[AI-8] PDeepPP: A Deep learning framework with Pretrained Protein language for peptide classification

Link: https://arxiv.org/abs/2502.15610
Authors: Jixiu Zhai, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Xueying Wang, Dan Huang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 10 pages, 5 figures, submitted to arXiv

Abstract:Protein post-translational modifications (PTMs) and bioactive peptides (BPs) play critical roles in various biological processes and have significant therapeutic potential. However, identifying PTM sites and bioactive peptides through experimental methods is often labor-intensive, costly, and time-consuming. As a result, computational tools, particularly those based on deep learning, have become effective solutions for predicting PTM sites and peptide bioactivity. Despite progress in this field, existing methods still struggle with the complexity of protein sequences and the challenge of requiring high-quality predictions across diverse datasets. To address these issues, we propose a deep learning framework that integrates pretrained protein language models with a neural network combining transformer and CNN for peptide classification. By leveraging the ability of pretrained models to capture complex relationships within protein sequences, combined with the predictive power of parallel networks, our approach improves feature extraction while enhancing prediction accuracy. This framework was applied to multiple tasks involving PTM site and bioactive peptide prediction, utilizing large-scale datasets to enhance the model’s robustness. In the comparison across 33 tasks, the model achieved state-of-the-art (SOTA) performance in 25 of them, surpassing existing methods and demonstrating its versatility across different datasets. Our results suggest that this approach provides a scalable and effective solution for large-scale peptide discovery and PTM analysis, paving the way for more efficient peptide classification and functional annotation. 
MSC classes: 92C40, 68T07
ACM classes: I.2.6; J.3
Cite as: arXiv:2502.15610 [cs.LG] (or arXiv:2502.15610v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2502.15610

[AI-9] KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation

Link: https://arxiv.org/abs/2502.15602
Authors: Yoonjin Chung, Pilsun Eu, Junwon Lee, Keunwoo Choi, Juhan Nam, Ben Sangbae Chon
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Notes:

Abstract:Although being widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD’s advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
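
The abstract states that KAD is built on Maximum Mean Discrepancy (MMD). The sketch below shows the standard unbiased estimator of squared MMD between two sets of embedding vectors, using a Gaussian kernel. It is a generic illustration only: the kernel choice, the bandwidth, the embedding model, and the function names are assumptions, and none of this is the kadtk toolkit's API.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between two lists of vectors."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("need at least two samples in each set")
    # Within-set averages skip the diagonal (i == j); this is what makes
    # the estimator unbiased.
    k_xx = sum(gaussian_kernel(xs[i], xs[j], bandwidth)
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(gaussian_kernel(ys[i], ys[j], bandwidth)
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    # The cross-set average uses every (x, y) pair.
    k_xy = sum(gaussian_kernel(x, y, bandwidth)
               for x in xs for y in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical 2-D "embeddings" of reference and generated audio clips.
reference = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
generated = [[1.0, 1.1], [0.9, 1.2], [1.1, 0.8]]
score = mmd2_unbiased(reference, generated, bandwidth=1.0)
```

Because the estimator makes no Gaussian assumption about the embedding distributions, it matches the "distribution-free" property the abstract highlights. The unbiased form can return small negative values when the two sets are very similar.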

[AI-10] Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Link: https://arxiv.org/abs/2502.15588
Authors: Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

[AI-11] A Cautionary Tale About “Neutrally” Informative AI Tools Ahead of the 2025 Federal Elections in Germany

Link: https://arxiv.org/abs/2502.15568
Authors: Ina Dormuth, Sven Franke, Marlies Hafer, Tim Katzke, Alexander Marx, Emmanuel Müller, Daniel Neider, Markus Pauly, Jérôme Rutinowski
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:In this study, we examine the reliability of AI-based Voting Advice Applications (VAAs) and large language models (LLMs) in providing objective political information. Our analysis is based on a comparison with party responses to 38 statements of the Wahl-O-Mat, a well-established German online tool that helps inform voters by comparing their views with political party positions. For the LLMs, we identify significant biases. They exhibit a strong alignment (over 75% on average) with left-wing parties and a substantially lower alignment with center-right (below 50%) and right-wing parties (around 30%). Furthermore, for the VAAs, which are intended to inform voters objectively, we found substantial deviations from the parties' stated positions in Wahl-O-Mat: while one VAA deviated in 25% of cases, another VAA showed deviations in more than 50% of cases. For the latter, we even observed that simple prompt injections led to severe hallucinations, including false claims of non-existent ties between political parties and right-wing extremists.
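
To make the alignment percentages concrete, the sketch below computes one plausible version of such a score: the share of statements on which a model's answer matches a party's recorded answer. The abstract does not define the metric, so the exact-match definition, the answer labels, and the toy data are all assumptions for illustration, not results from the paper.

```python
def alignment_rate(model_answers, party_answers):
    """Fraction of statements where the model's answer equals the party's."""
    if len(model_answers) != len(party_answers):
        raise ValueError("answer lists must cover the same statements")
    if not model_answers:
        raise ValueError("at least one statement is required")
    matches = sum(1 for a, b in zip(model_answers, party_answers) if a == b)
    return matches / len(model_answers)


# Made-up answers to four hypothetical statements (not Wahl-O-Mat data).
model = ["agree", "neutral", "disagree", "agree"]
party = ["agree", "disagree", "disagree", "agree"]
rate = alignment_rate(model, party)
```

Under this definition, a deviation rate such as the one reported for the VAAs would simply be one minus the alignment rate.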

[AI-12] Zweistein: A Dynamic Programming Evaluation Function for Einstein Würfelt Nicht!

Link: https://arxiv.org/abs/2502.15547
Authors: Wei Lin. Hsueh, Tsan Sheng. Hsu
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:This paper introduces Zweistein, a dynamic programming evaluation function for Einstein Würfelt Nicht! (EWN). Instead of relying on human knowledge to craft an evaluation function, Zweistein uses a data-centric approach that eliminates the need for parameter tuning. The idea is to use a vector recording the distance to the corner of all pieces. This distance vector captures the essence of EWN. It not only outperforms many traditional EWN evaluation functions but also won first place in the TCGA 2023 competition.

[AI-13] Bridging Domain Gaps between Pretrained Multimodal Models and Recommendations

Link: https://arxiv.org/abs/2502.15542
Authors: Wenyu Zhang, Jie Luo, Xinming Zhang, Yuan Fang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Notes:

Abstract:With the explosive growth of multimodal content online, pre-trained visual-language models have shown great potential for multimodal recommendation. However, while these models achieve decent performance when applied in a frozen manner, surprisingly, due to significant domain gaps (e.g., feature distribution discrepancy and task objective misalignment) between pre-training and personalized recommendation, adopting a joint training approach instead leads to performance worse than baseline. Existing approaches either rely on simple feature extraction or require computationally expensive full model fine-tuning, struggling to balance effectiveness and efficiency. To tackle these challenges, we propose Parameter-efficient Tuning for Multimodal Recommendation (PTMRec), a novel framework that bridges the domain gap between pre-trained models and recommendation systems through a knowledge-guided dual-stage parameter-efficient training strategy. This framework not only eliminates the need for costly additional pre-training but also flexibly accommodates various parameter-efficient tuning methods.

[AI-14] PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System ASPLOS2025

Link: https://arxiv.org/abs/2502.15470
Authors: Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Notes: To appear in ASPLOS 2025

Abstract:Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM model relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) one-size-fits-all approach of designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8× and 11.1× speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.
Cite as: arXiv:2502.15470 [cs.AR] (or arXiv:2502.15470v1 [cs.AR] for this version)
DOI: https://doi.org/10.48550/arXiv.2502.15470

[AI-15] Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation

Link: https://arxiv.org/abs/2502.15466
Authors: Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang, Jing Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as data scarcity and data imbalance continue to hinder their development. To address this, we consider modeling complex systems through symbolic expressions that serve as semantic descriptors of time series. Building on this concept, we introduce a series-symbol (S2) dual-modality data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic representations. Leveraging the S2 dataset, we develop SymTime, a pre-trained foundation model for TSA. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of dual-modality data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance.

[AI-16] R-LoRA: Random Initialization of Multi-Head LoRA for Multi-Task Learning

Link: https://arxiv.org/abs/2502.15455
Authors: Jinda Liu, Yi Chang, Yuan Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 9 pages, 10 figures

Abstract:Fine-tuning large language models (LLMs) is prohibitively expensive in terms of computational and memory costs. Low-rank Adaptation (LoRA), as one of the most popular parameter-efficient fine-tuning (PEFT) methods, offers a cost-effective alternative by approximating the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of down-projection matrix $A \in \mathbb{R}^{m \times r}$ and head matrix $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$. In real-world scenarios, LLMs are fine-tuned on data from multiple domains to perform tasks across various fields, embodying multi-task learning (MTL). LoRA often underperforms in such complex scenarios. To enhance LoRA’s capability in multi-task learning, we propose R-LoRA, which incorporates Multi-Head Randomization. Multi-Head Randomization diversifies the head matrices through Multi-Head Random Initialization and Multi-Head Dropout, enabling more efficient learning of task-specific features while maintaining shared knowledge representation. Extensive experiments demonstrate that R-LoRA is better at capturing task-specific knowledge, thereby improving performance in multi-task scenarios. The code is available at this https URL.
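
The low-rank factorization in the abstract can be illustrated with a short sketch that builds one shared down-projection matrix A (m × r), several randomly initialized head matrices B (r × n), and the resulting m × n updates, and also counts trainable parameters. It is a plain-Python illustration, not the authors' code: sharing a single A across heads, the Gaussian initialization scale, the number of heads, and all function names are assumptions, and Multi-Head Dropout is not modelled.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x s)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def random_matrix(rows, cols, rng, scale=0.02):
    """Gaussian-initialized matrix; the scale is an arbitrary choice."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def param_counts(m, n, r, num_heads):
    """Trainable parameters: full update, single-head LoRA, multi-head LoRA."""
    full = m * n
    single = m * r + r * n
    # Assumption: one shared A plus one B per head.
    multi = m * r + num_heads * r * n
    return full, single, multi


rng = random.Random(0)
m, n, r, num_heads = 6, 5, 2, 3
A = random_matrix(m, r, rng)  # shared down-projection, m x r
heads = [random_matrix(r, n, rng) for _ in range(num_heads)]  # B_i, r x n each
deltas = [matmul(A, B) for B in heads]  # each update is m x n with rank <= r
```

For a 4096 × 4096 weight with r = 8, `param_counts(4096, 4096, 8, 3)` gives 16,777,216 parameters for a full update, 65,536 for a single head, and 131,072 for three heads under the shared-A assumption, which shows why r ≪ min(m, n) keeps fine-tuning cheap.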

[AI-17] TAG: A Decentralized Framework for Multi-Agent Hierarchical Reinforcement Learning

Link: https://arxiv.org/abs/2502.15425
Authors: Giuseppe Paolo, Abdelhakim Benechehab, Hamza Cherkaoui, Albert Thomas, Balázs Kégl
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Hierarchical organization is fundamental to biological systems and human societies, yet artificial intelligence systems often rely on monolithic architectures that limit adaptability and scalability. Current hierarchical reinforcement learning (HRL) approaches typically restrict hierarchies to two levels or require centralized training, which limits their practical applicability. We introduce TAME Agent Framework (TAG), a framework for constructing fully decentralized hierarchical multi-agent systems. TAG enables hierarchies of arbitrary depth through a novel LevelEnv concept, which abstracts each hierarchy level as the environment for the agents above it. This approach standardizes information flow between levels while preserving loose coupling, allowing for seamless integration of diverse agent types. We demonstrate the effectiveness of TAG by implementing hierarchical architectures that combine different RL agents across multiple levels, achieving improved performance over classical multi-agent RL baselines on standard benchmarks. Our results show that decentralized hierarchical organization enhances both learning speed and final performance, positioning TAG as a promising direction for scalable multi-agent systems.
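
The idea that each level acts as the environment for the level above can be illustrated with a toy sketch. The abstract names only the LevelEnv concept; the interface below (`observe`, `step`), the `CounterEnv` task, and the hand-written `worker` and `manager` policies are hypothetical stand-ins, not the framework's actual API.

```python
class CounterEnv:
    """Toy bottom-level environment: an integer state, reward is -|state|."""
    def __init__(self, start=5):
        self.state = start

    def observe(self):
        return self.state

    def step(self, action):
        self.state += action
        return self.state, -abs(self.state)

class LevelEnv:
    """Wraps a lower environment together with this level's policy, and
    exposes the same observe/step interface to the level above."""
    def __init__(self, lower, policy):
        self.lower = lower
        self.policy = policy

    def observe(self):
        return self.lower.observe()

    def step(self, instruction):
        # Turn the instruction from above into an action for the level below.
        action = self.policy(self.observe(), instruction)
        return self.lower.step(action)

def worker(obs, instruction):
    # Low-level policy: move at most one unit in the instructed direction.
    return max(-1, min(1, instruction))

def manager(obs, instruction):
    # High-level policy: ask the worker to move toward zero.
    return -obs

# Levels nest to any depth because every level looks like an environment.
env = LevelEnv(LevelEnv(CounterEnv(start=5), worker), manager)
for _ in range(5):
    obs, reward = env.step(None)
print(obs, reward)  # 0 0
```

Since a `LevelEnv` has the same interface as the environment it wraps, adding another level needs no change to the levels below it.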

[AI-18] Integrating Generative AI in Cybersecurity Education: Case Study Insights on Pedagogical Strategies Critical Thinking and Responsible AI Use

Link: https://arxiv.org/abs/2502.15357
Authors: Mahmoud Elkhodr, Ergun Gide
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 30 pages

Abstract:The rapid advancement of Generative Artificial Intelligence (GenAI) has introduced new opportunities for transforming higher education, particularly in fields that require analytical reasoning and regulatory compliance, such as cybersecurity management. This study presents a structured framework for integrating GenAI tools into cybersecurity education, demonstrating their role in fostering critical thinking, real-world problem-solving, and regulatory awareness. The implementation strategy followed a two-stage approach, embedding GenAI within tutorial exercises and assessment tasks. Tutorials enabled students to generate, critique, and refine AI-assisted cybersecurity policies, while assessments required them to apply AI-generated outputs to real-world scenarios, ensuring alignment with industry standards and regulatory requirements. Findings indicate that AI-assisted learning significantly enhanced students’ ability to evaluate security policies, refine risk assessments, and bridge theoretical knowledge with practical application. Student reflections and instructor observations revealed improvements in analytical engagement, yet challenges emerged regarding AI over-reliance, variability in AI literacy, and the contextual limitations of AI-generated content. Through structured intervention and research-driven refinement, students were able to recognize AI strengths as a generative tool while acknowledging its need for human oversight. This study further highlights the broader implications of AI adoption in cybersecurity education, emphasizing the necessity of balancing automation with expert judgment to cultivate industry-ready professionals. Future research should explore the long-term impact of AI-driven learning on cybersecurity competency, as well as the potential for adaptive AI-assisted assessments to further personalize and enhance educational outcomes.

[AI-19] Exploring Embodied Multimodal Large Models: Development Datasets and Future Directions

Link: https://arxiv.org/abs/2502.15336
Authors: Shoubin Chen, Zehao Wu, Kai Zhang, Chunyu Li, Baiyang Zhang, Fei Ma, Fei Richard Yu, Qingquan Li
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 81 pages, submitted to a journal for review

Abstract:Embodied multimodal large models (EMLMs) have gained significant attention in recent years due to their potential to bridge the gap between perception, cognition, and action in complex, real-world environments. This comprehensive review explores the development of such models, including Large Language Models (LLMs), Large Vision Models (LVMs), and other models, while also examining other emerging architectures. We discuss the evolution of EMLMs, with a focus on embodied perception, navigation, interaction, and simulation. Furthermore, the review provides a detailed analysis of the datasets used for training and evaluating these models, highlighting the importance of diverse, high-quality data for effective learning. The paper also identifies key challenges faced by EMLMs, including issues of scalability, generalization, and real-time decision-making. Finally, we outline future directions, emphasizing the integration of multimodal sensing, reasoning, and action to advance the development of increasingly autonomous systems. By providing an in-depth analysis of state-of-the-art methods and identifying critical gaps, this paper aims to inspire future advancements in EMLMs and their applications across diverse domains.

[AI-20] Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

Link: https://arxiv.org/abs/2502.15334
Authors: Pedram Zaree, Md Abdullah Al Mamun, Quazi Mishkatul Alam, Yue Dong, Ihsen Alouani, Nael Abu-Ghazaleh
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using less than a third of the generation time).

[AI-21] Lightweight yet Efficient: An External Attentive Graph Convolutional Network with Positional Prompts for Sequential Recommendation

Link: https://arxiv.org/abs/2502.15331
Authors: Jinyu Zhang, Chao Li, Zhongying Zhao
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 8 figures, journal paper, accepted by TOIS on 20 February 2025

Abstract:Graph-based Sequential Recommender systems (GSRs) have gained significant research attention due to their ability to simultaneously handle user-item interactions and sequential relationships between items. Current GSRs often utilize composite or in-depth structures for graph encoding (e.g., the Graph Transformer). Nevertheless, they have high computational complexity, hindering deployment on resource-constrained edge devices. Moreover, the relative position encoding in the Graph Transformer has difficulty capturing the complicated positional dependencies within a sequence. To this end, we propose an External Attentive Graph convolutional network with Positional prompts for Sequential recommendation, namely EA-GPS. Specifically, we first introduce an external attentive graph convolutional network that linearly measures the global associations among nodes via two external memory units. Then, we present a positional prompt-based decoder that explicitly treats the absolute item positions as external prompts. By introducing length-adaptive sequential masking and a soft attention network, such a decoder helps the model capture the long-term positional dependencies and contextual relationships within sequences. Extensive experimental results on five real-world datasets demonstrate that the proposed EA-GPS outperforms the state-of-the-art methods. Remarkably, it achieves superior performance while maintaining a smaller parameter size and lower training overhead. The implementation of this work is publicly available at this https URL.

[AI-22] Beyond Fixed Variables: Expanding-variate Time Series Forecasting via Flat Scheme and Spatio-temporal Focal Learning

Link: https://arxiv.org/abs/2502.15296
Authors: Minbo Ma, Kai Tang, Huan Li, Fei Teng, Dalin Zhang, Tianrui Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Multivariate Time Series Forecasting (MTSF) has long been a key research focus. Traditionally, these studies assume a fixed number of variables, but in real-world applications, Cyber-Physical Systems often expand as new sensors are deployed, increasing variables in MTSF. In light of this, we introduce a novel task, Expanding-variate Time Series Forecasting (EVTSF). This task presents unique challenges, specifically (1) handling inconsistent data shapes caused by adding new variables, and (2) addressing imbalanced spatio-temporal learning, where expanding variables have limited observed data due to the necessity for timely operation. To address these challenges, we propose STEV, a flexible spatio-temporal forecasting framework. STEV includes a new Flat Scheme to tackle the inconsistent data shape issue, which extends the graph-based spatio-temporal modeling architecture into 1D space by flattening the 2D samples along the variable dimension, making the model variable-scale-agnostic while still preserving dynamic spatial correlations through a holistic graph. We introduce a novel Spatio-temporal Focal Learning strategy that incorporates a negative filter to resolve potential conflicts between contrastive learning and graph representation, and a focal contrastive loss as its core to guide the framework to focus on optimizing the expanding variables. We benchmark EVTSF performance using three real-world datasets and compare it against three potential solutions employing SOTA MTSF models tailored for EVTSF. Experimental results show that STEV significantly outperforms its competitors, particularly on expanding variables. Notably, STEV, with only 5% of observations from the expanding period, is on par with SOTA MTSF models trained with complete observations. Further exploration of various expanding strategies underscores the generalizability of STEV in real-world applications.

[AI-23] Time Warp: The Gap Between Developers' Ideal vs Actual Workweeks in an AI-Driven Era ICSE

Link: https://arxiv.org/abs/2502.15287
Authors: Sukrit Kumar, Drishti Goel, Thomas Zimmermann, Brian Houck, B. Ashok, Chetan Bansal
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: ICSE SEIP 2025

Abstract:Software developers balance a variety of different tasks in a workweek, yet the allocation of time often differs from what they consider ideal. Identifying and addressing these deviations is crucial for organizations aiming to enhance the productivity and well-being of the developers. In this paper, we present the findings from a survey of 484 software developers at Microsoft, which aims to identify the key differences between how developers would like to allocate their time during an ideal workweek versus their actual workweek. Our analysis reveals significant deviations between a developer’s ideal workweek and their actual workweek, with a clear correlation: as the gap between these two workweeks widens, we observe a decline in both productivity and satisfaction. By examining these deviations in specific activities, we assess their direct impact on the developers’ satisfaction and productivity. Additionally, given the growing adoption of AI tools in software engineering, both in the industry and academia, we identify specific tasks and areas that could be strong candidates for automation. In this paper, we make three key contributions: 1) We quantify the impact of workweek deviations on developer productivity and satisfaction 2) We identify individual tasks that disproportionately affect satisfaction and productivity 3) We provide actual data-driven insights to guide future AI automation efforts in software engineering, aligning them with the developers’ requirements and ideal workflows for maximizing their productivity and satisfaction.

[AI-24] Offload Rethinking by Cloud Assistance for Efficient Environmental Sound Recognition on LPWANs

Link: https://arxiv.org/abs/2502.15285
Authors: Le Zhang, Quanling Zhao, Run Wang, Shirley Bian, Onat Gungor, Flavio Ponzina, Tajana Rosing
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Audio and Speech Processing (eess.AS)

Abstract:Learning-based environmental sound recognition has emerged as a crucial method for ultra-low-power environmental monitoring in biological research and city-scale sensing systems. These systems usually operate under limited resources and are often powered by harvested energy in remote areas. Recent efforts in on-device sound recognition suffer from low accuracy due to resource constraints, whereas cloud offloading strategies are hindered by high communication costs. In this work, we introduce ORCA, a novel resource-efficient cloud-assisted environmental sound recognition system on batteryless devices operating over the Low-Power Wide-Area Networks (LPWANs), targeting wide-area audio sensing applications. We propose a cloud assistance strategy that remedies the low accuracy of on-device inference while minimizing the communication costs for cloud offloading. By leveraging a self-attention-based cloud sub-spectral feature selection method to facilitate efficient on-device inference, ORCA resolves three key challenges for resource-constrained cloud offloading over LPWANs: 1) high communication costs and low data rates, 2) dynamic wireless channel conditions, and 3) unreliable offloading. We implement ORCA on an energy-harvesting batteryless microcontroller and evaluate it in a real-world urban sound testbed. Our results show that ORCA outperforms state-of-the-art methods by up to $80\times$ in energy savings and $220\times$ in latency reduction while maintaining comparable accuracy.

[AI-25] ComposeOn Academy: Transforming Melodic Ideas into Complete Compositions Integrating Music Learning

Link: https://arxiv.org/abs/2502.15255
Authors: Hongxi Pu, Futian Jiang, Zihao Chen, Xingyue Song
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Music composition has long been recognized as a significant art form. However, existing digital audio workstations and music production software often present high entry barriers for users lacking formal musical training. To address this, we introduce ComposeOn, a music theory-based tool designed for users with limited musical knowledge. ComposeOn enables users to easily extend their melodic ideas into complete compositions and offers simple editing features. By integrating music theory, it explains music creation at beginner, intermediate, and advanced levels. Our user study (N=10) compared ComposeOn with the baseline method, Suno AI, demonstrating that ComposeOn provides a more accessible and enjoyable composing and learning experience for individuals with limited musical skills. ComposeOn bridges the gap between theory and practice, offering an innovative solution as both a composition aid and music education platform. The study also explores the differences between theory-based music creation and generative music, highlighting the former’s advantages in personal expression and learning.

[AI-26] Comparative Analysis of Large Language Models for Context-Aware Code Completion using SAFIM Framework

Link: https://arxiv.org/abs/2502.15243
Authors: Hang Zhang, Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du, Yiyi Tao, Yixian Shen
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:The advent of Large Language Models (LLMs) has revolutionized code completion, transforming it into a more intelligent and context-aware feature in modern integrated development environments. These advancements have significantly enhanced developers’ ability to write efficient and error-free code. This study evaluates the performance of several chat-based LLMs, including Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, GPT-4o-mini, and GPT-4 Turbo, using the Syntax-Aware Fill-in-the-Middle (SAFIM) dataset. This benchmark is specifically designed to assess models’ capabilities in syntax-sensitive code generation. Performance metrics, such as cosine similarity with ground-truth completions and latency, were employed to measure both accuracy and efficiency. The findings reveal substantial differences in the models’ code completion abilities, offering valuable insights into their respective strengths and weaknesses. This work provides a comparative analysis that underscores the trade-offs between accuracy and speed, establishing a benchmark for future advancements in LLM-based code completion.
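
The abstract names cosine similarity with ground-truth completions as an accuracy metric but does not say how completions are turned into vectors. The sketch below assumes a simple bag-of-tokens representation purely for illustration; the tokenizer regex and the function names are hypothetical, and the study may well use a different embedding.

```python
import math
import re
from collections import Counter

def token_counts(code):
    """Split code into identifiers, numbers, and single symbols, then count them."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|\S", code))

def cosine_similarity(completion, reference):
    """Cosine of the angle between the two token-count vectors (0.0 if either is empty)."""
    a, b = token_counts(completion), token_counts(reference)
    dot = sum(a[tok] * b[tok] for tok in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A completion that reorders the operands still shares every token with the reference.
print(cosine_similarity("return b + a", "return a + b"))
```

A bag-of-tokens score ignores ordering, so it is best read as a rough lexical overlap signal rather than a correctness check.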

[AI-27] Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

Link: https://arxiv.org/abs/2502.15224
Authors: Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, Dianbo Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:Given the remarkable performance of Large Language Models (LLMs), an important question arises: Can LLMs conduct human-like scientific research and discover new knowledge, and act as an AI scientist? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. In response to these limitations, we introduce a novel benchmark, Auto-Bench, that encompasses necessary aspects to evaluate LLMs for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of underlying interactions, both chemical and social, through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as the problem complexity increases, which suggests an important gap between machine and human intelligence that future development of LLMs needs to take into consideration.

[AI-28] FormalSpecCpp: A Dataset of C++ Formal Specifications created using LLMs

Link: https://arxiv.org/abs/2502.15217
Authors: Madhurima Chakraborty, Peter Pirkelbauer, Qing Yi
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Accepted at the 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)

Abstract:FormalSpecCpp is a dataset designed to fill the gap in standardized benchmarks for verifying formal specifications in C++ programs. To the best of our knowledge, this is the first comprehensive collection of C++ programs with well-defined preconditions and postconditions. It provides a structured benchmark for evaluating specification inference tools and testing the accuracy of generated specifications. Researchers and developers can use this dataset to benchmark specification inference tools, fine-tune Large Language Models (LLMs) for automated specification generation, and analyze the role of formal specifications in improving program verification and automated testing. By making this dataset publicly available, we aim to advance research in program verification, specification inference, and AI-assisted software development. The dataset and the code are available at this https URL.

[AI-29] Measuring AI agent autonomy: Towards a scalable approach with code inspection NEURIPS

Link: https://arxiv.org/abs/2502.15212
Authors: Peter Cihon, Merlin Stein, Gagan Bansal, Sam Manning, Kevin Xu
Categories: Artificial Intelligence (cs.AI)
Comments: NeurIPS Socially Responsible Language Modelling Research (SoLaR) Workshop 2024

Abstract:AI agents are AI systems that can achieve complex goals autonomously. Assessing the level of agent autonomy is crucial for understanding both their potential benefits and risks. Current assessments of autonomy often focus on specific risks and rely on run-time evaluations – observations of agent actions during operation. We introduce a code-based assessment of autonomy that eliminates the need to run an AI agent to perform specific tasks, thereby reducing the costs and risks associated with run-time evaluations. Using this code-based framework, the orchestration code used to run an AI agent can be scored according to a taxonomy that assesses attributes of autonomy: impact and oversight. We demonstrate this approach with the AutoGen framework and select applications.

[AI-30] LEDD: Large Language Model-Empowered Data Discovery in Data Lakes

Link: https://arxiv.org/abs/2502.15182
Authors: Qi An, Chihua Ying, Yuqing Zhu, Yihao Xu, Manwei Zhang, Jianmin Wang
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)

Abstract:Data discovery in data lakes with ever increasing datasets has long been recognized as a big challenge in the realm of data management, especially for semantic search of and hierarchical global catalog generation of tables. While large language models (LLMs) facilitate the processing of data semantics, challenges remain in architecting an end-to-end system that comprehensively exploits LLMs for the two semantics-related tasks. In this demo, we propose LEDD, an end-to-end system with an extensible architecture that leverages LLMs to provide hierarchical global catalogs with semantic meanings and semantic table search for data lakes. Specifically, LEDD can return semantically related tables based on natural-language specification. These features make LEDD an ideal foundation for downstream tasks such as model training and schema linking for text-to-SQL tasks. LEDD also provides a simple Python interface to facilitate the extension and the replacement of data discovery algorithms.

[AI-31] Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Link: https://arxiv.org/abs/2502.15145
Authors: Nuoya Xiong, Aarti Singh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning models, particularly Language Models (LMs), with human preferences. There are typically multiple objectives driving the preference, hence humans find it easier to express per-objective comparisons rather than a global preference between two choices (e.g., comparing two papers on their novelty, clarity, and correctness). Multi-Objective RLHF (MORLHF) aims to use per-objective preference feedback and achieve Pareto optimality among these objectives by aggregating them into a single unified objective for optimization. However, nearly all prior works rely on linear aggregation, which rules out policies that favor specific objectives such as the worst one. The only existing approach using non-linear aggregation is computationally expensive due to its reward-based nature and the need for retraining whenever the aggregation parameters change. In this work, we address this limitation by transforming the non-linear aggregation maximization problem into a series of sub-problems. Each sub-problem involves only linear aggregation, making it computationally efficient to solve. We further extend our framework to handle multi-group scenarios, where each group has distinct weights for the objectives. Our method enables achieving consensus or maximizing the aggregated objective across all groups. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret and can be easily adapted to a reward-free algorithm. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.
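
The limitation of linear aggregation mentioned in this abstract can be seen in a toy example. The three candidate policies and their per-objective rewards below are made-up numbers chosen for illustration, and the sketch only compares a fixed set of policies; it is not the projection-based algorithm the paper proposes.

```python
# Hypothetical per-objective rewards (objective 1, objective 2) for three policies.
POLICIES = {
    "specialist_a": (1.0, 0.0),
    "specialist_b": (0.0, 1.0),
    "balanced": (0.4, 0.4),
}

def best_linear(policies, weights):
    """Policy maximizing the weighted sum of per-objective rewards."""
    return max(policies, key=lambda name: sum(w * r for w, r in zip(weights, policies[name])))

def best_worst_case(policies):
    """Policy maximizing the worst objective (a non-linear, max-min aggregation)."""
    return max(policies, key=lambda name: min(policies[name]))

# No weighting on this grid selects the balanced policy under linear aggregation.
for i in range(11):
    w = i / 10
    assert best_linear(POLICIES, (w, 1 - w)) != "balanced"

print(best_worst_case(POLICIES))  # balanced
```

For any weights `(w, 1 - w)`, one of the specialists scores at least 0.5 while the balanced policy scores 0.4, so only the max-min criterion prefers it among these three candidates.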

[AI-32] The Imitation Game for Educational AI

Link: https://arxiv.org/abs/2502.15127
Authors: Shashank Sonkar, Naiming Liu, Xinghe Chen, Richard G. Baraniuk
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:As artificial intelligence systems become increasingly prevalent in education, a fundamental challenge emerges: how can we verify if an AI truly understands how students think and reason? Traditional evaluation methods like measuring learning gains require lengthy studies confounded by numerous variables. We present a novel evaluation framework based on a two-phase Turing-like test. In Phase 1, students provide open-ended responses to questions, revealing natural misconceptions. In Phase 2, both AI and human experts, conditioned on each student’s specific mistakes, generate distractors for new related questions. By analyzing whether students select AI-generated distractors at rates similar to human expert-generated ones, we can validate if the AI models student cognition. We prove this evaluation must be conditioned on individual responses - unconditioned approaches merely target common misconceptions. Through rigorous statistical sampling theory, we establish precise requirements for high-confidence validation. Our research positions conditioned distractor generation as a probe into an AI system’s fundamental ability to model student thinking - a capability that enables adapting tutoring, feedback, and assessments to each student’s specific needs.
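
Comparing how often students pick AI-generated versus expert-generated distractors comes down to comparing two proportions. The abstract does not state which statistical test the authors use, so the pooled two-proportion z-test below is an assumption made for illustration, and the counts in the usage line are invented.

```python
import math

def two_proportion_z(k_ai, n_ai, k_human, n_human):
    """Pooled two-proportion z-test.
    k_*: times a distractor of that source was selected; n_*: times it was shown.
    Returns (z, two-sided p-value)."""
    p_ai, p_human = k_ai / n_ai, k_human / n_human
    pooled = (k_ai + k_human) / (n_ai + n_human)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_ai + 1 / n_human))
    if se == 0:
        # Both rates are 0 or both are 1: no observable difference.
        return 0.0, 1.0
    z = (p_ai - p_human) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = two_proportion_z(k_ai=30, n_ai=100, k_human=60, n_human=100)
print(round(z, 2), p < 0.001)
```

A large p-value here only means no difference was detected; showing that the two rates are actually similar would call for an equivalence test with a pre-specified margin.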

[AI-33] Fundamental Survey on Neuromorphic Based Audio Classification

Link: https://arxiv.org/abs/2502.15056
Authors: Amlan Basu, Pranav Chaudhari, Gaetano Di Caterina
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 24 pages, 1 table

Abstract:Audio classification is paramount in a variety of applications including surveillance, healthcare monitoring, and environmental analysis. Traditional methods frequently depend on intricate signal processing algorithms and manually crafted features, which may fall short in fully capturing the complexities of audio patterns. Neuromorphic computing, inspired by the architecture and functioning of the human brain, presents a promising alternative for audio classification tasks. This survey provides an exhaustive examination of the current state-of-the-art in neuromorphic-based audio classification. It delves into the crucial components of neuromorphic systems, such as Spiking Neural Networks (SNNs), memristors, and neuromorphic hardware platforms, highlighting their advantages in audio classification. Furthermore, the survey explores various methodologies and strategies employed in neuromorphic audio classification, including event-based processing, spike-based learning, and bio-inspired feature extraction. It examines how these approaches address the limitations of traditional audio classification methods, particularly in terms of energy efficiency, real-time processing, and robustness to environmental noise. Additionally, the paper conducts a comparative analysis of different neuromorphic audio classification models and benchmarks, evaluating their performance metrics, computational efficiency, and scalability. By providing a comprehensive guide for researchers, engineers and practitioners, this survey aims to stimulate further innovation and advancements in the evolving field of neuromorphic audio classification.

[AI-34] DEFT: Differentiable Branched Discrete Elastic Rods for Modeling Furcated DLOs in Real-Time

Link: https://arxiv.org/abs/2502.15037
Authors: Yizhou Chen, Xiaoyue Wu, Yeheng Zong, Anran Li, Yuzhen Chen, Julie Wu, Bohao Zhang, Ram Vasudevan
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Abstract:Autonomous wire harness assembly requires robots to manipulate complex branched cables with high precision and reliability. A key challenge in automating this process is predicting how these flexible and branched structures behave under manipulation. Without accurate predictions, it is difficult for robots to reliably plan or execute assembly operations. While existing research has made progress in modeling single-threaded Deformable Linear Objects (DLOs), extending these approaches to Branched Deformable Linear Objects (BDLOs) presents fundamental challenges. The junction points in BDLOs create complex force interactions and strain propagation patterns that cannot be adequately captured by simply connecting multiple single-DLO models. To address these challenges, this paper presents Differentiable discrete branched Elastic rods for modeling Furcated DLOs in real-Time (DEFT), a novel framework that combines a differentiable physics-based model with a learning framework to: 1) accurately model BDLO dynamics, including dynamic propagation at junction points and grasping in the middle of a BDLO, 2) achieve efficient computation for real-time inference, and 3) enable planning to demonstrate dexterous BDLO manipulation. A comprehensive series of real-world experiments demonstrates DEFT’s efficacy in terms of accuracy, computational speed, and generalizability compared to state-of-the-art alternatives. Project page: this https URL.

[AI-35] Towards Physics-Guided Foundation Models

Link: https://arxiv.org/abs/2502.15013
Authors: Majid Farhadloo, Arun Sharma, Mingzhou Yang, Bharat Jayaprakash, William Northrop, Shashi Shekhar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Traditional foundation models are pre-trained on broad datasets to reduce the training resources (e.g., time, energy, labeled samples) needed for fine-tuning a wide range of downstream tasks. However, traditional foundation models struggle with out-of-distribution prediction and can produce outputs that are unrealistic and physically infeasible. We propose the notion of physics-guided foundation models (PGFM), that is, foundation models integrated with broad or general domain (e.g., scientific) physical knowledge applicable to a wide range of downstream tasks.

[AI-36] Graph in the Vault: Protecting Edge GNN Inference with Trusted Execution Environment

Link: https://arxiv.org/abs/2502.15012
Authors: Ruyi Ding, Tianhong Xu, Aidong Adam Ding, Yunsi Fei
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This work is accepted by DAC 2025

Abstract:Wide deployment of machine learning models on edge devices has rendered the model intellectual property (IP) and data privacy vulnerable. We propose GNNVault, the first secure Graph Neural Network (GNN) deployment strategy based on Trusted Execution Environment (TEE). GNNVault follows the design of ‘partition-before-training’ and includes a private GNN rectifier to complement a public backbone model. This way, both critical GNN model parameters and the private graph used during inference are protected within secure TEE compartments. Real-world implementations with Intel SGX demonstrate that GNNVault safeguards GNN inference against state-of-the-art link stealing attacks with negligible accuracy degradation (2%).

[AI-37] Safe Beyond the Horizon: Efficient Sampling-based MPC with Neural Control Barrier Functions

Link: https://arxiv.org/abs/2502.15006
Authors: Ji Yin, Oswin So, Eric Yang Yu, Chuchu Fan, Panagiotis Tsiotras
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:A common problem when using model predictive control (MPC) in practice is the satisfaction of safety specifications beyond the prediction horizon. While theoretical works have shown that safety can be guaranteed by enforcing a suitable terminal set constraint or a sufficiently long prediction horizon, these techniques are difficult to apply and thus are rarely used by practitioners, especially in the case of general nonlinear dynamics. To solve this problem, we impose a tradeoff between exact recursive feasibility, computational tractability, and applicability to “black-box” dynamics by learning an approximate discrete-time control barrier function and incorporating it into a variational inference MPC (VIMPC), a sampling-based MPC paradigm. To handle the resulting state constraints, we further propose a new sampling strategy that greatly reduces the variance of the estimated optimal control, improving the sample efficiency, and enabling real-time planning on a CPU. The resulting Neural Shield-VIMPC (NS-VIMPC) controller yields substantial safety improvements compared to existing sampling-based MPC controllers, even under badly designed cost functions. We validate our approach in both simulation and real-world hardware experiments.
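
For orientation, the block below writes out one commonly used form of a discrete-time control barrier function condition. It is a textbook-style sketch, not the paper's learned formulation; the symbols h, f, α and the safe set C are notation assumed here and do not appear in the abstract.

```latex
% Assumed notation: dynamics x_{k+1} = f(x_k, u_k), safe set C = { x : h(x) >= 0 }.
\[
  h\bigl(f(x_k, u_k)\bigr) - h(x_k) \;\ge\; -\alpha\, h(x_k),
  \qquad 0 < \alpha \le 1 .
\]
% If x_k is in C and the inequality holds, then
\[
  h(x_{k+1}) \;\ge\; (1 - \alpha)\, h(x_k) \;\ge\; 0 ,
\]
% so x_{k+1} is also in C, and safety propagates one step at a time.
```

Because the condition is checked step by step, it can in principle be enforced past the end of a finite prediction horizon, which is the gap the abstract describes.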

[AI-38] CyberSentinel: An Emergent Threat Detection System for AI Security

Link: https://arxiv.org/abs/2502.14966
Authors: Krti Tallam
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:The rapid advancement of artificial intelligence (AI) has significantly expanded the attack surface for AI-driven cybersecurity threats, necessitating adaptive defense strategies. This paper introduces CyberSentinel, a unified, single-agent system for emergent threat detection, designed to identify and mitigate novel security risks in real time. CyberSentinel integrates: (1) Brute-force attack detection through SSH log analysis, (2) Phishing threat assessment using domain blacklists and heuristic URL scoring, and (3) Emergent threat detection via machine learning-based anomaly detection. By continuously adapting to evolving adversarial tactics, CyberSentinel strengthens proactive cybersecurity defense, addressing critical vulnerabilities in AI security.
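
As a rough illustration of the first component (brute-force detection through SSH log analysis), the sketch below counts failed logins per source IP and flags the IPs that reach a threshold. It is not CyberSentinel's implementation: the log line format, the `FAILED_LOGIN` pattern, the `flag_brute_force` function, and the default threshold are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical pattern for OpenSSH-style "Failed password" lines; the exact
# log format that CyberSentinel parses is not given in the abstract.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (?P<ip>\S+) port \d+"
)


def flag_brute_force(log_lines, threshold=5):
    """Return source IPs with at least `threshold` failed logins, sorted."""
    failures = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            failures[match.group("ip")] += 1
    return sorted(ip for ip, count in failures.items() if count >= threshold)


# Synthetic log lines using documentation-only IP ranges.
sample = (
    ["sshd[101]: Failed password for root from 203.0.113.5 port 4242 ssh2"] * 6
    + ["sshd[102]: Failed password for invalid user admin from 198.51.100.7 port 22 ssh2"] * 2
    + ["sshd[103]: Accepted password for alice from 192.0.2.10 port 50022 ssh2"]
)
print(flag_brute_force(sample))  # ['203.0.113.5']
```

A plain count ignores timing. A more realistic detector would likely use a sliding time window, but the abstract does not say which rule the system applies.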

[AI-39] Why do Experts Disagree on Existential Risk and P(doom)? A Survey of AI Experts

Link: https://arxiv.org/abs/2502.14870
Authors: Severin Field
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: In submission to AI and Ethics Journal. 24 pages total, 15 pages of writing with 9 pages of appendices

Abstract:The development of artificial general intelligence (AGI) is likely to be one of humanity’s most consequential technological advancements. Leading AI labs and scientists have called for the global prioritization of AI safety citing existential risks comparable to nuclear war. However, research on catastrophic risks and AI alignment is often met with skepticism, even by experts. Furthermore, online debate over the existential risk of AI has begun to turn tribal (e.g. name-calling such as “doomer” or “accelerationist”). Until now, no systematic study has explored the patterns of belief and the levels of familiarity with AI safety concepts among experts. I surveyed 111 AI experts on their familiarity with AI safety concepts, key objections to AI safety, and reactions to safety arguments. My findings reveal that AI experts cluster into two viewpoints – an “AI as controllable tool” and an “AI as uncontrollable agent” perspective – diverging in beliefs toward the importance of AI safety. While most experts (78%) agreed or strongly agreed that “technical AI researchers should be concerned about catastrophic risks”, many were unfamiliar with specific AI safety concepts. For example, only 21% of surveyed experts had heard of “instrumental convergence,” a fundamental concept in AI safety predicting that advanced AI systems will tend to pursue common sub-goals (such as self-preservation). The least concerned participants were the least familiar with concepts like this, suggesting that effective communication of AI safety should begin with establishing clear conceptual foundations in the field.

[AI-40] Envisioning Stakeholder-Action Pairs to Mitigate Negative Impacts of AI: A Participatory Approach to Inform Policy Making

Link: https://arxiv.org/abs/2502.14869
Authors: Julia Barnett, Kimon Kieslich, Natali Helberger, Nicholas Diakopoulos
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 14 pages + supplementary information and appendix

Abstract:The potential for negative impacts of AI has rapidly become more pervasive around the world, and this has intensified a need for responsible AI governance. While many regulatory bodies endorse risk-based approaches and a multitude of risk mitigation practices are proposed by companies and academic scholars, these approaches are commonly expert-centered and thus lack the inclusion of a significant group of stakeholders. Ensuring that AI policies align with democratic expectations requires methods that prioritize the voices and needs of those impacted. In this work we develop a participative and forward-looking approach to inform policy-makers and academics that grounds the needs of lay stakeholders at the forefront and enriches the development of risk mitigation strategies. Our approach (1) maps potential mitigation and prevention strategies of negative AI impacts that assign responsibility to various stakeholders, (2) explores the importance and prioritization thereof in the eyes of laypeople, and (3) presents these insights in policy fact sheets, i.e., a digestible format for informing policy processes. We emphasize that this approach is not targeted towards replacing policy-makers; rather our aim is to present an informative method that enriches mitigation strategies and enables a more participatory approach to policy development.

[AI-41] Unlocking the Black Box: Analysing the EU Artificial Intelligence Act's Framework for Explainability in AI

Link: https://arxiv.org/abs/2502.14868
Authors: Georgios Pavlidis
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:The lack of explainability of Artificial Intelligence (AI) is one of the first obstacles that the industry and regulators must overcome to mitigate the risks associated with the technology. The need for eXplainable AI (XAI) is evident in fields where accountability, ethics and fairness are critical, such as healthcare, credit scoring, policing and the criminal justice system. At the EU level, the notion of explainability is one of the fundamental principles that underpin the AI Act, though the exact XAI techniques and requirements are still to be determined and tested in practice. This paper explores various approaches and techniques that promise to advance XAI, as well as the challenges of implementing the principle of explainability in AI governance and policies. Finally, the paper examines the integration of XAI into EU law, emphasising the issues of standard setting, oversight, and enforcement.

[AI-42] Intelligent Tutors for Adult Learners: An Analysis of Needs and Challenges

Link: https://arxiv.org/abs/2412.04477
Authors: Adit Gupta, Momin Siddiqui, Glen Smith, Jenn Reddig, Christopher MacLellan
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:This work examines the sociotechnical factors that influence the adoption and usage of intelligent tutoring systems in self-directed learning contexts, focusing specifically on adult learners. The study is divided into two parts. First, we present Apprentice Tutors, a novel intelligent tutoring system designed to address the unique needs of adult learners. The platform includes adaptive problem selection, real-time feedback, and visual dashboards to support learning in college algebra topics. Second, we investigate the specific needs and experiences of adult users through a deployment study and a series of focus groups. Using thematic analysis, we identify key challenges and opportunities to improve tutor design and adoption. Based on these findings, we offer actionable design recommendations to help developers create intelligent tutoring systems that better align with the motivations and learning preferences of adult learners. This work contributes to a wider understanding of how to improve educational technologies to support lifelong learning and professional development.

[AI-43] Feature maps for the Laplacian kernel and its generalizations

Link: https://arxiv.org/abs/2502.15575
Authors: Sudhendu Ahir, Parthe Pandit
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent applications of kernel methods in machine learning have seen a renewed interest in the Laplacian kernel, due to its stability to the bandwidth hyperparameter in comparison to the Gaussian kernel, as well as its expressivity being equivalent to that of the neural tangent kernel of deep fully connected networks. However, unlike the Gaussian kernel, the Laplacian kernel is not separable. This poses challenges for techniques to approximate it, especially via the random Fourier features (RFF) methodology and its variants. In this work, we provide random features for the Laplacian kernel and its two generalizations: Matérn kernel and the Exponential power kernel. We provide efficiently implementable schemes to sample weight matrices so that random features approximate these kernels. These weight matrices have a weakly coupled heavy-tailed randomness. Via numerical experiments on real datasets we demonstrate the efficacy of these random feature maps.
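
To make the random-features idea concrete, here is a minimal sketch of generic random Fourier features for the Laplacian kernel k(x, y) = exp(-||x - y||_2 / σ). It is not the sampling scheme proposed in the paper. It relies on the standard fact that this kernel's spectral distribution is a multivariate Cauchy, which can be drawn as a Gaussian vector divided by the absolute value of an independent standard normal. The function names, the bandwidth convention, and the feature count are assumptions made for illustration.

```python
import math
import random


def sample_frequencies(dim, num_features, bandwidth, rng):
    """Sample RFF frequencies and phases for exp(-||x - y||_2 / bandwidth).

    Each frequency is a multivariate Cauchy draw: a standard Gaussian vector
    divided by |z| with z ~ N(0, 1), then scaled by 1 / bandwidth.
    """
    freqs, phases = [], []
    for _ in range(num_features):
        z = abs(rng.gauss(0.0, 1.0))
        while z == 0.0:  # guard against an exact zero divisor
            z = abs(rng.gauss(0.0, 1.0))
        freqs.append([rng.gauss(0.0, 1.0) / (z * bandwidth) for _ in range(dim)])
        phases.append(rng.uniform(0.0, 2.0 * math.pi))
    return freqs, phases


def feature_map(x, freqs, phases):
    """Map x to sqrt(2 / D) * cos(w_i . x + b_i) for each sampled frequency."""
    scale = math.sqrt(2.0 / len(freqs))
    return [
        scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
        for w, b in zip(freqs, phases)
    ]


def laplacian_kernel(x, y, bandwidth):
    """Exact kernel value exp(-||x - y||_2 / bandwidth)."""
    return math.exp(-math.dist(x, y) / bandwidth)


rng = random.Random(0)
freqs, phases = sample_frequencies(dim=3, num_features=5000, bandwidth=1.0, rng=rng)
x, y = [0.2, -0.1, 0.4], [0.0, 0.3, 0.1]
phi_x, phi_y = feature_map(x, freqs, phases), feature_map(y, freqs, phases)
approx = sum(a * b for a, b in zip(phi_x, phi_y))
exact = laplacian_kernel(x, y, bandwidth=1.0)
print(f"approx={approx:.3f} exact={exact:.3f}")
```

The shared divisor |z| couples the coordinates of each frequency vector, so the frequency distribution does not factor across dimensions. This is the non-separability the abstract refers to.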

[AI-44] BAN: Neuroanatomical Aligning in Auditory Recognition between Artificial Neural Network and Human Cortex

Link: https://arxiv.org/abs/2502.15503
Authors: Haidong Wang, Pengfei Xiao, Ao Liu, Jianhua Zhang, Qia Shan
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:Drawing inspiration from neurosciences, artificial neural networks (ANNs) have evolved from shallow architectures to highly complex, deep structures, yielding exceptional performance in auditory recognition tasks. However, traditional ANNs often struggle to align with brain regions due to their excessive depth and lack of biologically realistic features, like recurrent connection. To address this, a brain-like auditory network (BAN) is introduced, which incorporates four neuroanatomically mapped areas and recurrent connection, guided by a novel metric called the brain-like auditory score (BAS). BAS serves as a benchmark for evaluating the similarity between BAN and human auditory recognition pathway. We further propose that specific areas in the cerebral cortex, mainly the middle and medial superior temporal (T2/T3) areas, correspond to the designed network structure, drawing parallels with the brain’s auditory perception pathway. Our findings suggest that the neuroanatomical similarity in the cortex and auditory classification abilities of the ANN are well-aligned. In addition to delivering excellent performance on a music genre classification task, the BAN demonstrates a high BAS score. In conclusion, this study presents BAN as a recurrent, brain-inspired ANN, representing the first model that mirrors the cortical pathway of auditory recognition.

[AI-45] Super-Resolution for Interferometric Imaging: Model Comparisons and Performance Analysis

Link: https://arxiv.org/abs/2502.15397
Authors: Hasan Berkay Abdioglu, Rana Gursoy, Yagmur Isik, Ibrahim Cem Balci, Taha Unal, Kerem Bayer, Mustafa Ismail Inal, Nehir Serin, Muhammed Furkan Kosar, Gokhan Bora Esmer, Huseyin Uvet
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI)

Abstract:This study investigates the application of Super-Resolution techniques in holographic microscopy to enhance quantitative phase imaging. An off-axis Mach-Zehnder interferometric setup was employed to capture interferograms. The study evaluates two Super-Resolution models, RCAN and Real-ESRGAN, for their effectiveness in reconstructing high-resolution interferograms from a microparticle-based dataset. The models were assessed using two primary approaches: image-based analysis for structural detail enhancement and morphological evaluation for maintaining sample integrity and phase map accuracy. The results demonstrate that RCAN achieves superior numerical precision, making it ideal for applications requiring highly accurate phase map reconstruction, while Real-ESRGAN enhances visual quality and structural coherence, making it suitable for visualization-focused applications. This study highlights the potential of Super-Resolution models in overcoming diffraction-imposed resolution limitations in holographic microscopy, opening the way for improved imaging techniques in biomedical diagnostics, materials science, and other high-precision fields.

[AI-46] Key Body Posture Characteristics of Short-distance Speed Skaters at the Start Based on Artificial Intelligence

Link: https://arxiv.org/abs/2502.15185
Authors: Zhang Xueliana, Fang Yingjieb, Liu Hang
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)

Abstract:Objective: To conduct biomechanical analysis on the starting technique of male short-distance speed skating athletes in China and determine the key factors affecting the effectiveness of the starting movement. Methods: 13 high-level male short-distance speed skating athletes were selected as the test subjects, and kinematic data were collected using an artificial intelligence video capture and analysis system. The body posture features and their effects on the starting movement performance were analyzed in the three stages of starting preparation, starting, and sprinting. Results: The post-stability angle, anterior knee angle of the front leg, posterior knee angle of the rear leg, and stride length showed moderate to high positive correlations with the starting speed during the starting preparation stage. The trunk angle showed a high negative correlation with the starting speed. The trunk angle (TO4, TD4, TO6, TD6), hip angle (TO1, TO4, TO6), and knee angle (TD1) showed moderate to high negative correlations with the effectiveness of the starting movement during the starting and sprinting stages. The knee angle (TD2), ice-contact angle (TD2, TD4, TD5, TD6), and propulsion angle (TO1, TO4, TO7) showed moderate positive correlations with the effectiveness of the starting movement. Conclusion: Stride length, left knee angle, and post-stability angle are the key factors affecting the starting speed. The larger the post-stability angle and left knee angle and the longer the stride length, the faster the starting speed. During the starting and sprinting stages, the smaller the ice-contact angle and propulsion angle, the greater the trunk angle and hip angle changes, the more effective the starting movement.

[AI-47] Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design

Link: https://arxiv.org/abs/2502.14944
Authors: Masatoshi Uehara, Xingyu Su, Yulai Zhao, Xiner Li, Aviv Regev, Shuiwang Ji, Sergey Levine, Tommaso Biancalani
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Under review. If you have any suggestions/missing references, please let us know

Abstract:To fully leverage the capabilities of diffusion models, we are often interested in optimizing downstream reward functions during inference. While numerous algorithms for reward-guided generation have been recently proposed due to their significance, current approaches predominantly focus on single-shot generation, transitioning from fully noised to denoised states. We propose a novel framework for inference-time reward optimization with diffusion models inspired by evolutionary algorithms. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising. This sequential refinement allows for the gradual correction of errors introduced during reward optimization. Besides, we provide a theoretical guarantee for our framework. Finally, we demonstrate its superior empirical performance in protein and cell-type-specific regulatory DNA design. The code is available at this https URL.

[AI-48] Fast and Accurate Blind Flexible Docking ICLR2025

Link: https://arxiv.org/abs/2502.14934
Authors: Zizhuo Zhang, Lijun Wu, Kaiyuan Gao, Jiangchao Yao, Tao Qin, Bo Han
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, Accepted by ICLR 2025

Abstract:Molecular docking, which predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex’s architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking (pocket identification, ligand conformation prediction, and protein flexibility modeling) into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208×) compared to existing state-of-the-art methods. Our code is released at this https URL.

[AI-49] Is Mathematics Obsolete?

Link: https://arxiv.org/abs/2502.14874
Authors: Jeremy Avigad
Categories: History and Overview (math.HO); Artificial Intelligence (cs.AI)

Abstract:This is an essay about the value of mathematical and symbolic reasoning in the age of AI.

Machine Learning

[LG-0] Testing the limits of fine-tuning to improve reasoning in vision language models

Link: https://arxiv.org/abs/2502.15678
Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Elif Akata, Matthias Bethge, Joshua B. Tenenbaum, Eric Schulz
Categories: Machine Learning (cs.LG)

Abstract:Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.

[LG-1] Predicting gene essentiality and drug response from perturbation screens in preclinical cancer models with LEAP: Layered Ensemble of Autoencoders and Predictors

Link: https://arxiv.org/abs/2502.15646
Authors: Barbara Bodinier, Gaetan Dissez, Linus Bleistein, Antonin Dauvin
Categories: Machine Learning (cs.LG)

Abstract:Preclinical perturbation screens, where the effects of genetic, chemical, or environmental perturbations are systematically tested on disease models, hold significant promise for machine learning-enhanced drug discovery due to their scale and causal nature. Predictive models can infer perturbation responses for previously untested disease models based on molecular profiles. These in silico labels can expand databases and guide experimental prioritization. However, modelling perturbation-specific effects and generating robust prediction performances across diverse biological contexts remain elusive. We introduce LEAP (Layered Ensemble of Autoencoders and Predictors), a novel ensemble framework to improve robustness and generalization. LEAP leverages multiple DAMAE (Data Augmented Masked Autoencoder) representations and LASSO regressors. By combining diverse gene expression representation models learned from different random initializations, LEAP consistently outperforms state-of-the-art approaches in predicting gene essentiality or drug responses in unseen cell lines, tissues and disease models. Notably, our results show that ensembling representation models, rather than prediction models alone, yields superior predictive performance. Beyond its performance gains, LEAP is computationally efficient, requires minimal hyperparameter tuning and can therefore be readily incorporated into drug discovery pipelines to prioritize promising targets and support biomarker-driven stratification. The code and datasets used in this work are made publicly available.

[LG-2] Training Neural ODEs Using Fully Discretized Simultaneous Optimization

Link: https://arxiv.org/abs/2502.15642
Authors: Mariia Shapovalova, Calvin Tsay
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted to the 14th IFAC Symposium on Dynamics and Control of Process Systems, including Biosystems (DYCOPS 2025)

Abstract:Neural Ordinary Differential Equations (Neural ODEs) represent continuous-time dynamics with neural networks, offering advancements for modeling and control tasks. However, training Neural ODEs requires solving differential equations at each epoch, leading to high computational costs. This work investigates simultaneous optimization methods as a faster training alternative. In particular, we employ a collocation-based, fully discretized formulation and use IPOPT (a solver for large-scale nonlinear optimization) to simultaneously optimize collocation coefficients and neural network parameters. Using the Van der Pol Oscillator as a case study, we demonstrate faster convergence compared to traditional training methods. Furthermore, we introduce a decomposition framework utilizing Alternating Direction Method of Multipliers (ADMM) to effectively coordinate sub-models among data batches. Our results show significant potential for (collocation-based) simultaneous Neural ODE training pipelines.

[LG-3] Model Privacy: A Unified Framework to Understand Model Stealing Attacks and Defenses

Link: https://arxiv.org/abs/2502.15567
Authors: Ganghua Wang, Yuhong Yang, Jie Ding
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The use of machine learning (ML) has become increasingly prevalent in various domains, highlighting the importance of understanding and ensuring its safety. One pressing concern is the vulnerability of ML applications to model stealing attacks. These attacks involve adversaries attempting to recover a learned model through limited query-response interactions, such as those found in cloud-based services or on-chip artificial intelligence interfaces. While existing literature proposes various attack and defense strategies, these often lack a theoretical foundation and standardized evaluation criteria. In response, this work presents a framework called “Model Privacy”, providing a foundation for comprehensively analyzing model stealing attacks and defenses. We establish a rigorous formulation for the threat model and objectives, propose methods to quantify the goodness of attack and defense strategies, and analyze the fundamental tradeoffs between utility and privacy in ML models. Our developed theory offers valuable insights into enhancing the security of ML models, especially highlighting the importance of the attack-specific structure of perturbations for effective defenses. We demonstrate the application of model privacy from the defender’s perspective through various learning scenarios. Extensive experiments corroborate the insights and the effectiveness of defense mechanisms developed under the proposed framework.

[LG-4] A Defensive Framework Against Adversarial Attacks on Machine Learning-Based Network Intrusion Detection Systems

Link: https://arxiv.org/abs/2502.15561
Authors: Benyamin Tafreshian, Shengzhi Zhang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted to IEEE AI+ TrustCom 2024

Abstract:As cyberattacks become increasingly sophisticated, advanced Network Intrusion Detection Systems (NIDS) are critical for modern network security. Traditional signature-based NIDS are inadequate against zero-day and evolving attacks. In response, machine learning (ML)-based NIDS have emerged as promising solutions; however, they are vulnerable to adversarial evasion attacks that subtly manipulate network traffic to bypass detection. To address this vulnerability, we propose a novel defensive framework that enhances the robustness of ML-based NIDS by simultaneously integrating adversarial training, dataset balancing techniques, advanced feature engineering, ensemble learning, and extensive model fine-tuning. We validate our framework using the NSL-KDD and UNSW-NB15 datasets. Experimental results show, on average, a 35% increase in detection accuracy and a 12.5% reduction in false positives compared to baseline models, particularly under adversarial conditions. The proposed defense against adversarial attacks significantly advances the practical deployment of robust ML-based NIDS in real-world networks.

[LG-5] Solving Inverse Problems with Deep Linear Neural Networks: Global Convergence Guarantees for Gradient Descent with Weight Decay

Link: https://arxiv.org/abs/2502.15522
Authors: Hannah Laus, Suzanna Parkinson, Vasileios Charisopoulos, Felix Krahmer, Rebecca Willett
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Machine learning methods are commonly used to solve inverse problems, wherein an unknown signal must be estimated from few measurements generated via a known acquisition procedure. In particular, neural networks perform well empirically but have limited theoretical guarantees. In this work, we study an underdetermined linear inverse problem that admits several possible solution mappings. A standard remedy (e.g., in compressed sensing) establishing uniqueness of the solution mapping is to assume knowledge of latent low-dimensional structure in the source signal. We ask the following question: do deep neural networks adapt to this low-dimensional structure when trained by gradient descent with weight decay regularization? We prove that mildly overparameterized deep linear networks trained in this manner converge to an approximate solution that accurately solves the inverse problem while implicitly encoding latent subspace structure. To our knowledge, this is the first result to rigorously show that deep linear networks trained with weight decay automatically adapt to latent subspace structure in the data under practical stepsize and weight initialization schemes. Our work highlights that regularization and overparameterization improve generalization, while overparameterization also accelerates convergence during training.

[LG-6] SALSA-RL: Stability Analysis in the Latent Space of Actions for Reinforcement Learning

Link: https://arxiv.org/abs/2502.15512
Authors: Xuyang Li, Romit Maulik
Categories: Machine Learning (cs.LG)

Abstract:Modern deep reinforcement learning (DRL) methods have made significant advances in handling continuous action spaces. However, real-world control systems, especially those requiring precise and reliable performance, often demand formal stability, and existing DRL approaches typically lack explicit mechanisms to ensure or analyze stability. To address this limitation, we propose SALSA-RL (Stability Analysis in the Latent Space of Actions), a novel RL framework that models control actions as dynamic, time-dependent variables evolving within a latent space. By employing a pre-trained encoder-decoder and a state-dependent linear system, our approach enables both stability analysis and interpretability. We demonstrate that SALSA-RL can be deployed in a non-invasive manner for assessing the local stability of actions from pretrained RL agents without compromising on performance across diverse benchmark environments. By enabling a more interpretable analysis of action generation, SALSA-RL provides a powerful tool for advancing the design, analysis, and theoretical understanding of RL systems.

[LG-7] Verification and Validation for Trustworthy Scientific Machine Learning

Link: https://arxiv.org/abs/2502.15496
Authors: John D. Jakeman, Lorena A. Barba, Joaquim R. R. A. Martins, Thomas O’Leary-Roseberry
Categories: Machine Learning (cs.LG)

Abstract:Scientific machine learning (SciML) models are transforming many scientific disciplines. However, the development of good modeling practices to increase the trustworthiness of SciML has lagged behind its application, limiting its potential impact. The goal of this paper is to start a discussion on establishing consensus-based good practices for predictive SciML. We identify key challenges in applying existing computational science and engineering guidelines, such as verification and validation protocols, and provide recommendations to address these challenges. Our discussion focuses on predictive SciML, which uses machine learning models to learn, improve, and accelerate numerical simulations of physical systems. While centered on predictive applications, our 16 recommendations aim to help researchers conduc

[LG-8] Network Resource Optimization for ML-Based UAV Condition Monitoring with Vibration Analysis

Link: https://arxiv.org/abs/2502.15491
Authors: Alexandre Gemayel, Dimitrios Michael Manias, Abdallah Shami
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: Accepted for publication in IEEE Networking Letters

Abstract:As smart cities begin to materialize, the role of Unmanned Aerial Vehicles (UAVs) and their reliability becomes increasingly important. One aspect of reliability relates to Condition Monitoring (CM), where Machine Learning (ML) models are leveraged to identify abnormal and adverse conditions. Given the resource-constrained nature of next-generation edge networks, the utilization of precious network resources must be minimized. This work explores the optimization of network resources for ML-based UAV CM frameworks. The developed framework uses experimental data and varies the feature extraction aggregation interval to optimize ML model selection. Additionally, by leveraging dimensionality reduction techniques, there is a 99.9% reduction in network resource consumption.

[LG-9] MoMa: A Modular Deep Learning Framework for Material Property Prediction

Link: https://arxiv.org/abs/2502.15483
Authors: Botian Wang, Yawen Ouyang, Yaohui Li, Yiqun Wang, Haorui Cui, Jianbing Zhang, Xiaonan Wang, Wei-Ying Ma, Hao Zhou
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Deep learning methods for material property prediction have been widely explored to advance materials discovery. However, the prevailing pre-train then fine-tune paradigm often fails to address the inherent diversity and disparity of material tasks. To overcome these challenges, we introduce MoMa, a Modular framework for Materials that first trains specialized modules across a wide range of tasks and then adaptively composes synergistic modules tailored to each downstream scenario. Evaluation across 17 datasets demonstrates the superiority of MoMa, with a substantial 14% average improvement over the strongest baseline. Few-shot and continual learning experiments further highlight MoMa’s potential for real-world applications. Pioneering a new paradigm of modular material learning, MoMa will be open-sourced to foster broader community collaboration.

[LG-10] Decoding for Punctured Convolutional and Turbo Codes: A Deep Learning Solution for Protocols Compliance

Link: https://arxiv.org/abs/2502.15475
Authors: Yongli Yan, Linglong Dai
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)

Abstract:Neural network-based decoding methods have shown promise in enhancing error correction performance, but traditional approaches struggle with the challenges posed by punctured codes. In particular, these methods fail to address the complexities of variable code rates and the need for protocol compatibility. This paper presents a unified Long Short-Term Memory (LSTM)-based decoding architecture specifically designed to overcome these challenges. The proposed method unifies punctured convolutional and Turbo codes. A puncture embedding mechanism integrates puncturing patterns directly into the network, enabling seamless adaptation to varying code rates, while balanced bit error rate training ensures robustness across different code lengths, rates, and channels, maintaining protocol flexibility. Extensive simulations in Additive White Gaussian Noise and Rayleigh fading channels demonstrate that the proposed approach outperforms conventional decoding techniques, providing significant improvements in decoding accuracy and robustness. These results underscore the potential of LSTM-based decoding as a promising solution for next-generation artificial intelligence powered communication systems.

[LG-11] Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs NEURIPS2024

Link: https://arxiv.org/abs/2502.15427
Authors: Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, Prasanna Sattigeri, Kush Varshney
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: NeurIPS 2024, Safe Generative AI Workshop

Abstract:As large language models (LLMs) become integrated into everyday applications, ensuring their robustness and security is increasingly critical. In particular, LLMs can be manipulated into unsafe behaviour by prompts known as jailbreaks. The variety of jailbreak styles is growing, necessitating the use of external defences known as guardrails. While many jailbreak defences have been proposed, not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them. Moreover, the lack of systematisation around defences has created significant gaps in their practical application. In this work, we perform systematic benchmarking across 15 different defences, considering a broad swathe of malicious and benign datasets. We find that there is significant performance variation depending on the style of jailbreak a defence is subject to. Additionally, we show that based on current datasets available for evaluation, simple baselines can display competitive out-of-distribution performance compared to many state-of-the-art defences. Code is available at this https URL.

[LG-12] Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution AAAI2025

Link: https://arxiv.org/abs/2502.15403
Authors: Carlos Eiras-Franco, Anna Hedström, Marina M.-C. Höhne
Categories: Machine Learning (cs.LG)
Comments: Accepted to AAAI 2025

Abstract:Obtaining high-quality explanations of a model’s output enables developers to identify and correct biases, align the system’s behavior with human values, and ensure ethical compliance. Explainable Artificial Intelligence (XAI) practitioners rely on specific measures to gauge the quality of such explanations. These measures assess key attributes, such as how closely an explanation aligns with a model’s decision process (faithfulness), how accurately it pinpoints the relevant input features (localization), and its consistency across different cases (robustness). Despite providing valuable information, these measures do not fully address a critical practitioner’s concern: how does the quality of a given explanation compare to other potential explanations? Traditionally, the quality of an explanation has been assessed by comparing it to a randomly generated counterpart. This paper introduces an alternative: the Quality Gap Estimate (QGE). The QGE method offers a direct comparison to what can be viewed as the "inverse" explanation, one that conceptually represents the antithesis of the original explanation. Our extensive testing across multiple model architectures, datasets, and established quality metrics demonstrates that the QGE method is superior to the traditional approach. Furthermore, we show that QGE enhances the statistical reliability of these quality assessments. This advance represents a significant step toward a more insightful evaluation of explanations that enables a more effective inspection of a model’s behavior.

[LG-13] Learning Chern Numbers of Topological Insulators with Gauge Equivariant Neural Networks

Link: https://arxiv.org/abs/2502.15376
Authors: Longde Huang, Oleksandr Balabanov, Hampus Linander, Mats Granath, Daniel Persson, Jan E. Gerken
Categories: Machine Learning (cs.LG); Mesoscale and Nanoscale Physics (cond-mat.mes-hall)

Abstract:Equivariant network architectures are a well-established tool for predicting invariant or equivariant quantities. However, almost all learning problems considered in this context feature a global symmetry, i.e. each point of the underlying space is transformed with the same group element, as opposed to a local "gauge" symmetry, where each point is transformed with a different group element, exponentially enlarging the size of the symmetry group. Gauge equivariant networks have so far mainly been applied to problems in quantum chromodynamics. Here, we introduce a novel application domain for gauge-equivariant networks in the theory of topological condensed matter physics. We use gauge equivariant networks to predict topological invariants (Chern numbers) of multiband topological insulators. The gauge symmetry of the network guarantees that the predicted quantity is a topological invariant. We introduce a novel gauge equivariant normalization layer to stabilize the training and prove a universal approximation theorem for our setup. We train on samples with trivial Chern number only but show that our models generalize to samples with non-trivial Chern number. We provide various ablations of our setup. Our code is available at this https URL.

[LG-14] Efficient and Provable Algorithms for Covariate Shift

Link: https://arxiv.org/abs/2502.15372
Authors: Deeksha Adil, Jarosław Błasiok
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)

Abstract:Covariate shift, a widely used assumption in tackling distributional shift (when training and test distributions differ), focuses on scenarios where the distribution of the labels conditioned on the feature vector is the same, but the distribution of features in the training and test data are different. Despite the significance and extensive work on covariate shift, theoretical guarantees for algorithms in this domain remain sparse. In this paper, we distill the essence of the covariate shift problem and focus on estimating the average $\mathbb{E}_{\tilde{\mathbf{x}} \sim p_{\mathrm{test}}} \mathbf{f}(\tilde{\mathbf{x}})$ of any unknown and bounded function $\mathbf{f}$, given labeled training samples $(\mathbf{x}_i, \mathbf{f}(\mathbf{x}_i))$ and unlabeled test samples $\tilde{\mathbf{x}}_i$; this is a core subroutine for several widely studied learning problems. We give several efficient algorithms, with provable sample complexity and computational guarantees. Moreover, we provide the first rigorous analysis of algorithms in this space when $\mathbf{f}$ is unrestricted, laying the groundwork for developing a solid theoretical foundation for covariate shift problems.
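
For orientation, the estimation target in this abstract can be related to the training distribution through the classical importance-weighting identity. This is textbook background under an added assumption (that $p_{\mathrm{test}}$ has a density ratio $w$ with respect to $p_{\mathrm{train}}$), not a description of the paper's algorithms, which the abstract does not detail. The symbols $w$, $p_{\mathrm{train}}$, and $n$ are introduced here for illustration.

```latex
% Assumption: w(x) = p_test(x) / p_train(x) exists and is finite wherever p_train(x) > 0.
\begin{align}
\mathbb{E}_{\tilde{\mathbf{x}} \sim p_{\mathrm{test}}} \mathbf{f}(\tilde{\mathbf{x}})
  &= \int \mathbf{f}(\mathbf{x}) \, p_{\mathrm{test}}(\mathbf{x}) \, d\mathbf{x}
   = \int \mathbf{f}(\mathbf{x}) \, w(\mathbf{x}) \, p_{\mathrm{train}}(\mathbf{x}) \, d\mathbf{x}
   = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{train}}} \bigl[ w(\mathbf{x}) \, \mathbf{f}(\mathbf{x}) \bigr], \\
\widehat{\mu}
  &= \frac{1}{n} \sum_{i=1}^{n} w(\mathbf{x}_i) \, \mathbf{f}(\mathbf{x}_i),
  \qquad \mathbf{x}_i \sim p_{\mathrm{train}} .
\end{align}
```

In practice $w$ is unknown and has to be estimated from the labeled training samples and the unlabeled test samples, which is where sample-complexity questions of the kind the abstract mentions arise.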

[LG-15] Efficiently Solving Discounted MDPs with Predictions on Transition Matrices

Link: https://arxiv.org/abs/2502.15345
Authors: Lixing Lyu, Jiashuo Jiang, Wang Chi Cheung
Categories: Machine Learning (cs.LG)

Abstract:We study infinite-horizon Discounted Markov Decision Processes (DMDPs) under a generative model. Motivated by the Algorithm with Advice framework (Mitzenmacher and Vassilvitskii 2022), we propose a novel framework to investigate how a prediction on the transition matrix can enhance the sample efficiency in solving DMDPs and improve sample complexity bounds. We focus on the DMDPs with $N$ state-action pairs and discount factor $\gamma$. Firstly, we provide an impossibility result that, without prior knowledge of the prediction accuracy, no sampling policy can compute an $\epsilon$-optimal policy with a sample complexity bound better than $\tilde{O}((1-\gamma)^{-3} N \epsilon^{-2})$, which matches the state-of-the-art minimax sample complexity bound with no prediction. In complement, we propose an algorithm based on minimax optimization techniques that leverages the prediction on the transition matrix. Our algorithm achieves a sample complexity bound depending on the prediction error, and the bound is uniformly better than $\tilde{O}((1-\gamma)^{-4} N \epsilon^{-2})$, the previous best result derived from convex optimization methods. These theoretical findings are further supported by our numerical experiments.

[LG-16] Learning with Limited Shared Information in Multi-agent Multi-armed Bandit

Link: https://arxiv.org/abs/2502.15338
Authors: Junning Shao, Siwei Wang, Zhixuan Fang
Categories: Machine Learning (cs.LG)

Abstract:Multi-agent multi-armed bandit (MAMAB) is a classic collaborative learning model and has gained much attention in recent years. However, existing studies do not consider the case where an agent may refuse to share all her information with others, e.g., when some of the data contains personal privacy. In this paper, we propose a novel limited shared information multi-agent multi-armed bandit (LSI-MAMAB) model in which each agent only shares the information that she is willing to share, and propose the Balanced-ETC algorithm to help multiple agents collaborate efficiently with limited shared information. Our analysis shows that Balanced-ETC is asymptotically optimal and its average regret (on each agent) approaches a constant when there are sufficient agents involved. Moreover, to encourage agents to participate in this collaborative learning, an incentive mechanism is proposed to make sure each agent can benefit from the collaboration system. Finally, we present experimental results to validate our theoretical results.

[LG-17] Tight Clusters Make Specialized Experts

Link: https://arxiv.org/abs/2502.15315
Authors: Stefan K. Nielsen, Rachel S.Y. Teo, Laziz U. Abdullaev, Tan M. Nguyen
Categories: Machine Learning (cs.LG)

Abstract:Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings.

[LG-18] Hyperspherical Normalization for Scalable Deep Reinforcement Learning

Link: https://arxiv.org/abs/2502.15280
Authors: Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, Jaegul Choo
Categories: Machine Learning (cs.LG)
Comments: 50 pages. Preprint

Abstract:Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norm by hyperspherical normalization; and (ii) using a distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains. The code is available at this https URL.
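
As a rough illustration of the idea of constraining weight norms on a hypersphere, the sketch below takes a plain gradient step and then rescales the weight vector back to unit L2 norm. The abstract does not give SimbaV2's actual layers or update rule, so this is a generic projection step under that assumption; `l2_normalize` and `sgd_step_on_sphere` are hypothetical names.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Rescale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def sgd_step_on_sphere(weights, grads, lr=0.1):
    """One plain gradient step, then projection back onto the unit sphere.

    Generic illustration only; not the SimbaV2 update rule.
    """
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return l2_normalize(updated)


# Toy usage: the norm stays at 1 no matter how large the gradient is.
w = sgd_step_on_sphere([3.0, 4.0], [10.0, -20.0], lr=0.1)
w_norm = math.sqrt(sum(x * x for x in w))
print(w, w_norm)
```

Because the projection runs after every step, the weight norm cannot drift upward during training, which is the kind of growth the abstract says the architecture is designed to constrain.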

[LG-19] Towards a Reward-Free Reinforcement Learning Framework for Vehicle Control

Link: https://arxiv.org/abs/2502.15262
Authors: Jielong Yang, Daoyuan Huang
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning plays a crucial role in vehicle control by guiding agents to learn optimal control strategies through designing or learning appropriate reward signals. However, in vehicle control applications, rewards typically need to be manually designed while considering multiple implicit factors, which easily introduces human biases. Although imitation learning methods do not rely on explicit reward signals, they necessitate high-quality expert actions, which are often challenging to acquire. To address these issues, we propose a reward-free reinforcement learning framework (RFRLF). This framework directly learns the target states to optimize agent behavior through a target state prediction network (TSPN) and a reward-free state-guided policy network (RFSGPN), avoiding the dependence on manually designed reward signals. Specifically, the policy network is learned via minimizing the differences between the predicted state and the expert state. Experimental results demonstrate the effectiveness of the proposed RFRLF in controlling vehicle driving, showing its advantages in improving learning efficiency and adapting to reward-free environments.

[LG-20] Real-Time Moving Flock Detection in Pedestrian Trajectories Using Sequential Deep Learning Models

Link: https://arxiv.org/abs/2502.15252
Authors: Amartaivan Sanjjamts, Hiroshi Morita, Togootogtokh Enkhtogtokh
Categories: Machine Learning (cs.LG)

Abstract:Understanding collective pedestrian movement is crucial for applications in crowd management, autonomous navigation, and human-robot interaction. This paper investigates the use of sequential deep learning models, including Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Transformers, for real-time flock detection in multi-pedestrian trajectories. Our proposed approach consists of a two-stage process: first, a pre-trained binary classification model is used for pairwise trajectory classification, and second, the learned representations are applied to identify multi-agent flocks dynamically. We validate our method using real-world group movement datasets, demonstrating its robustness across varying sequence lengths and diverse movement patterns. Experimental results indicate that our model consistently detects pedestrian flocks with high accuracy and stability, even in dynamic and noisy environments. Furthermore, we extend our approach to identify other forms of collective motion, such as convoys and swarms, paving the way for more comprehensive multi-agent behavior analysis.

[LG-21] Steganographic Embeddings as an Effective Data Augmentation

Link: https://arxiv.org/abs/2502.15245
Authors: Nicholas DiSalvo
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 10 pages, 4 figures. For associated code and experiments, see this http URL this https URL

Abstract:Image Steganography is a cryptographic technique that embeds secret information into an image, ensuring the hidden data remains undetectable to the human eye while preserving the image’s original visual integrity. Least Significant Bit (LSB) Steganography achieves this by replacing the k least significant bits of an image with the k most significant bits of a secret image, maintaining the appearance of the original image while simultaneously encoding the essential elements of the hidden data. In this work, we shift away from conventional applications of steganography in deep learning and explore its potential from a new angle. We present experimental results on CIFAR-10 showing that LSB Steganography, when used as a data augmentation strategy for downstream computer vision tasks such as image classification, can significantly improve the training efficiency of deep neural networks. It can also act as an implicit, uniformly discretized piecewise linear approximation of color augmentations such as brightness, contrast, hue, and saturation, without introducing additional training overhead through a new joint image training regime that disregards the need for tuning sensitive augmentation hyperparameters.
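
The bit-level operation described in this abstract (replace the k least significant bits of a cover pixel with the k most significant bits of a secret pixel) can be sketched on flat lists of 8-bit pixel values. This is a minimal illustration of LSB embedding only; how the paper feeds such images into training is not shown, and `lsb_embed`, `lsb_extract`, and the toy pixel values are hypothetical.

```python
def lsb_embed(cover, secret, k=4):
    """Replace the k low bits of each 8-bit cover pixel with the
    k high bits of the matching secret pixel."""
    if not 1 <= k <= 7:
        raise ValueError("k must be between 1 and 7")
    if len(cover) != len(secret):
        raise ValueError("cover and secret must have the same length")
    keep_mask = (0xFF << k) & 0xFF  # keeps the top 8-k bits of the cover
    return [(c & keep_mask) | (s >> (8 - k)) for c, s in zip(cover, secret)]


def lsb_extract(stego, k=4):
    """Recover the secret's top k bits; its low 8-k bits come back as zero."""
    low_mask = (1 << k) - 1
    return [(p & low_mask) << (8 - k) for p in stego]


# Toy 8-bit pixel values (hypothetical data, not from the paper).
cover = [200, 13, 255, 64]
secret = [255, 128, 17, 90]

stego = lsb_embed(cover, secret, k=4)
recovered = lsb_extract(stego, k=4)
print(stego)      # cover pixels perturbed by less than 2**k
print(recovered)  # coarse version of the secret
```

With k = 4, each cover pixel changes by less than 16 intensity levels, and the recovered secret is accurate to within the same margin, which is the trade-off between invisibility and payload fidelity that k controls.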

[LG-22] Multi-agent Multi-armed Bandits with Minimum Reward Guarantee Fairness

Link: https://arxiv.org/abs/2502.15240
Authors: Piyushi Manupriya, Himanshu, SakethaNath Jagarlapudi, Ganesh Ghalme
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:We investigate the problem of maximizing social welfare while ensuring fairness in a multi-agent multi-armed bandit (MA-MAB) setting. In this problem, a centralized decision-maker takes actions over time, generating random rewards for various agents. Our goal is to maximize the sum of expected cumulative rewards, a.k.a. social welfare, while ensuring that each agent receives an expected reward that is at least a constant fraction of the maximum possible expected reward. Our proposed algorithm, RewardFairUCB, leverages the Upper Confidence Bound (UCB) technique to achieve sublinear regret bounds for both fairness and social welfare. The fairness regret measures the positive difference between the minimum reward guarantee and the expected reward of a given policy, whereas the social welfare regret measures the difference between the social welfare of the optimal fair policy and that of the given policy. We show that RewardFairUCB algorithm achieves instance-independent social welfare regret guarantees of $\tilde{O}(T^{1/2})$ and a fairness regret upper bound of $\tilde{O}(T^{3/4})$. We also give the lower bound of $\Omega(\sqrt{T})$ for both social welfare and fairness regret. We evaluate RewardFairUCB’s performance against various baseline and heuristic algorithms using simulated data and real world data, highlighting trade-offs between fairness and social welfare regrets.

[LG-23] Graph-Based Deep Learning on Stereo EEG for Predicting Seizure Freedom in Epilepsy Patients

Link: https://arxiv.org/abs/2502.15198
Authors: Artur Agaronyan, Syeda Abeera Amir, Nunthasiri Wittayanakorn, John Schreiber, Marius G. Linguraru, William Gaillard, Chima Oluigbo, Syed Muhammad Anwar
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)

Abstract:Predicting seizure freedom is essential for tailoring epilepsy treatment. But accurate prediction remains challenging with traditional methods, especially with diverse patient populations. This study developed a deep learning-based graph neural network (GNN) model to predict seizure freedom from stereo electroencephalography (sEEG) data in patients with refractory epilepsy. We utilized high-quality sEEG data from 15 pediatric patients to train a deep learning model that can accurately predict seizure freedom outcomes and advance understanding of brain connectivity at the seizure onset zone. Our model integrates local and global connectivity using graph convolutions with multi-scale attention mechanisms to capture connections between difficult-to-study regions such as the thalamus and motor regions. The model achieved an accuracy of 92.4% in binary class analysis, 86.6% in patient-wise analysis, and 81.4% in multi-class analysis. Node and edge-level feature analysis highlighted the anterior cingulate and frontal pole regions as key contributors to seizure freedom outcomes. The nodes identified by our model were also more likely to coincide with seizure onset zones. Our findings underscore the potential of new connectivity-based deep learning models such as GNNs for enhancing the prediction of seizure freedom, predicting seizure onset zones, connectivity analysis of the brain during seizure, as well as informing AI-assisted personalized epilepsy treatment planning.

[LG-24] Optimizing Product Provenance Verification using Data Valuation Methods

Link: https://arxiv.org/abs/2502.15177
Authors: Raquib Bin Yousuf, Hoang Anh Just, Shengzhe Xu, Brian Mayer, Victor Deklerck, Jakub Truszkowski, John C. Simeone, Jade Saunders, Chang-Tien Lu, Ruoxi Jia, Naren Ramakrishnan
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:Determining and verifying product provenance remains a critical challenge in global supply chains, particularly as geopolitical conflicts and shifting borders create new incentives for misrepresentation of commodities, such as hiding the origin of illegally harvested timber or stolen agricultural products. Stable Isotope Ratio Analysis (SIRA), combined with Gaussian process regression-based isoscapes, has emerged as a powerful tool for geographic origin verification. However, the effectiveness of these models is often constrained by data scarcity and suboptimal dataset selection. In this work, we introduce a novel data valuation framework designed to enhance the selection and utilization of training data for machine learning models applied in SIRA. By prioritizing high-informative samples, our approach improves model robustness and predictive accuracy across diverse datasets and geographies. We validate our methodology with extensive experiments, demonstrating its potential to significantly enhance provenance verification, mitigate fraudulent trade practices, and strengthen regulatory enforcement of global supply chains.

[LG-25] Data Complexity Measures for Quantum Circuits Architecture Recommendation

Link: https://arxiv.org/abs/2502.15129
Authors: Fernando M de Paula Neto
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:Quantum Parametric Circuits are constructed as an alternative to reduce the size of quantum circuits, meaning to decrease the number of quantum gates and, consequently, the depth of these circuits. However, determining the optimal circuit for a given problem remains an open question. Testing various combinations is challenging due to the infinite possibilities. In this work, a quantum circuit recommendation architecture for classification problems is proposed using database complexity measures. A quantum circuit is defined based on a circuit layer and the number of times this layer is iterated. Fourteen databases of varying dimensions and different numbers of classes were used to evaluate six quantum circuits, each with 1, 2, 3, 4, 8, and 16-layer repetitions. Using data complexity measures from the databases, it was possible to identify the optimal circuit capable of solving all problems with up to 100% accuracy. Furthermore, with a mean absolute error of $0.80 \pm 2.17$, one determined the appropriate number of layer repetitions, allowing for an error margin of up to three additional layers. Sixteen distinct machine learning models were employed for the selection of quantum circuits, alongside twelve classical regressor models to dynamically define the number of layers.

[LG-26] Curvature Corrected Nonnegative Manifold Data Factorization

Link: https://arxiv.org/abs/2502.15124
Authors: Joyce Chew, Willem Diepeveen, Deanna Needell
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Differential Geometry (math.DG)

Abstract:Data with underlying nonlinear structure are collected across numerous application domains, necessitating new data processing and analysis methods adapted to nonlinear domain structure. Riemannian manifolds present a rich environment in which to develop such tools, as manifold-valued data arise in a variety of scientific settings, and Riemannian geometry provides a solid theoretical grounding for geometric data analysis. Low-rank approximations, such as nonnegative matrix factorization (NMF), are the foundation of many Euclidean data analysis methods, so adaptations of these factorizations for manifold-valued data are important building blocks for further development of manifold data analysis. In this work, we propose curvature corrected nonnegative manifold data factorization (CC-NMDF) as a geometry-aware method for extracting interpretable factors from manifold-valued data, analogous to nonnegative matrix factorization. We develop an efficient iterative algorithm for computing CC-NMDF and demonstrate our method on real-world diffusion tensor magnetic resonance imaging data.

[LG-27] MONSTER: Monash Scalable Time Series Evaluation Repository

Link: https://arxiv.org/abs/2502.15122
Authors: Angus Dempster, Navid Mohammadi Foumani, Chang Wei Tan, Lynn Miller, Amish Mishra, Mahsa Salehi, Charlotte Pelletier, Daniel F. Schmidt, Geoffrey I. Webb
Categories: Machine Learning (cs.LG)
Comments: 45 pages; 38 figures

Abstract:We introduce MONSTER (the MONash Scalable Time Series Evaluation Repository), a collection of large datasets for time series classification. The field of time series classification has benefitted from common benchmarks set by the UCR and UEA time series classification repositories. However, the datasets in these benchmarks are small, with median sizes of 217 and 255 examples, respectively. In consequence they favour a narrow subspace of models that are optimised to achieve low classification error on a wide variety of smaller datasets, that is, models that minimise variance, and give little weight to computational issues such as scalability. Our hope is to diversify the field by introducing benchmarks using larger datasets. We believe that there is enormous potential for new progress in the field by engaging with the theoretical and practical challenges of learning effectively from larger quantities of data.

[LG-28] Leveraging ChatGPT for Sponsored Ad Detection and Keyword Extraction in YouTube Videos

Link: https://arxiv.org/abs/2502.15102
Authors: Brice Valentin Kok-Shun, Johnny Chan
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, accepted and presented in the 10th IEEE International Conference on Sustainable Technology and Engineering

Abstract:This work-in-progress paper presents a novel approach to detecting sponsored advertisement segments in YouTube videos and comparing the advertisement with the main content. Our methodology involves the collection of 421 auto-generated and manual transcripts which are then fed into a prompt-engineered GPT-4o for ad detection, a KeyBERT for keyword extraction, and another iteration of ChatGPT for category identification. The results revealed a significant prevalence of product-related ads across various educational topics, with ad categories refined using GPT-4o into succinct 9 content and 4 advertisement categories. This approach provides a scalable and efficient alternative to traditional ad detection methods while offering new insights into the types and relevance of ads embedded within educational content. This study highlights the potential of LLMs in transforming ad detection processes and improving our understanding of advertisement strategies in digital media.

[LG-29] More for Keys Less for Values: Adaptive KV Cache Quantization

Link: https://arxiv.org/abs/2502.15075
Authors: Mohsen Hariri, Lam Nguyen, Sixu Chen, Shaochen Zhong, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
Categories: Machine Learning (cs.LG)

Abstract:This paper introduces an information-aware quantization framework that adaptively compresses the key-value (KV) cache in large language models (LLMs). Although prior work has underscored the distinct roles of key and value cache during inference, our systematic analysis – examining singular value distributions, spectral norms, and Frobenius norms – reveals, for the first time, that key matrices consistently exhibit higher norm values and are more sensitive to quantization than value matrices. Furthermore, our theoretical analysis shows that matrices with higher spectral norms amplify quantization errors more significantly. Motivated by these insights, we propose a mixed-precision quantization strategy, KV-AdaQuant, which allocates more bit-width for keys and fewer for values since key matrices have higher norm values. With the same total KV bit budget, this approach effectively mitigates error propagation across transformer layers while achieving significant memory savings. Our extensive experiments on multiple LLMs (1B–70B) demonstrate that our mixed-precision quantization scheme maintains high model accuracy even under aggressive compression. For instance, using 4-bit for Key and 2-bit for Value achieves an accuracy of 75.2%, whereas reversing the assignment (2-bit for Key and 4-bit for Value) yields only 54.7% accuracy. The code is available at this https URL
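
To give a feel for why bit allocation matters, the sketch below applies generic min-max uniform quantization at 2 and 4 bits to two toy vectors, one with a wide value range standing in for keys and one with a narrow range standing in for values. This is not KV-AdaQuant and does not reproduce the paper's spectral-norm analysis; the function names and the toy numbers are hypothetical.

```python
def quantize_dequantize(values, bits):
    """Min-max uniform quantization to 2**bits levels, mapped back to floats."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def max_abs_error(values, bits):
    """Largest absolute round-trip error at the given bit width."""
    approx = quantize_dequantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, approx))


# Hypothetical toy entries: "keys" span a wider range than "values".
keys = [-8.0, -3.1, 0.4, 2.7, 7.9]
vals = [-1.0, -0.4, 0.05, 0.3, 0.9]

for name, data in (("keys", keys), ("values", vals)):
    for bits in (2, 4):
        print(name, bits, "bits -> max abs error", max_abs_error(data, bits))
```

At a fixed bit width the step size grows with the value range, so the wider-range vector suffers a larger absolute error. That is one simple intuition for spending more bits on the larger-magnitude tensor, in line with the asymmetric key/value allocation the abstract reports.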

[LG-30] Visualizing Machine Learning Models for Enhanced Financial Decision-Making and Risk Management

Link: https://arxiv.org/abs/2502.15073
Authors: Priyam Ganguly, Ramakrishna Garine, Isha Mukherjee
Categories: Machine Learning (cs.LG)

Abstract:This study emphasizes how crucial it is to visualize machine learning models, especially for the banking industry, in order to improve interpretability and support predictions in high stakes financial settings. Visual tools enable performance improvements and support the creation of innovative financial models by offering crucial insights into the algorithmic decision-making processes. Within a financial machine learning framework, the research uses visually guided experiments to make important concepts, such as risk assessment and portfolio allocation, more understandable. The study also examines variations in trading tactics and how they relate to risk appetite, coming to the conclusion that the frequency of portfolio rebalancing is negatively correlated with risk tolerance. Finding these ideas is made possible in large part by visualization. The study concludes by presenting a novel method of locally stochastic asset weighing, where visualization facilitates data extraction and validation. This highlights the usefulness of these methods in furthering the field of financial machine learning research.

[LG-31] GiGL: Large-Scale Graph Neural Networks at Snapchat

Link: https://arxiv.org/abs/2502.15054
Authors: Tong Zhao, Yozen Liu, Matthew Kolodner, Kyle Montemayor, Elham Ghazizadeh, Ankit Batra, Zihao Fan, Xiaobin Gao, Xuan Guo, Jiwen Ren, Serim Park, Peicheng Yu, Jun Yu, Shubham Vij, Neil Shah
Categories: Machine Learning (cs.LG)

Abstract:Recent advances in graph machine learning (ML) with the introduction of Graph Neural Networks (GNNs) have led to a widespread interest in applying these approaches to business applications at scale. GNNs enable differentiable end-to-end (E2E) learning of model parameters given graph structure which enables optimization towards popular node, edge (link) and graph-level tasks. While the research innovation in new GNN layers and training strategies has been rapid, industrial adoption and utility of GNNs has lagged considerably due to the unique scale challenges that large-scale graph ML problems create. In this work, we share our approach to training, inference, and utilization of GNNs at Snapchat. To this end, we present GiGL (Gigantic Graph Learning), an open-source library to enable large-scale distributed graph ML to the benefit of researchers, ML engineers, and practitioners. We use GiGL internally at Snapchat to manage the heavy lifting of GNN workflows, including graph data preprocessing from relational DBs, subgraph sampling, distributed training, inference, and orchestration. GiGL is designed to interface cleanly with open-source GNN modeling libraries prominent in academia like PyTorch Geometric (PyG), while handling scaling and productionization challenges that make it easier for internal practitioners to focus on modeling. GiGL is used in multiple production settings, and has powered over 35 launches across multiple business domains in the last 2 years in the contexts of friend recommendation, content recommendation and advertising. This work details high-level design and tools the library provides, scaling properties, case studies in diverse business settings with industry-scale graphs, and several key lessons learned in employing graph ML at scale on large social data. GiGL is open-sourced at this https URL.

[LG-32] Approximating Latent Manifolds in Neural Networks via Vanishing Ideals

Link: https://arxiv.org/abs/2502.15051
Authors: Nico Pelleriti, Max Zimmer, Elias Wirth, Sebastian Pokutta
Categories: Machine Learning (cs.LG)
Comments: 26 pages (8 main body, rest appendix and references), 12 figures, 3 tables, 3 algorithms

Abstract:Deep neural networks have reshaped modern machine learning by learning powerful latent representations that often align with the manifold hypothesis: high-dimensional data lie on lower-dimensional manifolds. In this paper, we establish a connection between manifold learning and computational algebra by demonstrating how vanishing ideals can characterize the latent manifolds of deep networks. To that end, we propose a new neural architecture that (i) truncates a pretrained network at an intermediate layer, (ii) approximates each class manifold via polynomial generators of the vanishing ideal, and (iii) transforms the resulting latent space into linearly separable features through a single polynomial layer. The resulting models have significantly fewer layers than their pretrained baselines, while maintaining comparable accuracy, achieving higher throughput, and utilizing fewer parameters. Furthermore, drawing on spectral complexity analysis, we derive sharper theoretical guarantees for generalization, showing that our approach can in principle offer tighter bounds than standard deep networks. Numerical experiments confirm the effectiveness and efficiency of the proposed approach.

[LG-33] GeoAggregator: An Efficient Transformer Model for Geo-Spatial Tabular Data AAAI2025

Link: https://arxiv.org/abs/2502.15032
Authors: Rui Deng, Ziqi Li, Mingshu Wang
Categories: Machine Learning (cs.LG)
Comments: Accepted in the main technical track of the AAAI 2025

Abstract:Modeling geospatial tabular data with deep learning has become a promising alternative to traditional statistical and machine learning approaches. However, existing deep learning models often face challenges related to scalability and flexibility as datasets grow. To this end, this paper introduces GeoAggregator, an efficient and lightweight algorithm based on transformer architecture designed specifically for geospatial tabular data modeling. GeoAggregators explicitly account for spatial autocorrelation and spatial heterogeneity through Gaussian-biased local attention and global positional awareness. Additionally, we introduce a new attention mechanism that uses the Cartesian product to manage the size of the model while maintaining strong expressive power. We benchmark GeoAggregator against spatial statistical models, XGBoost, and several state-of-the-art geospatial deep learning methods using both synthetic and empirical geospatial datasets. The results demonstrate that GeoAggregators achieve the best or second-best performance compared to their competitors on nearly all datasets. GeoAggregator’s efficiency is underscored by its reduced model size, making it both scalable and lightweight. Moreover, ablation experiments offer insights into the effectiveness of the Gaussian bias and Cartesian attention mechanism, providing recommendations for further optimizing the GeoAggregator’s performance.

[LG-34] Low degree conjecture implies sharp computational thresholds in stochastic block model

Link: https://arxiv.org/abs/2502.15024
Authors: Jingqiu Ding, Yiding Hua, Lucas Slot, David Steurer
Categories: Computational Complexity (cs.CC); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
Comments: 33 pages

Abstract: We investigate implications of the (extended) low-degree conjecture (recently formalized in [MW23]) in the context of the symmetric stochastic block model. Assuming the conjecture holds, we establish that no polynomial-time algorithm can weakly recover community labels below the Kesten-Stigum (KS) threshold. In particular, we rule out polynomial-time estimators that, with constant probability, achieve correlation with the true communities that is significantly better than random. By contrast, above the KS threshold, polynomial-time algorithms are known to achieve constant correlation with the true communities with high probability [Mas14, AS15]. To our knowledge, we provide the first rigorous evidence for the sharp transition in recovery rate for polynomial-time algorithms at the KS threshold. Notably, under a stronger version of the low-degree conjecture, our lower bound remains valid even when the number of blocks diverges. Furthermore, our results provide evidence of a computational-to-statistical gap in learning the parameters of stochastic block models. In contrast to prior work, which either (i) rules out polynomial-time algorithms for hypothesis testing with 1-o(1) success probability [Hopkins18, BBK+21a] under the low-degree conjecture, or (ii) rules out low-degree polynomials for learning the edge connection probability matrix [LG23], our approach provides stronger lower bounds on the recovery and learning problem. Our proof combines low-degree lower bounds from [Hopkins18, BBK+21a] with graph splitting and cross-validation techniques. In order to rule out general recovery algorithms, we employ the correlation preserving projection method developed in [HS17].
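For orientation, the KS threshold has a simple closed form in the standard sparse parametrization of the symmetric block model. The sketch below assumes k equal-sized communities on n nodes, within-community edge probability a/n and between-community probability b/n; the names `a`, `b`, `k` and both function names are illustrative choices, not notation taken from the paper.

```python
def ks_signal_to_noise(a: float, b: float, k: int) -> float:
    """Return d * lambda^2 for the symmetric SBM with parameters a/n and b/n.

    d is the average degree and lambda the second eigenvalue ratio;
    the Kesten-Stigum condition is d * lambda^2 > 1.
    """
    d = (a + (k - 1) * b) / k          # average degree
    lam = (a - b) / (k * d)            # normalized community signal
    return d * lam ** 2

def above_ks_threshold(a: float, b: float, k: int = 2) -> bool:
    """True when (a - b)^2 > k * (a + (k - 1) * b), i.e. above the KS threshold."""
    return ks_signal_to_noise(a, b, k) > 1.0

# Two communities: a=5, b=1 lies above the threshold, a=3, b=1 below it.
print(above_ks_threshold(5, 1), above_ks_threshold(3, 1))
```

For k = 2 the condition reduces to the familiar (a - b)^2 > 2(a + b).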

[LG-35] Interpreting Adversarial Attacks and Defences using Architectures with Enhanced Interpretability AAAI

Link: https://arxiv.org/abs/2502.15017
Authors: Akshay G Rao, Chandrashekhar Lakshminarayanan, Arun Rajkumar
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Publication accepted at AAAI Deployable AI conference 2025 (proof - this https URL). Total 17 pages

Abstract:Adversarial attacks in deep learning represent a significant threat to the integrity and reliability of machine learning models. Adversarial training has been a popular defence technique against these adversarial attacks. In this work, we capitalize on a network architecture, namely Deep Linearly Gated Networks (DLGN), which has better interpretation capabilities than regular deep network architectures. Using this architecture, we interpret robust models trained using PGD adversarial training and compare them with standard training. Feature networks in DLGN act as feature extractors, making them the only medium through which an adversary can attack the model. We analyze the feature network of DLGN with fully connected layers with respect to properties like alignment of the hyperplanes, hyperplane relation with PCA, and sub-network overlap among classes and compare these properties between robust and standard models. We also consider this architecture having CNN layers wherein we qualitatively (using visualizations) and quantitatively contrast gating patterns between robust and standard models. We uncover insights into hyperplanes resembling principal components in PGD-AT and STD-TR models, with PGD-AT hyperplanes aligned farther from the data points. We use path activity analysis to show that PGD-AT models create diverse, non-overlapping active subnetworks across classes, preventing attack-induced gating overlaps. Our visualization ideas show the nature of representations learnt by PGD-AT and STD-TR models.

[LG-36] TimeDistill: Efficient Long-Term Time Series Forecasting with MLP via Cross-Architecture Distillation

Link: https://arxiv.org/abs/2502.15016
Authors: Juntong Ni, Zewen Liu, Shiyu Wang, Ming Jin, Wei Jin
Categories: Machine Learning (cs.LG)

Abstract:Transformer-based and CNN-based methods demonstrate strong performance in long-term time series forecasting. However, their high computational and storage requirements can hinder large-scale deployment. To address this limitation, we propose integrating lightweight MLP with advanced architectures using knowledge distillation (KD). Our preliminary study reveals different models can capture complementary patterns, particularly multi-scale and multi-period patterns in the temporal and frequency domains. Based on this observation, we introduce TimeDistill, a cross-architecture KD framework that transfers these patterns from teacher models (e.g., Transformers, CNNs) to MLP. Additionally, we provide a theoretical analysis, demonstrating that our KD approach can be interpreted as a specialized form of mixup data augmentation. TimeDistill improves MLP performance by up to 18.6%, surpassing teacher models on eight datasets. It also achieves up to 7X faster inference and requires 130X fewer parameters. Furthermore, we conduct extensive evaluations to highlight the versatility and effectiveness of TimeDistill.
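The remark that distillation can be read as a form of mixup has a simple analogue for plain output-level distillation with squared error. The sketch below is not TimeDistill's multi-scale, multi-period objective; it only checks the elementary identity that a weighted sum of the ground-truth loss and the teacher loss equals the loss against a blended target, plus a term that does not depend on the student. The function names and the weight `alpha` are illustrative assumptions.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def kd_loss(student, truth, teacher, alpha):
    """Generic distillation loss: supervised term plus teacher-matching term."""
    return alpha * mse(student, truth) + (1 - alpha) * mse(student, teacher)

def mixed_target_loss(student, truth, teacher, alpha):
    """Loss against a target that blends ground truth and teacher forecasts."""
    mixed = [alpha * y + (1 - alpha) * t for y, t in zip(truth, teacher)]
    return mse(student, mixed)

def student_independent_gap(truth, teacher, alpha):
    """Constant offset between the two losses; it does not involve the student."""
    return alpha * (1 - alpha) * mse(truth, teacher)

# Hypothetical forecasts over a horizon of four steps.
truth = [1.0, 2.0, 3.0, 4.0]
teacher = [1.2, 1.9, 3.3, 3.8]
student = [0.8, 2.4, 2.9, 4.5]
alpha = 0.3

lhs = kd_loss(student, truth, teacher, alpha)
rhs = mixed_target_loss(student, truth, teacher, alpha) + student_independent_gap(truth, teacher, alpha)
print(abs(lhs - rhs) < 1e-12)
```

Since the offset is constant in the student's parameters, minimizing the distillation loss is equivalent to fitting the blended target, which is why a label-mixing interpretation is plausible in this simplified setting.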

[LG-37] Accelerating Neural Network Training: An Analysis of the AlgoPerf Competition ICLR2025

Link: https://arxiv.org/abs/2502.15015
Authors: Priya Kasimbeg, Frank Schneider, Runa Eschenhagen, Juhan Bae, Chandramouli Shama Sastry, Mark Saroufim, Boyuan Feng, Less Wright, Edward Z. Yang, Zachary Nado, Sourabh Medapati, Philipp Hennig, Michael Rabbat, George E. Dahl
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICLR 2025; 23 pages, 5 figures, 8 tables

Abstract:The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition’s results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.

[LG-38] Understanding the Design Principles of Link Prediction in Directed Settings

Link: https://arxiv.org/abs/2502.15008
Authors: Jun Zhai, Muberra Ozmen, Thomas Markovich
Categories: Machine Learning (cs.LG)

Abstract:Link prediction is a widely studied task in Graph Representation Learning (GRL) for modeling relational data. The early theories in GRL were based on the assumption of a symmetric adjacency matrix, reflecting an undirected setting. As a result, much of the following state-of-the-art research has continued to operate under this symmetry assumption, even though real-world data often involve crucial information conveyed through the direction of relationships. This oversight limits the ability of these models to fully capture the complexity of directed interactions. In this paper, we focus on the challenge of directed link prediction by evaluating key heuristics that have been successful in undirected settings. We propose simple but effective adaptations of these heuristics to the directed link prediction task and demonstrate that these modifications produce competitive performance compared to the leading Graph Neural Networks (GNNs) originally designed for undirected graphs. Through an extensive set of experiments, we derive insights that inform the development of a novel framework for directed link prediction, which not only surpasses baseline methods but also outperforms state-of-the-art GNNs on multiple benchmarks.
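To make the idea of adapting an undirected heuristic concrete, the sketch below shows one possible directed variant of the common-neighbors score. It is an assumption for illustration, not necessarily the adaptation the paper proposes: for a candidate edge u -> v it counts intermediate nodes w with u -> w and w -> v, so the score is no longer symmetric in u and v.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Return (successors, predecessors) maps for a list of directed edges."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    return succ, pred

def directed_common_neighbors(succ, pred, u, v):
    """Count nodes w that lie on a two-step directed path u -> w -> v."""
    return len(succ.get(u, set()) & pred.get(v, set()))

# Toy graph: two directed paths from "a" to "c".
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c")]
succ, pred = build_adjacency(edges)

print(directed_common_neighbors(succ, pred, "a", "c"))  # score for a -> c
print(directed_common_neighbors(succ, pred, "c", "a"))  # score for c -> a
```

An undirected common-neighbors score would treat both candidates identically; the directed version separates them.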

[LG-39] Generative Modeling of Individual Behavior at Scale

Link: https://arxiv.org/abs/2502.14998
Authors: Nabil Omi, Lucas Caccia, Anurag Sarkar, Jordan T. Ash, Siddhartha Sen
Categories: Machine Learning (cs.LG)

Abstract:There has been a growing interest in using AI to model human behavior, particularly in domains where humans interact with this technology. While most existing work models human behavior at an aggregate level, our goal is to model behavior at the individual level. Recent approaches to behavioral stylometry – or the task of identifying a person from their actions alone – have shown promise in domains like chess, but these approaches are either not scalable (e.g., fine-tune a separate model for each person) or not generative, in that they cannot generate actions. We address these limitations by framing behavioral stylometry as a multi-task learning problem – where each task represents a distinct person – and use parameter-efficient fine-tuning (PEFT) methods to learn an explicit style vector for each person. Style vectors are generative: they selectively activate shared “skill” parameters to generate actions in the style of each person. They also induce a latent space that we can interpret and manipulate algorithmically. In particular, we develop a general technique for style steering that allows us to steer a player’s style vector towards a desired property. We apply our approach to two very different games, at unprecedented scales: chess (47,864 players) and Rocket League (2,000 players). We also show generality beyond gaming by applying our method to image generation, where we learn style vectors for 10,177 celebrities and use these vectors to steer their images.

[LG-40] P2W: From Power Traces to Weights Matrix – An Unconventional Transfer Learning Approach

Link: https://arxiv.org/abs/2502.14968
Authors: Roozbeh Siyadatzadeh, Fatemeh Mehrafrooz, Nele Mentens, Todor Stefanov
Categories: Machine Learning (cs.LG)

Abstract:The rapid growth of deploying machine learning (ML) models within embedded systems on a chip (SoCs) has led to transformative shifts in fields like healthcare and autonomous vehicles. One of the primary challenges for training such embedded ML models is the lack of publicly available high-quality training data. Transfer learning approaches address this challenge by utilizing the knowledge encapsulated in an existing ML model as a starting point for training a new ML model. However, existing transfer learning approaches require direct access to the existing model which is not always feasible, especially for ML models deployed on embedded SoCs. Therefore, in this paper, we introduce a novel unconventional transfer learning approach to train a new ML model by extracting and using weights from an existing ML model running on an embedded SoC without having access to the model within the SoC. Our approach captures power consumption measurements from the SoC while it is executing the ML model and translates them to an approximated weights matrix used to initialize the new ML model. This improves the learning efficiency and predictive performance of the new model, especially in scenarios with limited data available to train the model. Our novel approach can effectively increase the accuracy of the new ML model up to 3 times compared to classical training methods using the same amount of limited training data.

[LG-41] Sparks of cognitive flexibility: self-guided context inference for flexible stimulus-response mapping by attentional routing

Link: https://arxiv.org/abs/2502.15634
Authors: Rowan Sommers, Sushrut Thorat, Daniel Anthes, Tim C. Kietzmann
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures

Abstract: Flexible cognition demands discovering hidden rules to quickly adapt stimulus-response mappings. Standard neural networks struggle in tasks requiring rapid, context-driven remapping. Recently, Hummos (2023) introduced a fast-and-slow learning algorithm to mitigate this shortfall, but its scalability to complex, image-computable tasks was unclear. Here, we propose the Wisconsin Neural Network (WiNN), which expands on fast-and-slow learning for real-world tasks demanding flexible rule-based behavior. WiNN employs a pretrained convolutional neural network for vision, coupled with an adjustable “context state” that guides attention to relevant features. If WiNN produces an incorrect response, it first iteratively updates its context state to refocus attention on task-relevant cues, then performs minimal parameter updates to attention and readout layers. This strategy preserves generalizable representations in the sensory network, reducing catastrophic forgetting. We evaluate WiNN on an image-based extension of the Wisconsin Card Sorting Task, revealing several markers of cognitive flexibility: (i) WiNN autonomously infers underlying rules, (ii) requires fewer examples to do so than control models reliant on large-scale parameter updates, (iii) can perform context-based rule inference solely via context-state adjustments, further enhanced by slow updates of attention and readout parameters, and (iv) generalizes to unseen compositional rules through context-state inference alone. By blending fast context inference with targeted attentional guidance, WiNN achieves “sparks” of flexibility. This approach offers a path toward context-sensitive models that retain knowledge while rapidly adapting to complex, rule-based tasks.

[LG-42] Context-Aware Doubly-Robust Semi-Supervised Learning

Link: https://arxiv.org/abs/2502.15577
Authors: Clement Ruah, Houssem Sifaou, Osvaldo Simeone, Bashir Al-Hashimi
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:The widespread adoption of artificial intelligence (AI) in next-generation communication systems is challenged by the heterogeneity of traffic and network conditions, which call for the use of highly contextual, site-specific, data. A promising solution is to rely not only on real-world data, but also on synthetic pseudo-data generated by a network digital twin (NDT). However, the effectiveness of this approach hinges on the accuracy of the NDT, which can vary widely across different contexts. To address this problem, this paper introduces context-aware doubly-robust (CDR) learning, a novel semi-supervised scheme that adapts its reliance on the pseudo-data to the different levels of fidelity of the NDT across contexts. CDR is evaluated on the task of downlink beamforming, showing superior performance compared to previous state-of-the-art semi-supervised approaches.

[LG-43] Generalization Guarantees for Representation Learning via Data-Dependent Gaussian Mixture Priors ICLR2025

Link: https://arxiv.org/abs/2502.15540
Authors: Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Accepted as a Spotlight Paper at ICLR 2025

Abstract: We establish in-expectation and tail bounds on the generalization error of representation learning type algorithms. The bounds are in terms of the relative entropy between the distribution of the representations extracted from the training and “test” datasets and a data-dependent symmetric prior, i.e., the Minimum Description Length (MDL) of the latent variables for the training and test datasets. Our bounds are shown to reflect the “structure” and “simplicity” of the encoder and significantly improve upon the few existing ones for the studied model. We then use our in-expectation bound to devise a suitable data-dependent regularizer; and we investigate thoroughly the important question of the selection of the prior. We propose a systematic approach to simultaneously learning a data-dependent Gaussian mixture prior and using it as a regularizer. Interestingly, we show that a weighted attention mechanism emerges naturally in this procedure. Our experiments show that our approach outperforms the now popular Variational Information Bottleneck (VIB) method as well as the recent Category-Dependent VIB (CDVIB).

[LG-44] Sheaf theory: from deep geometry to deep learning

Link: https://arxiv.org/abs/2502.15476
Authors: Anton Ayzenberg, Thomas Gebhart, German Magai, Grigory Solomadin
Categories: Algebraic Topology (math.AT); Computational Geometry (cs.CG); Machine Learning (cs.LG); K-Theory and Homology (math.KT)
Comments: 117 pages, 8 figures

Abstract:This paper provides an overview of the applications of sheaf theory in deep learning, data science, and computer science in general. The primary text of this work serves as a friendly introduction to applied and computational sheaf theory accessible to those with modest mathematical familiarity. We describe intuitions and motivations underlying sheaf theory shared by both theoretical researchers and practitioners, bridging classical mathematical theory and its more recent implementations within signal processing and deep learning. We observe that most notions commonly considered specific to cellular sheaves translate to sheaves on arbitrary posets, providing an interesting avenue for further generalization of these methods in applications, and we present a new algorithm to compute sheaf cohomology on arbitrary finite posets in response. By integrating classical theory with recent applications, this work reveals certain blind spots in current machine learning practices. We conclude with a list of problems related to sheaf-theoretic applications that we find mathematically insightful and practically instructive to solve. To ensure the exposition of sheaf theory is self-contained, a rigorous mathematical introduction is provided in appendices which moves from an introduction of diagrams and sheaves to the definition of derived functors, higher order cohomology, sheaf Laplacians, sheaf diffusion, and interconnections of these subjects therein.

[LG-45] Dimension-free bounds in high-dimensional linear regression via error-in-operator approach

Link: https://arxiv.org/abs/2502.15437
Authors: Fedor Noskov, Nikita Puchkin, Vladimir Spokoiny
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 100 pages

Abstract:We consider a problem of high-dimensional linear regression with random design. We suggest a novel approach referred to as error-in-operator which does not estimate the design covariance \Sigma directly but incorporates it into empirical risk minimization. We provide an expansion of the excess prediction risk and derive non-asymptotic dimension-free bounds on the leading term and the remainder. This helps us to show that auxiliary variables do not increase the effective dimension of the problem, provided that parameters of the procedure are tuned properly. We also discuss computational aspects of our method and illustrate its performance with numerical experiments.

[LG-46] Fréchet Cumulative Covariance Net for Deep Nonlinear Sufficient Dimension Reduction with Random Objects

Link: https://arxiv.org/abs/2502.15374
Authors: Hang Yuan, Christina Dan Wang, Zhou Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: Nonlinear sufficient dimension reduction (SDR), which constructs nonlinear low-dimensional representations to summarize essential features of high-dimensional data, is an important branch of representation learning. However, most existing methods are not applicable when the response variables are complex non-Euclidean random objects, which are frequently encountered in many recent statistical applications. In this paper, we introduce a new statistical dependence measure termed Fréchet Cumulative Covariance (FCCov) and develop a novel nonlinear SDR framework based on FCCov. Our approach is not only applicable to complex non-Euclidean data, but also exhibits robustness against outliers. We further incorporate Feedforward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs) to estimate nonlinear sufficient directions at the sample level. Theoretically, we prove that our method with squared Frobenius norm regularization achieves unbiasedness at the σ-field level. Furthermore, we establish non-asymptotic convergence rates for our estimators based on FNNs and ResNet-type CNNs, which match the minimax rate of nonparametric regression up to logarithmic factors. Intensive simulation studies verify the performance of our methods in both Euclidean and non-Euclidean settings. We apply our method to facial expression recognition datasets and the results underscore more realistic and broader applicability of our proposal.

[LG-47] Drug-Target Interaction/Affinity Prediction: Deep Learning Models and Advances Review

Link: https://arxiv.org/abs/2502.15346
Authors: Ali Vefghi, Zahed Rahmati, Mohammad Akbari
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 64 pages, 7 figures, 10 tables

Abstract: Drug discovery remains a slow and expensive process that involves many steps, from detecting the target structure to obtaining approval from the Food and Drug Administration (FDA), and is often riddled with safety concerns. Accurate prediction of how drugs interact with their targets and the development of new drugs by using better methods and technologies have immense potential to speed up this process, ultimately leading to faster delivery of life-saving medications. Traditional methods used for drug-target interaction prediction show limitations, particularly in capturing complex relationships between drugs and their targets. As a result, deep learning models have been proposed to overcome the challenges of interaction prediction, offering more precise and efficient results. By outlining promising research avenues and models, each offering a different solution to a similar problem, this paper aims to give researchers a better idea of methods for even more accurate and efficient prediction of drug-target interaction, ultimately accelerating the development of more effective drugs. A total of 180 prediction methods for drug-target interactions, published between 2016 and 2025, were analyzed using different frameworks based on machine learning, mainly deep learning and graph neural networks. Additionally, this paper discusses the novelty, architecture, and input representation of these models.

[LG-48] Utilizing Sequential Information of General Lab-test Results and Diagnoses History for Differential Diagnosis of Dementia

Link: https://arxiv.org/abs/2502.15317
Authors: Yizong Xing, Dhita Putri Pratama, Yuke Wang, Yufan Zhang, Brian E. Chapman
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 7 pages, 6 figures. This work has been submitted to the IEEE for possible publication

Abstract: Early diagnosis of Alzheimer’s Disease (AD) faces multiple data-related challenges, including high variability in patient data, limited access to specialized diagnostic tests, and overreliance on single-type indicators. These challenges are exacerbated by the progressive nature of AD, where subtle pathophysiological changes often precede clinical symptoms by decades. To address these limitations, this study proposes a novel approach that takes advantage of routinely collected general laboratory test histories for the early detection and differential diagnosis of AD. By modeling lab test sequences as “sentences”, we apply word embedding techniques to capture latent relationships between tests and employ deep time series models, including long short-term memory (LSTM) and Transformer networks, to model temporal patterns in patient records. Experimental results demonstrate that our approach improves diagnostic accuracy and enables scalable and cost-effective AD screening in diverse clinical settings.
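The "lab tests as sentences" idea can be sketched as a small preprocessing step. Everything below is an illustrative assumption rather than the authors' pipeline: the record layout `(patient_id, date, test_code, flag)`, the test codes, and the token format are hypothetical. The function groups results by patient and orders them in time, producing one token list per patient.

```python
from collections import defaultdict

def build_lab_sentences(records):
    """Turn (patient_id, iso_date, test_code, flag) rows into per-patient token lists.

    Tokens are ordered by date (ISO dates sort correctly as strings), so each
    list reads like a sentence describing the patient's lab history.
    """
    by_patient = defaultdict(list)
    for patient_id, date, test_code, flag in records:
        by_patient[patient_id].append((date, f"{test_code}_{flag}"))
    return {
        pid: [token for _, token in sorted(items)]
        for pid, items in by_patient.items()
    }

# Hypothetical records; codes and flags are made up for illustration.
records = [
    ("p1", "2021-03-01", "GLU", "high"),
    ("p1", "2020-11-15", "HGB", "normal"),
    ("p2", "2022-01-10", "TSH", "low"),
    ("p1", "2021-03-01", "ALB", "low"),
]

sentences = build_lab_sentences(records)
print(sentences["p1"])
```

Token lists of this shape are what a word-embedding trainer such as word2vec expects as input, and the same ordering can feed an LSTM or Transformer after mapping tokens to indices.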

[LG-49] A Data-Driven Real-Time Optimal Power Flow Algorithm Using Local Feedback

Link: https://arxiv.org/abs/2502.15306
Authors: Heng Liang, Yujin Huang, Changhong Zhao
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:The increasing penetration of distributed energy resources (DERs) adds variability as well as fast control capabilities to power networks. Dispatching the DERs based on local information to provide real-time optimal network operation is the desideratum. In this paper, we propose a data-driven real-time algorithm that uses only the local measurements to solve time-varying AC optimal power flow (OPF). Specifically, we design a learnable function that takes the local feedback as input in the algorithm. The learnable function, under certain conditions, will result in a unique stationary point of the algorithm, which in turn transfers the OPF problems to be optimized over the parameters of the function. We then develop a stochastic primal-dual update to solve the variant of the OPF problems based on a deep neural network (DNN) parametrization of the learnable function, which is referred to as the training stage. We also design a gradient-free alternative to bypass the cumbersome gradient calculation of the nonlinear power flow model. The OPF solution-tracking error bound is established in the sense of universal approximation of DNN. Numerical results on the IEEE 37-bus test feeder show that the proposed method can track the time-varying OPF solutions with higher accuracy and faster computation compared to benchmark methods.

[LG-50] Comparative Analysis of Black Hole Mass Estimation in Type-2 AGNs: Classical vs. Quantum Machine Learning and Deep Learning Approaches

Link: https://arxiv.org/abs/2502.15297
Authors: Sathwik Narkedimilli, Venkata Sriram Amballa, N V Saran Kumar, R Arun Kumar, R Praneeth Reddy, Satvik Raghav, Manish M, Aswath Babu H
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 29 pages, 12 Figures, 6 Tables

Abstract:In the case of Type-2 AGNs, estimating the mass of the black hole is challenging. Understanding how galaxies form and evolve requires considerable insight into the mass of black holes. This work compared different classical and quantum machine learning (QML) algorithms for black hole mass estimation, wherein the classical algorithms are Linear Regression, XGBoost Regression, Random Forest Regressor, Support Vector Regressor (SVR), Lasso Regression, Ridge Regression, Elastic Net Regression, Bayesian Regression, Decision Tree Regressor, Gradient Booster Regressor, Classical Neural Networks, Gated Recurrent Unit (GRU), LSTM, Deep Residual Networks (ResNets) and Transformer-Based Regression. On the other hand, quantum algorithms, including Hybrid Quantum Neural Networks (QNN), Quantum Long Short-Term Memory (Q-LSTM), Sampler-QNN, Estimator-QNN, Variational Quantum Regressor (VQR), Quantum Linear Regression (Q-LR), and QML with JAX optimization, were also tested. The results revealed that classical algorithms gave better $R^2$, MAE, MSE, and RMSE results than the quantum models. Among the classical models, LSTM has the best result with an accuracy of 99.77%. Estimator-QNN has the highest accuracy for quantum algorithms with an MSE of 0.0124 and an accuracy of 99.75%. This study ascertains both the strengths and weaknesses of the classical and the quantum approaches. As far as our knowledge goes, this work could pave the way for the future application of quantum algorithms in astrophysical data analysis.

[LG-51] Tensor Product Neural Networks for Functional ANOVA Model

Link: https://arxiv.org/abs/2502.15215
Authors: Seokhun Park, Insung Kong, Yongchan Choi, Chanmoo Park, Yongdai Kim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 45 pages

Abstract:Interpretability for machine learning models is becoming more and more important as machine learning models become more complex. The functional ANOVA model, which decomposes a high-dimensional function into a sum of lower-dimensional functions, so-called components, is one of the most popular tools for interpretable AI, and recently, various neural network models have been developed for estimating each component in the functional ANOVA model. However, such neural networks are highly unstable when estimating components since the components themselves are not uniquely defined. That is, there are multiple functional ANOVA decompositions for a given function. In this paper, we propose a novel interpretable model which guarantees a unique functional ANOVA decomposition and thus is able to estimate each component stably. We call our proposed model ANOVA-NODE since it is a modification of Neural Oblivious Decision Ensembles (NODE) for the functional ANOVA model. Theoretically, we prove that ANOVA-NODE can approximate a smooth function well. Additionally, we experimentally show that ANOVA-NODE provides much more stable estimation of each component, and thus much more stable interpretation, than existing neural network models do when training data and initial values of the model parameters vary.
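
For readers unfamiliar with the functional ANOVA model, the decomposition and its non-uniqueness can be written out as follows. This is the textbook form; the zero-mean constraint in the last line is one common identifiability condition and is shown only as an assumption, since the abstract does not state which constraint ANOVA-NODE enforces. The symbols $f_0$, $f_j$, $f_{jk}$, $\mu_j$ and $c$ are notation introduced here.

```latex
\begin{align}
  % Functional ANOVA decomposition of f into main effects and interactions
  f(x_1,\dots,x_d) &= f_0 + \sum_{j=1}^{d} f_j(x_j)
      + \sum_{1 \le j < k \le d} f_{jk}(x_j, x_k) + \cdots
      + f_{1 \cdots d}(x_1,\dots,x_d) \\
  % Non-uniqueness: shifting a constant c between components leaves f unchanged
  f_0 + f_1(x_1) &= (f_0 - c) + \bigl(f_1(x_1) + c\bigr)
      \quad \text{for any constant } c \\
  % One common identifiability constraint (assumed here for illustration)
  \int f_S(x_S)\, d\mu_j(x_j) &= 0 \quad \text{for every } j \in S
\end{align}
```

The second line is why unconstrained estimators can return very different components for the same fitted function.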

[LG-52] Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling

Link: https://arxiv.org/abs/2502.15131
Authors: Yufan Li, Pragya Sur
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:We study the fundamental problem of calibrating a linear binary classifier of the form $\sigma(\hat{w}^\top x)$, where the feature vector $x$ is Gaussian, $\sigma$ is a link function, and $\hat{w}$ is an estimator of the true linear weight $w_\star$. By interpolating with a noninformative chance classifier, we construct a well-calibrated predictor whose interpolation weight depends on the angle $\angle(\hat{w}, w_\star)$ between the estimator $\hat{w}$ and the true linear weight $w_\star$. We establish that this angular calibration approach is provably well-calibrated in a high-dimensional regime where the number of samples and features both diverge, at a comparable rate. The angle $\angle(\hat{w}, w_\star)$ can be consistently estimated. Furthermore, the resulting predictor is uniquely Bregman-optimal, minimizing the Bregman divergence to the true label distribution within a suitable class of calibrated predictors. Our work is the first to provide a calibration strategy that satisfies both calibration and optimality properties provably in high dimensions. Additionally, we identify conditions under which a classical Platt-scaling predictor converges to our Bregman-optimal calibrated solution. Thus, Platt-scaling also inherits these desirable properties provably in high dimensions.
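
As background for the Platt-scaling predictor mentioned above, here is a minimal sketch of classical Platt scaling: fit two scalars `a` and `b` so that `sigmoid(a * score + b)` minimizes the log loss on held-out scores. It does not implement the paper's angular calibration. The function names, the plain gradient-descent fit, and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(scores, labels, a, b):
    """Mean negative log-likelihood of labels under sigmoid(a * s + b)."""
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(sigmoid(a * s + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit (a, b) by full-batch gradient descent on the log loss.

    Starts from a=1, b=0, i.e. the uncalibrated sigmoid of the raw score.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy, hypothetical data: raw scores are overconfident (two confident mistakes).
scores = [-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
print("a =", round(a, 3), "b =", round(b, 3))
print("log loss before:", log_loss(scores, labels, 1.0, 0.0))
print("log loss after: ", log_loss(scores, labels, a, b))
```

On this toy set the fitted slope shrinks below 1, which tempers the overconfident raw scores.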

[LG-53] Do we really need the Rademacher complexities?

Link: https://arxiv.org/abs/2502.15118
Authors: Daniel Bartl, Shahar Mendelson
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study the fundamental problem of learning with respect to the squared loss in a convex class. The state-of-the-art sample complexity estimates in this setting rely on Rademacher complexities, which are generally difficult to control. We prove that, contrary to prevailing belief and under minimal assumptions, the sample complexity is not governed by the Rademacher complexities but rather by the behaviour of the limiting Gaussian process. In particular, all such learning problems that have the same $L_2$-structure (even those with heavy-tailed distributions) share the same sample complexity. This constitutes the first universality result for general convex learning problems. The proof is based on a novel learning procedure, and its performance is studied by combining optimal mean estimation techniques for real-valued random variables with Talagrand’s generic chaining method.

[LG-54] Variational phylogenetic inference with products over bipartitions

Link: https://arxiv.org/abs/2502.15110
Authors: Evan Sidrow, Alexandre Bouchard-Côté, Lloyd T. Elliott
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 20 pages, 5 figures

Abstract:Bayesian phylogenetics requires accurate and efficient approximation of posterior distributions over trees. In this work, we develop a variational Bayesian approach for ultrametric phylogenetic trees. We present a novel variational family based on coalescent times of a single-linkage clustering and derive a closed-form density of the resulting distribution over trees. Unlike existing methods for ultrametric trees, our method performs inference over all of tree space, it does not require any Markov chain Monte Carlo subroutines, and our variational family is differentiable. Through experiments on benchmark genomic datasets and an application to SARS-CoV-2, we demonstrate that our method achieves competitive accuracy while requiring significantly fewer gradient evaluations than existing state-of-the-art techniques.

[LG-55] Forecasting Local Ionospheric Parameters Using Transformers

Link: https://arxiv.org/abs/2502.15093
Authors: Daniel J. Alford-Lago, Christopher W. Curtis, Alexander T. Ihler, Katherine A. Zawdie, Douglas P. Drob
Categories: Space Physics (physics.space-ph); Machine Learning (cs.LG)
Comments: 47 pages, 42 figures

Abstract:We present a novel method for forecasting key ionospheric parameters using transformer-based neural networks. The model provides accurate forecasts and uncertainty quantification of the F2-layer peak plasma frequency (foF2), the F2-layer peak density height (hmF2), and total electron content (TEC) for a given geographic location. It supports a number of exogenous variables, including F10.7cm solar flux and disturbance storm time (Dst). We demonstrate how transformers can be trained in a data assimilation-like fashion that uses these exogenous variables along with naïve predictions from climatology to generate 24-hour forecasts with non-parametric uncertainty bounds. We call this method the Local Ionospheric Forecast Transformer (LIFT). We demonstrate that the trained model can generalize to new geographic locations and time periods not seen during training, and we compare its performance to that of the International Reference Ionosphere (IRI).

[LG-56] Modifying Final Splits of Classification Tree for Fine-tuning Subpopulation Target in Policy Making

Link: https://arxiv.org/abs/2502.15072
Authors: Lei Bill Wang, Zhenbang Jiao, Fangyi Wang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)

Abstract:Policymakers often use Classification and Regression Trees (CART) to partition populations based on binary outcomes and target subpopulations whose probability of the binary event exceeds a threshold. However, classic CART and the knowledge distillation method whose student model is a CART (referred to as KD-CART) do not minimize the misclassification risk associated with classifying the latent probabilities of these binary events. To reduce the misclassification risk, we propose two methods, Penalized Final Split (PFS) and Maximizing Distance Final Split (MDFS). PFS incorporates a tunable penalty into the standard CART splitting criterion function. MDFS maximizes a weighted sum of distances between node means and the threshold. It can point-identify the optimal split under the unique intersect latent probability assumption. In addition, we develop a theoretical result for MDFS splitting rule estimation, which has zero asymptotic risk. Through extensive simulation studies, we demonstrate that these methods predominantly outperform classic CART and KD-CART in terms of misclassification error. Furthermore, in our empirical evaluations, these methods provide deeper insights than the two baseline methods.

[LG-57] An Interpretable Machine Learning Approach to Understanding the Relationships between Solar Flares and Source Active Regions

Link: https://arxiv.org/abs/2502.15066
Authors: Huseyin Cavus, Jason T. L. Wang, Teja P. S. Singampalli, Gani Caglar Coban, Hongyang Zhang, Abd-ur Raheem, Haimin Wang
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG); Space Physics (physics.space-ph)
Comments: 16 pages, 9 figures

Abstract:Solar flares are defined as outbursts on the surface of the Sun. They occur when energy accumulated in magnetic fields enclosing solar active regions (ARs) is abruptly expelled. Solar flares and associated coronal mass ejections are sources of space weather that adversely impact devices at or near Earth, including the obstruction of high-frequency radio waves utilized for communication and the deterioration of power grid operations. Tracking and delivering early and precise predictions of solar flares is essential for readiness and catastrophe risk mitigation. This paper employs the random forest (RF) model to address the binary classification task, analyzing the links between solar flares and their originating ARs with observational data gathered from 2011 to 2021 by this http URL and the XRT flare database. We seek to identify the physical features of a source AR that significantly influence its potential to trigger ≥C-class flares. We found that AR_Type_Today and Hale_Class_Yesterday are the most and the least influential features, respectively. NoS_Difference has a remarkable effect in decision-making in both global and local interpretations.

[LG-58] Symmetric observations without symmetric causal explanations WWW

Link: https://arxiv.org/abs/2502.14950
Authors: Christian William, Patrick Remy, Jean-Daniel Bancal, Yu Cai, Nicolas Brunner, Alejandro Pozas-Kerstjens
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 8 pages, 4 figures, RevTeX 4.2. The computational appendix is available at this https URL

Abstract:Inferring causal models from observed correlations is a challenging task, crucial to many areas of science. In order to alleviate the effort, it is important to know whether symmetries in the observations correspond to symmetries in the underlying realization. Via an explicit example, we answer this question in the negative. We use a tripartite probability distribution over binary events that is realized by using three (different) independent sources of classical randomness. We prove that even removing the condition that the sources distribute systems described by classical physics, the requirements that i) the sources distribute the same physical systems, ii) these physical systems respect relativistic causality, and iii) the correlations are the observed ones, are incompatible.

[LG-59] Optimizing Gene-Based Testing for Antibiotic Resistance Prediction AAAI-25

Link: https://arxiv.org/abs/2502.14919
Authors: David Hagerman, Anna Johnning, Roman Naeem, Fredrik Kahl, Erik Kristiansson, Lennart Svensson
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: Accepted to AAAI-25 AISI

Abstract:Antibiotic Resistance (AR) is a critical global health challenge that necessitates the development of cost-effective, efficient, and accurate diagnostic tools. Given the genetic basis of AR, techniques such as Polymerase Chain Reaction (PCR) that target specific resistance genes offer a promising approach for predictive diagnostics using a limited set of key genes. This study introduces GenoARM, a novel framework that integrates reinforcement learning (RL) with transformer-based models to optimize the selection of PCR gene tests and improve AR predictions, leveraging observed metadata for improved accuracy. In our evaluation, we developed several high-performing baselines and compared them using publicly available datasets derived from real-world bacterial samples representing multiple clinically relevant pathogens. The results show that all evaluated methods achieve strong and reliable performance when metadata is not utilized. When metadata is introduced and the number of selected genes increases, GenoARM demonstrates superior performance due to its capacity to approximate rewards for unseen and sparse combinations. Overall, our framework represents a major advancement in optimizing diagnostic tools for AR in clinical settings.

[LG-60] Towards an automated workflow in materials science for combining multi-modal simulative and experimental information using data mining and large language models

Link: https://arxiv.org/abs/2502.14904
Authors: Balduin Katzer, Steffen Klinder, Katrin Schulz
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:To retrieve and compare scientific data of simulations and experiments in materials science, data needs to be easily accessible and machine readable to qualify and quantify various materials science phenomena. The recent progress in open science improves the accessibility of data. However, a majority of information is encoded within scientific documents, limiting the capability of finding suitable literature as well as material properties. This manuscript showcases an automated workflow, which unravels the encoded information from scientific literature into a machine-readable data structure of texts, figures, tables, equations and meta-data, using natural language processing and language as well as vision transformer models to generate a machine-readable database. The machine-readable database can be enriched with local data, such as unpublished or private material data, leading to knowledge synthesis. The study shows that such an automated workflow accelerates information retrieval, proximate context detection and material property extraction from multi-modal input data, shown by example for the research field of microstructural analyses of face-centered cubic single crystals. Ultimately, a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) enables a fast and efficient question answering chat bot.

[LG-61] Applications of Random Matrix Theory in Machine Learning and Brain Mapping

Link: https://arxiv.org/abs/2502.14878
Authors: Katrina Lawrence
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Probability (math.PR)

Abstract:Brain mapping analyzes the wavelengths of brain signals and outputs them in a map, which is then analyzed by a radiologist. Introducing Machine Learning (ML) into the brain mapping process reduces the variable of human error in reading such maps and increases efficiency. A key area of interest is determining the correlation between the functional areas of the brain on a voxel (3-dimensional pixel) wise basis. This leads to determining how a brain is functioning and can be used to detect diseases, disabilities, and sicknesses. As such, random noise presents a challenge in consistently determining the actual signals from the scan. This paper discusses how an algorithm created by Random Matrix Theory (RMT) can be used as a tool for ML, as it detects the correlation of the functional areas of the brain. Random matrices are simulated to represent the voxel signal intensity strength for each time interval where a stimulus is presented in an fMRI scan. Using the Marchenko-Pastur law for Wishart Matrices, a result of RMT, it was found that no matter what type of noise was added to the random matrices, the observed eigenvalue distribution of the Wishart Matrices would converge to the theoretical distribution. This means that RMT is robust and has a high test-re-test reliability. These results further indicate that a strong correlation exists between the eigenvalues, and hence the functional regions of the brain. Any eigenvalue that differs significantly from those predicted from RMT may indicate the discovery of a new discrete brain network.
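
The Marchenko-Pastur law referenced above gives the limiting eigenvalue density of a Wishart-type sample covariance matrix. The following sketch evaluates that density and its support edges. It assumes a p x n data matrix with i.i.d. entries of variance `sigma2`, aspect ratio `gamma = p / n` with 0 < gamma < 1, and covariance S = X Xᵀ / n. These conventions and all function names are assumptions, since the abstract does not give the paper's setup.

```python
import math


def mp_support(gamma: float, sigma2: float = 1.0):
    """Lower and upper edges of the Marchenko-Pastur bulk."""
    lower = sigma2 * (1.0 - math.sqrt(gamma)) ** 2
    upper = sigma2 * (1.0 + math.sqrt(gamma)) ** 2
    return lower, upper


def mp_density(x: float, gamma: float, sigma2: float = 1.0) -> float:
    """Marchenko-Pastur density at x (zero outside the bulk), for 0 < gamma < 1."""
    lower, upper = mp_support(gamma, sigma2)
    if x <= lower or x >= upper:
        return 0.0
    return math.sqrt((upper - x) * (x - lower)) / (2.0 * math.pi * sigma2 * gamma * x)


def integrate(f, lo: float, hi: float, n: int = 20000) -> float:
    """Midpoint-rule quadrature, sufficient for a sanity check."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h


def is_outlier(eigenvalue: float, gamma: float, sigma2: float = 1.0) -> bool:
    """Flag an eigenvalue that falls outside the theoretical bulk."""
    lower, upper = mp_support(gamma, sigma2)
    return eigenvalue < lower or eigenvalue > upper


gamma = 0.5  # hypothetical aspect ratio p / n
lower, upper = mp_support(gamma)
print("bulk edges:", lower, upper)
print("total mass:", integrate(lambda x: mp_density(x, gamma), lower, upper))
print("eigenvalue 4.0 outside bulk:", is_outlier(4.0, gamma))
```

Under these assumptions, an observed eigenvalue above the upper edge is the kind of deviation the abstract associates with genuine signal rather than noise.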

[LG-62] Exploring Quantum Control Landscape and Solution Space Complexity through Dimensionality Reduction Optimization Algorithms

Link: https://arxiv.org/abs/2502.11905
Authors: Haftu W. Fentaw, Steve Campbell, Simon Caton
Categories: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:Understanding the quantum control landscape (QCL) is important for designing effective quantum control strategies. In this study, we analyze the QCL for a single two-level quantum system (qubit) using various control strategies. We employ Principal Component Analysis (PCA) to visualize and analyze the QCL for higher dimensional control parameters. Our results indicate that dimensionality reduction techniques such as PCA can play an important role in understanding the complex nature of quantum control in higher dimensions. Evaluations of traditional control techniques and machine learning algorithms reveal that Genetic Algorithms (GA) outperform Stochastic Gradient Descent (SGD), while Q-learning (QL) shows great promise compared to Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO). Additionally, our experiments highlight the importance of reward function design in DQN and PPO, demonstrating that immediate rewards yield better performance than delayed rewards for systems with short time steps. A study of solution space complexity was conducted by using Cluster Density Index (CDI) as a key metric for analyzing the density of optimal solutions in the landscape. The CDI reflects cluster quality and helps determine whether a given algorithm generates regions of high fidelity or not. Our results provide insights into effective quantum control strategies, emphasizing the significance of parameter selection and algorithm optimization.

[LG-63] Reconstruction of frequency-localized functions from pointwise samples via least squares and deep learning

Link: https://arxiv.org/abs/2502.09794
Authors: A. Martina Neuman, Andres Felipe Lerma Pineda, Jason J. Bramburger, Simone Brugiapaglia
Categories: Classical Analysis and ODEs (math.CA); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Recovering frequency-localized functions from pointwise data is a fundamental task in signal processing. We examine this problem from an approximation-theoretic perspective, focusing on least squares and deep learning-based methods. First, we establish a novel recovery theorem for least squares approximations using the Slepian basis from uniform random samples in low dimensions, explicitly tracking the dependence of the bandwidth on the sampling complexity. Building on these results, we then present a recovery guarantee for approximating bandlimited functions via deep learning from pointwise data. This result, framed as a practical existence theorem, provides conditions on the network architecture, training procedure, and data acquisition sufficient for accurate approximation. To complement our theoretical findings, we perform numerical comparisons between least squares and deep learning for approximating one- and two-dimensional functions. We conclude with a discussion of the theoretical limitations and the practical gaps between theory and implementation.

Information Retrieval

[IR-0] Cross-Format Retrieval-Augmented Generation in XR with LLMs for Context-Aware Maintenance Assistance

Link: https://arxiv.org/abs/2502.15604
Authors: Akos Nagy, Yannis Spyridis, Vasileios Argyriou
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Abstract:This paper presents a detailed evaluation of a Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) to enhance information retrieval and instruction generation for maintenance personnel across diverse data formats. We assessed the performance of eight LLMs, emphasizing key metrics such as response speed and accuracy, which were quantified using BLEU and METEOR scores. Our findings reveal that advanced models like GPT-4 and GPT-4o-mini significantly outperform their counterparts, particularly when addressing complex queries requiring multi-format data integration. The results validate the system’s ability to deliver timely and accurate responses, highlighting the potential of RAG frameworks to optimize maintenance operations. Future research will focus on refining retrieval techniques for these models and enhancing response generation, particularly for intricate scenarios, ultimately improving the system’s practical applicability in dynamic real-world environments.

[IR-1] Scaling Sparse and Dense Retrieval in Decoder-Only LLMs

Link: https://arxiv.org/abs/2502.15526
Authors: Hansi Zeng, Julian Killingback, Hamed Zamani
Categories: Information Retrieval (cs.IR)

Abstract:Scaling large language models (LLMs) has shown great potential for improving retrieval model performance; however, previous studies have mainly focused on dense retrieval trained with contrastive loss (CL), neglecting the scaling behavior of other retrieval paradigms and optimization techniques, such as sparse retrieval and knowledge distillation (KD). In this work, we conduct a systematic comparative study on how different retrieval paradigms (sparse vs. dense) and fine-tuning objectives (CL vs. KD vs. their combination) affect retrieval performance across different model scales. Using MSMARCO passages as the training dataset, decoder-only LLMs (Llama-3 series: 1B, 3B, 8B), and a fixed compute budget, we evaluate various training configurations on both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks. Our key findings reveal that: (1) Scaling behaviors emerge clearly only with CL, where larger models achieve significant performance gains, whereas KD-trained models show minimal improvement, performing similarly across the 1B, 3B, and 8B scales. (2) Sparse retrieval models consistently outperform dense retrieval across both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks, and they demonstrate greater robustness to imperfect supervised signals. (3) We successfully scale sparse retrieval models with the combination of CL and KD losses at 8B scale, achieving state-of-the-art (SOTA) results in all evaluation sets.

[IR-2] A Universal Framework for Compressing Embeddings in CTR Prediction DASFAA2025

Link: https://arxiv.org/abs/2502.15355
Authors: Kefan Wang, Hao Wang, Kenan Song, Wei Guo, Kai Cheng, Zhi Li, Yong Liu, Defu Lian, Enhong Chen
Categories: Information Retrieval (cs.IR)
Comments: Accepted by DASFAA2025

Abstract:Accurate click-through rate (CTR) prediction is vital for online advertising and recommendation systems. Recent deep learning advancements have improved the ability to capture feature interactions and understand user interests. However, optimizing the embedding layer often remains overlooked. Embedding tables, which represent categorical and sequential features, can become excessively large, surpassing GPU memory limits and necessitating storage in CPU memory. This results in high memory consumption and increased latency due to frequent GPU-CPU data transfers. To tackle these challenges, we introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings, without sacrificing recommendation quality. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Then, we integrate a contrastive learning mechanism to ensure a uniform distribution of quantized codes, enhancing the distinctiveness of embeddings. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance compared to existing models. The implementation code is accessible in our project repository this https URL.

[IR-3] From Documents to Dialogue: Building KG-RAG Enhanced AI Assistants

Link: https://arxiv.org/abs/2502.15237
Authors: Manisha Mukherjee, Sungchul Kim, Xiang Chen, Dan Luo, Tong Yu, Tung Mai
Categories: Information Retrieval (cs.IR)

Abstract:The Adobe Experience Platform AI Assistant is a conversational tool that enables organizations to interact seamlessly with proprietary enterprise data through a chatbot. However, due to access restrictions, Large Language Models (LLMs) cannot retrieve these internal documents, limiting their ability to generate accurate zero-shot responses. To overcome this limitation, we use a Retrieval-Augmented Generation (RAG) framework powered by a Knowledge Graph (KG) to retrieve relevant information from external knowledge sources, enabling LLMs to answer questions over private or previously unseen document collections. In this paper, we propose a novel approach for building a high-quality, low-noise KG. We apply several techniques, including incremental entity resolution using seed concepts, similarity-based filtering to deduplicate entries, assigning confidence scores to entity-relation pairs to filter for high-confidence pairs, and linking facts to source documents for provenance. Our KG-RAG system retrieves relevant tuples, which are added to the user prompt’s context before it is sent to the LLM that generates the response. Our evaluation demonstrates that this approach significantly enhances response relevance, reducing irrelevant answers by over 50% and increasing fully relevant answers by 88% compared to the existing production system.
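
The filtering and prompt-augmentation steps described above can be sketched roughly as follows. This is not the authors' pipeline: the `Triple` record, the 0.8 confidence threshold, and the prompt layout are all hypothetical, and exact matching on normalized text stands in for the paper's similarity-based deduplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical KG fact with a confidence score and a provenance link."""
    subject: str
    relation: str
    obj: str
    confidence: float
    source: str


def filter_triples(triples, min_confidence=0.8):
    """Keep high-confidence triples and drop duplicates.

    Duplicates are detected by exact match after lowercasing and trimming,
    a simplification of similarity-based filtering. When duplicates collide,
    the highest-confidence copy is kept.
    """
    best = {}
    for t in triples:
        if t.confidence < min_confidence:
            continue
        key = (t.subject.strip().lower(), t.relation.strip().lower(), t.obj.strip().lower())
        if key not in best or t.confidence > best[key].confidence:
            best[key] = t
    return list(best.values())


def build_prompt(question: str, triples) -> str:
    """Prepend retrieved facts (with their sources) to the user question."""
    facts = "\n".join(
        f"- {t.subject} {t.relation} {t.obj} [source: {t.source}]" for t in triples
    )
    return f"Context facts:\n{facts}\n\nQuestion: {question}"


# Hypothetical example data.
raw = [
    Triple("Widget", "is part of", "Toolkit", 0.95, "doc1"),
    Triple("widget ", "is part of", "toolkit", 0.85, "doc2"),  # near-duplicate
    Triple("Gadget", "depends on", "Widget", 0.40, "doc3"),    # low confidence
]
kept = filter_triples(raw)
print(build_prompt("What is a Widget part of?", kept))
```

Carrying the `source` field through to the prompt mirrors the provenance linking the abstract mentions.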

[IR-4] GNN-Coder: Boosting Semantic Code Retrieval with Combined GNNs and Transformer

Link: https://arxiv.org/abs/2502.15202
Authors: Yufan Ye, Pu Pang, Ting Zhang, Hua Huang
Categories: Information Retrieval (cs.IR); Software Engineering (cs.SE)

Abstract:Code retrieval is a crucial component in modern software development, particularly in large-scale projects. However, existing approaches relying on sequence-based models often fail to fully exploit the structural dependencies inherent in code, leading to suboptimal retrieval performance, particularly with structurally complex code fragments. In this paper, we introduce GNN-Coder, a novel framework based on Graph Neural Network (GNN) to utilize Abstract Syntax Tree (AST). We make the first attempt to study how GNN-integrated Transformer can promote the development of semantic retrieval tasks by capturing the structural and semantic features of code. We further propose an innovative graph pooling method tailored for AST, utilizing the number of child nodes as a key feature to highlight the intrinsic topological relationships within the AST. This design effectively integrates both sequential and hierarchical representations, enhancing the model’s ability to capture code structure and semantics. Additionally, we introduce the Mean Angular Margin (MAM), a novel metric for quantifying the uniformity of code embedding distributions, providing a standardized measure of feature separability. The proposed method achieves a lower MAM, indicating a more discriminative feature representation. This underscores GNN-Coder’s superior ability to distinguish between code snippets, thereby enhancing retrieval accuracy. Experimental results show that GNN-Coder significantly boosts retrieval performance, with a 1%-10% improvement in MRR on the CSN dataset, and a notable 20% gain in zero-shot performance on the CosQA dataset.
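
For reference, the MRR figure quoted above is the mean reciprocal rank, a standard retrieval metric: for each query, take 1 / rank of the first correct result (0 if it is absent) and average over queries. A minimal sketch follows. The function name and toy data are assumptions, and it assumes exactly one relevant item per query.

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1/rank of the relevant item over all queries.

    ranked_lists: one ranked result list per query (best first).
    relevant: the single relevant item for each query.
    A query whose relevant item is not retrieved contributes 0.
    """
    total = 0.0
    for results, target in zip(ranked_lists, relevant):
        for rank, item in enumerate(results, start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical example: hits at rank 1 and rank 2, plus one miss.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
targets = ["a", "y", "missing"]
print(mean_reciprocal_rank(rankings, targets))  # (1 + 1/2 + 0) / 3 = 0.5
```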

[IR-5] Is Relevance Propagated from Retriever to Generator in RAG? ECIR’25

Link: https://arxiv.org/abs/2502.15025
Authors: Fangzheng Tian, Debasis Ganguly, Craig Macdonald
Categories: Information Retrieval (cs.IR)
Comments: 18 pages (including reference), 5 figures, 1 table, 48 references; this paper has been accepted by ECIR’25 as a full paper

Abstract:Retrieval Augmented Generation (RAG) is a framework for incorporating external knowledge, usually in the form of a set of documents retrieved from a collection, as a part of a prompt to a large language model (LLM) to potentially improve the performance of a downstream task, such as question answering. Different from a standard retrieval task’s objective of maximising the relevance of a set of top-ranked documents, a RAG system’s objective is rather to maximise their total utility, where the utility of a document indicates whether including it as a part of the additional contextual information in an LLM prompt improves a downstream task. Existing studies investigate the role of the relevance of a RAG context for knowledge-intensive language tasks (KILT), where relevance essentially takes the form of answer containment. In contrast, in our work, relevance corresponds to that of topical overlap between a query and a document for an information seeking task. Specifically, we make use of an IR test collection to empirically investigate whether a RAG context comprised of topically relevant documents leads to improved downstream performance. Our experiments lead to the following findings: (a) there is a small positive correlation between relevance and utility; (b) this correlation decreases with increasing context sizes (higher values of k in k-shot); and (c) a more effective retrieval model generally leads to better downstream RAG performance.

