This blog post contains the latest paper list retrieved from Arxiv.org on 2025-01-23. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the paper data is retrieved from Arxiv.org daily and updated automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2025-01-23)

A total of 346 papers were updated today, including:

  • Natural Language Processing: 48 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 88 papers (cs.AI)
  • Computer Vision and Pattern Recognition: 52 papers (cs.CV)
  • Machine Learning: 114 papers (cs.LG)

Natural Language Processing

[NLP-0] A Rate-Distortion Framework for Summarization

【Quick Read】: This paper addresses performance evaluation in text summarization by proposing an information-theoretic framework that defines and computes the rate-distortion function of a summarizer. This function provides a theoretical lower bound on summarizer performance, making it possible to quantify performance differences between summarizers. The key elements of the solution are: (1) an iterative procedure, similar to the Blahut-Arimoto algorithm, for computing the rate-distortion function; and (2) a practical method for computing the summarizer rate-distortion function with limited data on real-world text datasets. Experiments confirm that the theoretical framework can effectively evaluate the performance of summarizers used in practice.

Link: https://arxiv.org/abs/2501.13100
Authors: Enes Arda,Aylin Yener
Affiliations: INSPIRE@OhioState Research Center; Dept. of Electrical and Computer Engineering, The Ohio State University
Categories: Information Theory (cs.IT); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces an information-theoretic framework for text summarization. We define the summarizer rate-distortion function and show that it provides a fundamental lower bound on summarizer performance. We describe an iterative procedure, similar to Blahut-Arimoto algorithm, for computing this function. To handle real-world text datasets, we also propose a practical method that can calculate the summarizer rate-distortion function with limited data. Finally, we empirically confirm our theoretical results by comparing the summarizer rate-distortion function with the performances of different summarizers used in practice.
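
The paper's iterative procedure is described as being similar to the classical Blahut-Arimoto algorithm. As a rough illustration of that underlying idea (not the paper's summarizer-specific procedure), the sketch below computes one point on a rate-distortion curve for a discrete toy source; the source distribution, distortion matrix, and Lagrange multiplier `beta` are illustrative assumptions.

```python
import numpy as np

def blahut_arimoto(p_x, dist, beta, n_iter=200, eps=1e-12):
    """Classical Blahut-Arimoto iteration for a discrete rate-distortion
    problem. Returns (rate in bits, expected distortion) for one Lagrange
    multiplier beta; sweeping beta traces out the R(D) curve."""
    n_y = dist.shape[1]
    q_y = np.full(n_y, 1.0 / n_y)                 # output marginal, uniform init
    for _ in range(n_iter):
        # q(y|x) proportional to q(y) * exp(-beta * d(x, y))
        log_q = np.log(q_y + eps)[None, :] - beta * dist
        q_y_x = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q_y_x /= q_y_x.sum(axis=1, keepdims=True)
        q_y = p_x @ q_y_x                         # q(y) = sum_x p(x) q(y|x)
    rate = np.sum(p_x[:, None] * q_y_x * np.log2((q_y_x + eps) / (q_y[None, :] + eps)))
    distortion = np.sum(p_x[:, None] * q_y_x * dist)
    return rate, distortion

# toy source with 3 symbols and Hamming distortion between source and summary
p_x = np.array([0.5, 0.3, 0.2])
dist = 1.0 - np.eye(3)
print(blahut_arimoto(p_x, dist, beta=2.0))
```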

[NLP-1] Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment

【Quick Read】: This paper addresses the safety and reliability of large language models (LLMs) in conversational AI products, in particular the risks posed by malicious user interactions. It proposes fine-tuning and aligning Chain-of-Thought (CoT) responses as input-moderation guardrails. The key idea is to systematically explore several fine-tuning methods using a small amount of training data, so that the model can detect malicious inputs and provide a reasoning for its verdicts, thereby preventing conversational agents from being exploited. Experiments show that even with limited data resources, these alignment procedures significantly improve the safety of conversational AI systems and provide a feasible framework for deploying more secure and trustworthy AI-driven interactions.

Link: https://arxiv.org/abs/2501.13080
Authors: Melissa Kazemi Rad,Huy Nghiem,Andy Luo,Sahil Wadhwa,Mohammad Sorower,Stephen Rawls
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures

Abstract:Large Language Models (LLMs) have demonstrated powerful capabilities that render them valuable in different applications, including conversational AI products. It is paramount to ensure the security and reliability of these products by mitigating their vulnerabilities towards malicious user interactions, which can lead to the exposure of great risks and reputational repercussions. In this work, we present a comprehensive study on the efficacy of fine-tuning and aligning Chain-of-Thought (CoT) responses of different LLMs that serve as input moderation guardrails. We systematically explore various tuning methods by leveraging a small set of training data to adapt these models as proxy defense mechanisms to detect malicious inputs and provide a reasoning for their verdicts, thereby preventing the exploitation of conversational agents. We rigorously evaluate the efficacy and robustness of different tuning strategies to generalize across diverse adversarial and malicious query types. Our experimental results outline the potential of alignment processes tailored to a varied range of harmful input queries, even with constrained data resources. These techniques significantly enhance the safety of conversational AI systems and provide a feasible framework for deploying more secure and trustworthy AI-driven interactions.

[NLP-2] Autonomy-of-Experts Models

【Quick Read】: This paper tackles the separation between the router's decision-making and the experts' execution in Mixture-of-Experts (MoE) models, which leads to suboptimal expert selection and ineffective learning. To address this, the paper proposes a new MoE paradigm called Autonomy-of-Experts (AoE), in which experts autonomously select themselves to process inputs instead of relying on a router for assignment. Concretely, experts pre-compute internal activations for the input and are ranked by the norm of those activations; only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization, ensuring better expert selection and more effective learning. Experiments show that AoE outperforms traditional MoE models on pre-trained language models ranging from 700M to 4B parameters, with comparable efficiency.

Link: https://arxiv.org/abs/2501.13074
Authors: Ang Lv,Ruobing Xie,Yining Qian,Songhao Wu,Xingwu Sun,Zhanhui Kang,Di Wang,Rui Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router’s decision-making and the experts’ execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
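
A minimal sketch of the self-selection idea described above, under assumed shapes and layer internals (the actual AoE implementation, including its low-rank factorization of the input projection, follows the paper): every expert pre-computes an internal activation for each token, experts are ranked by activation norm, and only the top-k complete their forward pass.

```python
import torch
import torch.nn as nn

class AoELayer(nn.Module):
    """Router-free expert selection sketch: experts rank themselves by the
    norm of a pre-computed internal activation. A hypothetical simplification
    of Autonomy-of-Experts; the paper additionally factorizes the input
    projection into low-rank matrices to make the pre-computation cheap."""
    def __init__(self, d_model=64, d_hidden=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_hidden) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_hidden, d_model) * 0.02)

    def forward(self, x):                                   # x: (tokens, d_model)
        # every expert pre-computes its internal activation for each token
        h = torch.einsum('td,edh->teh', x, self.w_in)       # (tokens, experts, d_hidden)
        norms = h.norm(dim=-1)                              # (tokens, experts)
        top = norms.topk(self.top_k, dim=-1).indices        # self-selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                         # only winners proceed
            idx = top[:, k]                                 # chosen expert per token
            h_sel = h[torch.arange(x.size(0)), idx]         # its pre-computed activation
            out = out + torch.einsum('th,thd->td',
                                     torch.relu(h_sel), self.w_out[idx])
        return out

layer = AoELayer()
print(layer(torch.randn(5, 64)).shape)   # torch.Size([5, 64])
```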

[NLP-3] Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning

【Quick Read】: This paper addresses the challenges multimodal large language models (MLLMs) face when processing scientific tables, notably limitations caused by fixed input image resolutions and insufficient numerical reasoning capabilities. It presents a comprehensive framework for multimodal scientific table understanding and reasoning that supports dynamic input image resolutions. The key components are: (1) MMSci-Pre, a domain-specific table-structure learning dataset of 52K scientific table structure recognition samples; (2) MMSci-Ins, an instruction-tuning dataset with 12K samples across three table-based tasks; and (3) MMSci-Eval, a benchmark with 3,114 test samples designed specifically to evaluate numerical reasoning. Experiments show that the domain-specific approach with 52K scientific table images outperforms training on 150K general-domain tables, underscoring the importance of data quality over quantity. The proposed table-based MLLMs with dynamic input resolutions show significant gains in both general table understanding and numerical reasoning, with strong generalization to held-out datasets.

Link: https://arxiv.org/abs/2501.13042
Authors: Bohao Yang,Yingji Zhang,Dong Liu,André Freitas,Chenghua Lin
Affiliations: The University of Manchester; Tencent Timi Studio; Idiap Research Institute
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent large language models (LLMs) have advanced table understanding capabilities but rely on converting tables into text sequences. While multimodal large language models (MLLMs) enable direct visual processing, they face limitations in handling scientific tables due to fixed input image resolutions and insufficient numerical reasoning capabilities. We present a comprehensive framework for multimodal scientific table understanding and reasoning with dynamic input image resolutions. Our framework consists of three key components: (1) MMSci-Pre, a domain-specific table structure learning dataset of 52K scientific table structure recognition samples, (2) MMSci-Ins, an instruction tuning dataset with 12K samples across three table-based tasks, and (3) MMSci-Eval, a benchmark with 3,114 testing samples specifically designed to evaluate numerical reasoning capabilities. Extensive experiments demonstrate that our domain-specific approach with 52K scientific table images achieves superior performance compared to 150K general-domain tables, highlighting the importance of data quality over quantity. Our proposed table-based MLLMs with dynamic input resolutions show significant improvements in both general table understanding and numerical reasoning capabilities, with strong generalisation to held-out datasets. Our code and data are publicly available at this https URL.

[NLP-4] Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament

【Quick Read】: This paper addresses the inconsistent and arbitrary scoring of traditional reward models used in test-time scaling of large language models (LLMs). Conventional reward models assign absolute scores to candidate solutions, which makes the scores unreliable. The paper proposes a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament: the model evaluates the correctness of two candidate solutions simultaneously, which removes the arbitrariness of absolute scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, Pairwise RM iteratively eliminates incorrect solutions through pairwise comparisons. The authors also construct a large-scale dataset of 443K pairwise comparisons and train the Pairwise RM via supervised fine-tuning. Experiments show significant improvements over traditional discriminative reward models on MATH-500 and the Olympiad Bench, including a 40% to 60% relative improvement on the top 50% most challenging problems.

Link: https://arxiv.org/abs/2501.13007
Authors: Yantao Liu,Zijun Yao,Rui Min,Yixin Cao,Lei Hou,Juanzi Li
Affiliations: Fudan University; Tsinghua University; Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: in progress work

Abstract:Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given one math problem, Pairwise RM evaluates two candidate solutions’ correctness simultaneously. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, Pairwise RM conducts pairwise comparisons between candidate solutions and eliminates the incorrect ones iteratively. We construct a large-scale dataset of 443K pairwise comparisons derived from NumiaMath and annotated using gemini-1.5-flash, and train the Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models, with a 40% to 60% relative improvement achieved on the top 50% most challenging problems.
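
A minimal sketch of the knockout loop described in the abstract; `pairwise_judge(a, b)` is a hypothetical stand-in for the trained Pairwise RM and is assumed to return the preferred of two candidate solutions.

```python
import random

def knockout(candidates, pairwise_judge):
    """Best-of-N selection via a knockout tournament: repeatedly pair up
    candidate solutions and keep the one the pairwise reward model judges
    better, until a single winner remains."""
    pool = list(candidates)
    random.shuffle(pool)                      # random bracket seeding
    while len(pool) > 1:
        next_round = []
        if len(pool) % 2 == 1:                # odd candidate out gets a bye
            next_round.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            next_round.append(pairwise_judge(a, b))
        pool = next_round
    return pool[0]

# toy judge: prefer the longer "solution" (placeholder for the Pairwise RM)
winner = knockout(["s1", "solution-2", "sol3", "s4"],
                  lambda a, b: max(a, b, key=len))
print(winner)   # solution-2
```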

[NLP-5] Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities

【Quick Read】: This paper compares data generated by mono- and multilingual large language models (LLMs) with data provided by human participants in an experimental setting, in order to probe how LLMs handle discourse biases. It focuses on Implicit Causality verbs and uses them as a benchmark for broader discourse understanding capabilities. Four experiments examine three phenomena: (i) the establishment of coreference relations, (ii) the establishment of coherence relations, and (iii) the use of particular referring expressions. The study finds that only the largest monolingual LLM (German Bloom 6.4B) displays more human-like coreference biases; no LLM shows the explanation bias usually found for humans in coherence relations; and for referring expressions, all LLMs prefer simpler forms when referring to subjects than to objects, but no bias effect matching human behavior is found. The key contribution is to expose, through experimental comparison, the limitations of LLMs in handling discourse biases and to provide a benchmark for future model improvement.

Link: https://arxiv.org/abs/2501.12980
Authors: Florian Kankowski,Torgrim Solstad,Sina Zarriess,Oliver Bott
Affiliations: Bielefeld University/CRC 1646
Categories: Computation and Language (cs.CL)
Comments: 38 pages, 8 figures

Abstract:In this paper, we compare data generated with mono- and multilingual LLMs spanning a range of model sizes with data provided by human participants in an experimental setting investigating well-established discourse biases. Beyond the comparison as such, we aim to develop a benchmark to assess the capabilities of LLMs with discourse biases as a robust proxy for more general discourse understanding capabilities. More specifically, we investigated Implicit Causality verbs, for which psycholinguistic research has found participants to display biases with regard to three phenomena: the establishment of (i) coreference relations (Experiment 1), (ii) coherence relations (Experiment 2), and (iii) the use of particular referring expressions (Experiments 3 and 4). With regard to coreference biases we found only the largest monolingual LLM (German Bloom 6.4B) to display more human-like biases. For coherence relations, no LLM displayed the explanation bias usually found for humans. For referring expressions, all LLMs displayed a preference for referring to subject arguments with simpler forms than to objects. However, no bias effect on referring expression was found, as opposed to recent studies investigating human biases.

[NLP-6] FlanEC: Exploring Flan-T5 for Post-ASR Error Correction

【Quick Read】: This paper addresses Generative Speech Error Correction (GenSEC) as a post-processing step for automatic speech recognition (ASR), aiming to improve the linguistic correctness, accuracy, and grammaticality of ASR outputs. The key contribution is FlanEC, an encoder-decoder model based on Flan-T5 that maps the n-best hypothesis list produced by an ASR model into a single corrected output sentence. The study investigates whether scaling the training data and incorporating diverse datasets yields significant improvements in post-ASR error correction. FlanEC is evaluated on the HyPoradise dataset with a comprehensive analysis of its effectiveness in this domain, and the approach is assessed under different settings to evaluate model scalability and efficiency, offering insights into the potential of instruction-tuned encoder-decoder models for this task.

Link: https://arxiv.org/abs/2501.12979
Authors: Moreno La Quatra,Valerio Mario Salerno,Yu Tsao,Sabato Marco Siniscalchi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at the 2024 IEEE Workshop on Spoken Language Technology (SLT) - GenSEC Challenge

Abstract:In this paper, we present an encoder-decoder model leveraging Flan-T5 for post-Automatic Speech Recognition (ASR) Generative Speech Error Correction (GenSEC), and we refer to it as FlanEC. We explore its application within the GenSEC framework to enhance ASR outputs by mapping n-best hypotheses into a single output sentence. By utilizing n-best lists from ASR models, we aim to improve the linguistic correctness, accuracy, and grammaticality of final ASR transcriptions. Specifically, we investigate whether scaling the training data and incorporating diverse datasets can lead to significant improvements in post-ASR error correction. We evaluate FlanEC using the HyPoradise dataset, providing a comprehensive analysis of the model’s effectiveness in this domain. Furthermore, we assess the proposed approach under different settings to evaluate model scalability and efficiency, offering valuable insights into the potential of instruction-tuned encoder-decoder models for this task.
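
The n-best-to-sentence mapping can be sketched with an off-the-shelf Flan-T5 checkpoint from Hugging Face; the checkpoint and the prompt template below are illustrative guesses, not the fine-tuned FlanEC weights or its actual training format.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative only: a generic Flan-T5 checkpoint and a guessed prompt
# template for mapping n-best ASR hypotheses to one corrected transcription.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

nbest = [
    "i scream for ice cream",
    "eye scream for ice cream",
    "i scream four ice cream",
]
prompt = ("Correct the transcription using these ASR hypotheses:\n"
          + "\n".join(f"{i+1}. {h}" for i, h in enumerate(nbest)))

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```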

[NLP-7] OnionEval: An Unified Evaluation of Fact-conflicting Hallucination for Small-Large Language Models

【Quick Read】: This paper targets fact-conflicting hallucination in small large language models (SLLMs). Although SLLMs perform well on many tasks, they share the hallucination tendency of their larger counterparts and show widely varying performance across benchmarks. The paper introduces OnionEval, a multi-layer structured evaluation framework with a dedicated metric, the context-influence score (CI), to effectively assess the fact-conflicting hallucination tendencies of SLLMs across different contextual levels. Experimental results reveal that SLLMs excel at factual analysis but struggle with context reasoning. Further investigation shows that a simple Chain-of-Thought strategy can significantly reduce these limitations, improving the practical usefulness of SLLMs in real-world applications.

Link: https://arxiv.org/abs/2501.12975
Authors: Chongren Sun,Yuran Li,Di Wu,Benoit Boulet
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are highly capable but require significant computational resources for both training and inference. Within the LLM family, smaller models (those with fewer than 10 billion parameters) also perform well across various tasks. However, these smaller models share similar limitations to their larger counterparts, including the tendency to hallucinate. Despite the existence of many benchmarks to evaluate hallucination in LLMs, few have specifically focused on small LLMs (SLLMs). Additionally, SLLMs show widely varying performance across different benchmarks. In this paper, we introduce OnionEval, a multi-layer structured framework with a specific metric called the context-influence score (CI), designed to effectively assess the fact-conflicting hallucination tendencies of small LLMs across different contextual levels. Our experimental results reveal a key feature of SLLMs: they excel in factual analysis but face challenges with context reasoning. Further investigation shows that a simple Chain-of-Thought strategy can significantly reduce these limitations, improving the practical usefulness of SLLMs in real-world applications.

[NLP-8] Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference

【Quick Read】: This paper addresses the increased computational cost and degraded performance that large language models (LLMs) face when processing long-context inputs. It proposes a training-free prompt compression method called Evaluator Head-based Prompt Compression (EHPC). The key idea is to identify and exploit specific attention heads in the Transformer architecture, the evaluator heads, which can select the tokens in long inputs that matter most for inference. During the pre-filling stage, the model uses only the first few layers containing evaluator heads to quickly "skim through" the input prompt and then passes only the important tokens to the model for inference, significantly reducing computational complexity and the cost of commercial API calls. Experiments show that EHPC achieves state-of-the-art results on the two mainstream benchmarks of prompt compression and long-context inference acceleration, and is competitive with key-value cache-based acceleration methods, demonstrating its potential to improve the efficiency of LLMs on long-context tasks.

Link: https://arxiv.org/abs/2501.12959
Authors: Weizhi Fei,Xueyan Niu,Guoqing Xie,Yingqing Liu,Bo Bai,Wei Han
Affiliations: Department of Mathematical Sciences, Tsinghua University; Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd.; Architecture & Design, ICT Products & Solutions, Huawei Technologies Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transformer-based LLMs, which we designate as evaluator heads, that are capable of selecting tokens in long inputs that are most significant for inference. Building on this discovery, we develop EHPC, an Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly “skim through” input prompts by leveraging only the first few layers with evaluator heads during the pre-filling stage, subsequently passing only the important tokens to the model for inference. EHPC achieves state-of-the-art results across two mainstream benchmarks: prompt compression and long-context inference acceleration. Consequently, it effectively reduces the complexity and costs associated with commercial API calls. We further demonstrate that EHPC attains competitive results compared to key-value cache-based acceleration methods, thereby highlighting its potential to enhance the efficiency of LLMs for long-context tasks.
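
A rough sketch of the selection step: read the attention weights of a chosen head, score prompt tokens by the attention they receive, and keep the top fraction. The layer/head indices, the scoring rule, and the keep ratio are illustrative assumptions, not the paper's evaluator-head identification procedure; EHPC also stops the pre-fill after the first few layers, whereas for simplicity this sketch runs the full model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative sketch with GPT-2; the paper identifies evaluator heads in
# larger LLMs, and the layer/head chosen here are arbitrary assumptions.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Long context goes here. The key fact is that the meeting is on Friday."
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    attn = model(ids, output_attentions=True).attentions  # per layer: (1, heads, seq, seq)

layer, head, keep_ratio = 2, 0, 0.5                # assumed "evaluator" head
scores = attn[layer][0, head].sum(dim=0)           # attention each token receives
keep = scores.topk(int(ids.size(1) * keep_ratio)).indices.sort().values
print(tokenizer.decode(ids[0, keep]))              # compressed prompt, original order
```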

[NLP-9] Multifractal hopscotch in “Hopscotch” by Julio Cortazar

【Quick Read】: This paper studies how punctuation introduces correlations in written natural language and shapes the overall effectiveness, expressiveness, and readability of texts, focusing on how the distribution of sentence-ending punctuation determines complexity features of written language. The sentence length variability (SLV) time series of Julio Cortazar's novel "Hopscotch" are subjected to quantitative analysis to identify their distribution type, long-memory effects, and potential multiscale patterns. The key element is a statistical analysis of SLV dynamics in the original Spanish text and its English and Polish translations, which reveals rich multifractality with a left-sided asymmetry in all versions. The analysis both confirms the influence of punctuation on textual complexity and shows how different language versions and chapter orderings affect the statistical properties of the text.

Link: https://arxiv.org/abs/2501.12955
Authors: Jakub Dec,Michał Dolina,Stanisław Drożdż,Jarosław Kwapień,Tomasz Stanisz
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Punctuation is the main factor introducing correlations in natural language written texts and it crucially impacts their overall effectiveness, expressiveness, and readability. Punctuation marks at the end of sentences are of particular importance as their distribution can determine various complexity features of written natural language. Here, the sentence length variability (SLV) time series representing “Hopscotch” by Julio Cortazar are subjected to quantitative analysis with an attempt to identify their distribution type, long-memory effects, and potential multiscale patterns. The analyzed novel is an important and innovative piece of literature whose essential property is freedom of movement between its building blocks given to a reader by the author. The statistical consequences of this freedom are closely investigated in both the original, Spanish version of the novel, and its translations into English and Polish. Clear evidence of rich multifractality in the SLV dynamics, with a left-sided asymmetry, however, is observed in all three language versions as well as in the versions with differently ordered chapters.

[NLP-10] Punctuation patterns in “Finnegans Wake” by James Joyce are largely translation-invariant

【Quick Read】: This paper examines complexity features of punctuation in natural language texts, in particular the finding that distances between punctuation marks (measured in words) quite universally follow the Weibull distribution known from survival analysis. Different languages occupy characteristic parameter values, and translated texts adopt the quantitative punctuation characteristics of the target language. James Joyce's "Finnegans Wake", however, follows such an extreme distribution from the Weibull family that its hazard function is clearly decreasing, and the distances between sentence-ending punctuation marks display an almost perfect multifractal organization, so far found nowhere else in the literature. Analyzing several translations of "Finnegans Wake" (Dutch, French, German, Polish, Russian), the paper shows that the punctuation characteristics of this work remain largely translation-invariant, contrary to the common case. This provides further evidence that "Finnegans Wake" is a translinguistic work in this respect as well, in line with Joyce's original intention. The key contribution is to reveal, through distributional analysis of punctuation, the linguistic uniqueness of "Finnegans Wake" and its invariance under translation.

Link: https://arxiv.org/abs/2501.12954
Authors: Krzysztof Bartnicki,Stanisław Drożdż,Jarosław Kwapień,Tomasz Stanisz
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The complexity characteristics of texts written in natural languages are significantly related to the rules of punctuation. In particular, the distances between punctuation marks measured by the number of words quite universally follow the family of Weibull distributions known from survival analyses. However, the values of two parameters marking specific forms of these distributions distinguish specific languages. This is such a strong constraint that the punctuation distributions of texts translated from the original language into another adopt quantitative characteristics of the target language. All these changes take place within Weibull distributions such that the corresponding hazard functions are always increasing. Recent previous research shows that James Joyce’s famous “Finnegans Wake” is subject to such extreme distribution from the Weibull family that the corresponding hazard function is clearly decreasing. At the same time, the distances of sentence ending punctuation marks, determining the variability of sentence length, have an almost perfect multifractal organization, so far to such an extent found nowhere else in the literature. In the present contribution based on several available translations (Dutch, French, German, Polish, Russian) of “Finnegans Wake”, it is shown that the punctuation characteristics of this work remain largely translation invariant, contrary to the common cases. These observations may constitute further evidence that “Finnegans Wake” is a translinguistic work in this respect as well, in line with Joyce’s original intention.
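
A small sketch, on assumed toy data, of the core measurement both punctuation papers rely on: fit a Weibull distribution to the word-counts between punctuation marks and inspect the shape parameter, since the Weibull hazard function h(x) = (k/λ)(x/λ)^(k-1) is increasing when the shape k > 1 and decreasing when k < 1 (the "Finnegans Wake" regime).

```python
import numpy as np
from scipy import stats

# toy inter-punctuation distances in words; real data would come from a text
rng = np.random.default_rng(0)
distances = rng.weibull(0.8, size=5000) * 10 + 1   # shape < 1: decreasing hazard

# fit a two-parameter Weibull (location fixed at zero)
shape, loc, scale = stats.weibull_min.fit(distances, floc=0)
print(f"shape k = {shape:.2f}, scale = {scale:.2f}")

# hazard h(x) = (k/scale) * (x/scale)**(k-1):
# increasing for k > 1, decreasing for k < 1
x = np.linspace(1, 30, 5)
hazard = (shape / scale) * (x / scale) ** (shape - 1)
print("hazard at", x, "->", np.round(hazard, 4))
```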

[NLP-11] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

【Quick Read】: This paper addresses the challenges faced by models trained with large-scale reinforcement learning (RL) on reasoning tasks, such as poor readability and language mixing. It presents two models: DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, trained via large-scale RL without supervised fine-tuning (SFT), demonstrates remarkable reasoning capabilities but suffers from the issues above. To address them and further improve reasoning performance, the paper introduces DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, the authors also open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. The key to the solution is optimizing the RL process with multi-stage training and cold-start data, improving both readability and reasoning performance.

Link: https://arxiv.org/abs/2501.12948
Authors: DeepSeek-AI,Daya Guo,Dejian Yang,Haowei Zhang,Junxiao Song,Ruoyu Zhang,Runxin Xu,Qihao Zhu,Shirong Ma,Peiyi Wang,Xiao Bi,Xiaokang Zhang,Xingkai Yu,Yu Wu,Z.F. Wu,Zhibin Gou,Zhihong Shao,Zhuoshu Li,Ziyi Gao,Aixin Liu,Bing Xue,Bingxuan Wang,Bochao Wu,Bei Feng,Chengda Lu,Chenggang Zhao,Chengqi Deng,Chenyu Zhang,Chong Ruan,Damai Dai,Deli Chen,Dongjie Ji,Erhang Li,Fangyun Lin,Fucong Dai,Fuli Luo,Guangbo Hao,Guanting Chen,Guowei Li,H. Zhang,Han Bao,Hanwei Xu,Haocheng Wang,Honghui Ding,Huajian Xin,Huazuo Gao,Hui Qu,Hui Li,Jianzhong Guo,Jiashi Li,Jiawei Wang,Jingchang Chen,Jingyang Yuan,Junjie Qiu,Junlong Li,J.L. Cai,Jiaqi Ni,Jian Liang,Jin Chen,Kai Dong,Kai Hu,Kaige Gao,Kang Guan,Kexin Huang,Kuai Yu,Lean Wang,Lecong Zhang,Liang Zhao,Litong Wang,Liyue Zhang,Lei Xu,Leyi Xia,Mingchuan Zhang,Minghua Zhang,Minghui Tang,Meng Li,Miaojun Wang,Mingming Li,Ning Tian,Panpan Huang,Peng Zhang,Qiancheng Wang,Qinyu Chen,Qiushi Du,Ruiqi Ge,Ruisong Zhang,Ruizhe Pan,Runji Wang,R.J. Chen,R.L. Jin,Ruyi Chen,Shanghao Lu,Shangyan Zhou,Shanhuang Chen,Shengfeng Ye,Shiyu Wang,Shuiping Yu,Shunfeng Zhou,Shuting Pan,S.S. Li
Affiliations: DeepSeek-AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

[NLP-12] Ontology-Enhanced Educational Annotation Activities

【Quick Read】: This paper addresses the problem that, in technology-enhanced learning environments, students annotating documents often lack the domain knowledge and expert analysis capability required, producing irrelevant, incorrect, or decontextualized annotations while ignoring other relevant aspects. The key solution is to guide students' annotation activities with a guiding annotation ontology. With this ontology-enhanced annotation paradigm, students understand document content better, perform exhaustive content analysis, and develop meta-reflective thinking. The paper describes the authors' annotation tool, @note, which fully implements this paradigm, and provides experimental evidence of its effectiveness in improving academic performance through a pilot study on critical literary annotation.

Link: https://arxiv.org/abs/2501.12943
Authors: Joaquín Gayoso-Cabada,María Goicoechea-de-Jorge,Mercedes Gómez-Albarrán,Amelia Sanz-Cabrerizo,Antonio Sarasa-Cabezuelo,José-Luis Sierra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

Abstract:Information and communications technology and technology-enhanced learning have unquestionably transformed traditional teaching-learning processes and are positioned as key factors to promote quality education, one of the basic sustainable development goals of the 2030 agenda. Document annotation, which was traditionally carried out with pencil and paper and currently benefits from digital document annotation tools, is a representative example of this transformation. Using document annotation tools, students can enrich the documents with annotations that highlight the most relevant aspects of these documents. As the conceptual complexity of the learning domain increases, the annotation of the documents may require comprehensive domain knowledge and an expert analysis capability that students usually lack. Consequently, a proliferation of irrelevant, incorrect, and/or poorly decontextualized annotations may appear, while other relevant aspects are completely ignored by the students. The main hypothesis proposed by this paper is that the use of a guiding annotation ontology in the annotation activities is a keystone aspect to alleviate these shortcomings. Consequently, comprehension is improved, exhaustive content analysis is promoted, and meta-reflective thinking is developed. To test this hypothesis, we describe our own annotation tool, @note, which fully implements this ontology-enhanced annotation paradigm, and we provide experimental evidence about how @note can improve academic performance via a pilot study concerning critical literary annotation.

[NLP-13] FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces

【Quick Read】: This paper addresses the intricate decision-making processes of virtual film production, including scriptwriting, virtual cinematography, and precise actor positioning and actions. It proposes FilmAgent, an LLM-based multi-agent collaborative framework for end-to-end film automation in constructed 3D virtual spaces. FilmAgent simulates crew roles such as directors, screenwriters, actors, and cinematographers, and covers key stages of the film production workflow: (1) idea development, which turns brainstormed ideas into structured story outlines; (2) scriptwriting, which elaborates dialogue and character actions for each scene; and (3) cinematography, which determines the camera setup for each shot. Through iterative feedback and revision among multiple agents, FilmAgent verifies intermediate scripts and reduces hallucinations. Human evaluation shows FilmAgent outperforming all baselines across all aspects, demonstrating the feasibility of multi-agent collaboration in filmmaking. Despite using the less advanced GPT-4o model, FilmAgent surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. The paper also discusses the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and FilmAgent in filmmaking.

Link: https://arxiv.org/abs/2501.12909
Authors: Zhenran Xu,Longyue Wang,Jifang Wang,Zhouyi Li,Senbao Shi,Xue Yang,Yiyu Wang,Baotian Hu,Jun Yu,Min Zhang
Affiliations: Harbin Institute of Technology (Shenzhen); Tsinghua University
Categories: Computation and Language (cs.CL); Graphics (cs.GR); Multiagent Systems (cs.MA)
Comments: Work in progress. Project Page: this https URL

Abstract:Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI’s text-to-video model Sora and our FilmAgent in filmmaking.

[NLP-14] Architectural Fusion Through Contextual Partitioning in Large Language Models: A Novel Approach to Parameterized Knowledge Integration

【Quick Read】: This paper addresses redundancy and computational inefficiency in the architectural design of large-scale computational models, in particular the limitations of conventional parameter optimization techniques on complex language tasks: the need for external fine-tuning and insufficient adaptability. The proposed solution, Contextual Partitioning, dynamically segments parameters into context-aware regions to achieve task-specific specialization. Through adaptive parameter allocation mechanisms, the model adjusts dynamically to the linguistic features of the input data, reducing redundancy and improving computational efficiency. Experiments report substantial gains in accuracy, perplexity, and contextual coherence across a variety of linguistic tasks, together with notable reductions in memory usage and training time. The approach requires no external fine-tuning and operates autonomously, further improving its scalability and adaptability in practical applications.

Link: https://arxiv.org/abs/2501.12901
Authors: Offa Kingsleigh,Alfred Abercrombie,David Woolstencroft,Beorhtric Meadowcroft,Marcus Irvin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Contextual Partitioning introduces an innovative approach to enhancing the architectural design of large-scale computational models through the dynamic segmentation of parameters into context-aware regions. This methodology emphasizes the importance of task-specific specialization, achieved through adaptive parameter allocation mechanisms that align with the linguistic features of input data. Experimental evaluations demonstrated substantial improvements in accuracy, perplexity, and contextual coherence across a variety of linguistic tasks, highlighting the adaptability and scalability of the proposed framework. By reducing redundancy and enhancing computational efficiency, Contextual Partitioning not only streamlines model operations but also expands the scope of applications for advanced language processing systems. The approach operates autonomously, requiring no external fine-tuning, thereby addressing a significant limitation in conventional parameter optimization techniques. Empirical results demonstrate the effectiveness of gradient-driven segmentation, enabling models to dynamically recalibrate and specialize in response to task-specific demands. Furthermore, resource utilization metrics reveal notable reductions in memory usage and training times, confirming the efficiency of the approach. Observations from qualitative analyses illustrate improved contextual coherence and logical flow in generated outputs, reinforcing the practical value of this technique. The findings collectively demonstrate the potential for Contextual Partitioning to redefine the scalability and adaptability of computational language architectures in diverse and complex domains.

[NLP-15] Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

【Quick Read】: This paper addresses the difficulty of adapting large language models (LLMs) to human preferences quickly at inference time, without retraining. Existing methods typically rely on purely numerical reward signals and lack flexibility. The authors propose Test-time Preference Optimization (TPO), a framework that, during inference, translates reward signals into textual critiques and uses them as textual rewards to iteratively refine the model output. The key is to exploit the LLM's innate ability to interpret and act upon reward signals, progressively improving alignment with human preferences without updating model parameters. Experiments on benchmarks covering instruction following, preference alignment, safety, and mathematics show that TPO substantially improves alignment and scales well with both search width and depth.

Link: https://arxiv.org/abs/2501.12895
Authors: Yafu Li,Xuyang Hu,Xiaoye Qu,Linjie Li,Yu Cheng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 43 pages; work in progress

Abstract:Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLM to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at this https URL.
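
The iterative refinement described above can be sketched as a simple loop; `generate`, `critique`, and `revise` are hypothetical stand-ins for LLM calls (the reward model producing a textual critique, and the policy model rewriting its answer), not the paper's actual API.

```python
def tpo(prompt, generate, critique, revise, steps=3, width=4):
    """Sketch of test-time preference optimization: sample several responses,
    convert the reward signal into a textual critique, and let the model
    rewrite its answers conditioned on that critique. No parameters are
    updated; all three callables are hypothetical LLM calls."""
    responses = [generate(prompt) for _ in range(width)]
    for _ in range(steps):
        feedback = critique(prompt, responses)     # textual reward, not a score
        responses = [revise(prompt, r, feedback) for r in responses]
    return responses                               # final pick left to the reward model

# toy run with string-manipulating stand-ins for the three LLM calls
out = tpo("Q", lambda p: p + "?", lambda p, rs: "be concise",
          lambda p, r, f: r + "!", steps=1, width=2)
print(out)   # ['Q?!', 'Q?!']
```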

[NLP-16] WisdomBot: Tuning Large Language Models with Artificial Intelligence Knowledge

【Quick Read】: This paper addresses the suboptimal performance of large language models (LLMs) in education, where the main challenges include the need for more specialized knowledge, personalized learning experiences, and concise explanations of complex concepts. To tackle these issues, the paper presents WisdomBot, a novel education-oriented LLM that combines the power of LLMs with educational theories so they can be seamlessly integrated into educational contexts. The key is to use self-instructed knowledge concepts and instructions guided by Bloom's Taxonomy as training data, and to introduce two enhancements during inference, local knowledge base retrieval augmentation and search engine retrieval augmentation, to improve the accuracy and professionalism of responses to factual questions. Applying the approach to several Chinese LLMs confirms that the fine-tuned models generate more reliable and professional responses.

Link: https://arxiv.org/abs/2501.12877
Authors: Jingyuan Chen,Tao Wu,Wei Ji,Fei Wu
Affiliations: Zhejiang University; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: Frontiers of Digital Education

Abstract:Large language models (LLMs) have emerged as powerful tools in natural language processing (NLP), showing a promising future of artificial generated intelligence (AGI). Despite their notable performance in the general domain, LLMs have remained suboptimal in the field of education, owing to the unique challenges presented by this domain, such as the need for more specialized knowledge, the requirement for personalized learning experiences, and the necessity for concise explanations of complex concepts. To address these issues, this paper presents a novel LLM for education named WisdomBot, which combines the power of LLMs with educational theories, enabling their seamless integration into educational contexts. To be specific, we harness self-instructed knowledge concepts and instructions under the guidance of Bloom’s Taxonomy as training data. To further enhance the accuracy and professionalism of model’s response on factual questions, we introduce two key enhancements during inference, i.e., local knowledge base retrieval augmentation and search engine retrieval augmentation during inference. We substantiate the effectiveness of our approach by applying it to several Chinese LLMs, thereby showcasing that the fine-tuned models can generate more reliable and professional responses.

[NLP-17] ACEBench: Who Wins the Match Point in Tool Learning?

【Quick Read】: This paper addresses the limitations of existing systems for evaluating the function calling capabilities of large language models (LLMs): limited evaluation scenarios that lack multi-turn dialogue contexts; narrow evaluation dimensions without fine-grained assessment of function calls; and reliance on LLMs or real API executions for result evaluation, which introduces significant overhead. The proposed solution is ACEBench, a comprehensive evaluation system designed to cover a wide spectrum of function calling scenarios, grouped into three types by evaluation methodology: Normal (function calls in basic scenarios), Special (function calls under vague or incomplete instructions), and Agent (multi-agent interactions simulating function calling in real-world multi-turn interactions). Extensive experiments on ACEBench analyze various LLMs in depth and provide a fine-grained analysis of error causes across different data types.

Link: https://arxiv.org/abs/2501.12851
Authors: Chen Chen,Xinlong Hao,Weiwen Liu,Xu Huang,Xingshan Zeng,Shuai Yu,Dexun Li,Shuai Wang,Weinan Gan,Yuefeng Huang,Xinzhi Wang,Defu Lian,Baoqun Yin,Yasheng Wang,Wu Liu
Affiliations: University of Science and Technology of China; Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated significant potential in decision-making and reasoning, especially when combined with various tools to effectively solve complex problems. However, existing evaluation systems for assessing LLM function calling capabilities have several limitations: (1) limited evaluation scenarios, lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, lacking detailed assessments for fine-grained function calls; (3) relying on LLMs or real API executions for result evaluation, which introduces significant overhead. To address these issues, we propose a comprehensive evaluation system named ACEBench. This system is meticulously designed to encompass a wide spectrum of function calling scenarios. Moreover, it categorizes these scenarios into three primary types according to the evaluation methodology: Normal, Special, and Agent. Normal evaluates function calls in basic scenarios; Special evaluates function calls in scenarios with vague or incomplete instructions; Agent introduces multi-agent interactions to simulate function calling evaluation in real-world multi-turn interactions. We conducted extensive experiments on ACEBench, analyzing various LLMs in-depth and performing a more granular analysis of error causes across different data types.

[NLP-18] Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home

【Quick Read】: This paper addresses the increased computational cost and the risk of introducing irrelevant information when using Retrieval Augmented Generation (RAG) for question answering (QA). Although RAG improves QA correctness and reduces hallucinations in large language models (LLMs), it is computationally expensive and not always necessary. The key contribution is a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, evaluated for QA performance, self-knowledge, and efficiency. The findings show that uncertainty estimation techniques often outperform complex retrieval-augmented pipelines in efficiency and self-knowledge while maintaining comparable QA performance.

Link: https://arxiv.org/abs/2501.12835
Authors: Viktor Moskvoretskii,Maria Lysyuk,Mikhail Salnikov,Nikolay Ivanov,Sergey Pletenev,Daria Galimzianova,Nikita Krayko,Vasily Konovalov,Irina Nikishina,Alexander Panchenko
Affiliations: Skoltech; AIRI; HSE University; MTS AI; MIPT; University of Hamburg
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The code and data will be published soon

Abstract:Retrieval Augmented Generation (RAG) improves the correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet greatly increases computational costs. Besides, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs’ intrinsic knowledge with external information by appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.
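
One of the simplest uncertainty-estimation gates in this family can be sketched as: retrieve only when the model's token-level entropy on a draft answer is high. The mean-entropy rule, the threshold, and the GPT-2 stand-in are illustrative choices, not a specific method from the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def mean_token_entropy(text):
    """Average next-token predictive entropy (nats): a cheap uncertainty score."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, :-1]         # predictions for next tokens
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs.clamp_min(-30.0)).sum(dim=-1)
    return entropy.mean().item()

question = "Who wrote Hopscotch?"
draft = "Hopscotch was written by Julio Cortazar."
THRESHOLD = 4.0                                    # illustrative cutoff
if mean_token_entropy(question + " " + draft) > THRESHOLD:
    print("uncertain -> fall back to retrieval-augmented generation")
else:
    print("confident -> answer from parametric knowledge alone")
```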

[NLP-19] Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek

【Quick Read】: This paper addresses the persistent challenges that lesser-resourced languages face in NLP, including limited datasets, biases inherited from high-resource languages, and the need for domain-specific solutions. For Modern Greek, it makes three key contributions. First, it evaluates open-source (Llama-70b) and closed-source (GPT-4o mini) large language models (LLMs) on seven core NLP tasks, revealing task-specific strengths, weaknesses, and parity in their performance. Second, it reframes Authorship Attribution as a tool to assess potential data usage by LLMs in pre-training; the observed high 0-shot accuracy raises ethical questions about data provenance. Third, a legal NLP case study shows that a Summarize, Translate, and Embed (STE) methodology outperforms the traditional TF-IDF approach for clustering long legal texts. Together, these contributions provide a roadmap for advancing NLP in lesser-resourced languages, bridging gaps in model evaluation, task innovation, and real-world impact.

Link: https://arxiv.org/abs/2501.12826
Authors: John Pavlopoulos,Juli Bakagianni,Kanella Pouli,Maria Gavriilidou
Affiliations: AUEB & Archimedes/Athena RC, Greece; University of Ioannina, Greece; ILSP/Athena RC, Greece
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NLP, Modern Greek, benchmark, machine learning, language resources

Abstract:Natural Language Processing (NLP) for lesser-resourced languages faces persistent challenges, including limited datasets, inherited biases from high-resource languages, and the need for domain-specific solutions. This study addresses these gaps for Modern Greek through three key contributions. First, we evaluate the performance of open-source (Llama-70b) and closed-source (GPT-4o mini) large language models (LLMs) on seven core NLP tasks with dataset availability, revealing task-specific strengths, weaknesses, and parity in their performance. Second, we expand the scope of Greek NLP by reframing Authorship Attribution as a tool to assess potential data usage by LLMs in pre-training, with high 0-shot accuracy suggesting ethical implications for data provenance. Third, we showcase a legal NLP case study, where a Summarize, Translate, and Embed (STE) methodology outperforms the traditional TF-IDF approach for clustering long legal texts. Together, these contributions provide a roadmap to advance NLP in lesser-resourced languages, bridging gaps in model evaluation, task innovation, and real-world impact.

[NLP-20] Generation of Standardized E-Learning Contents from Digital Medical Collections

【Quick Read】: This paper addresses how to transform the huge amount of medical knowledge available in existing online medical collections into standardized learning packages ready to be integrated into popular e-learning platforms. The key to the solution is a tool called Clavy, which can retrieve pieces of content from medical collections, transform them into meaningful learning units, and export them as standardized learning packages. The paper demonstrates the feasibility of the approach by applying it to the generation of IMS content packages from MedPix, a popular online database of medical cases in radiology.

Link: https://arxiv.org/abs/2501.12794
Authors: Felix Buendía,Joaquín Gayoso-Cabada,José-Luis Sierra
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we describe an approach to transforming the huge amount of medical knowledge available in existing online medical collections into standardized learning packages ready to be integrated into the most popular e-learning platforms. The core of our approach is a tool called Clavy, which makes it possible to retrieve pieces of content in medical collections, to transform this content into meaningful learning units, and to export it in the form of standardized learning packages. In addition to describing the approach, we demonstrate its feasibility by applying it to the generation of IMS content packages from MedPix, a popular online database of medical cases in the domain of radiology.

[NLP-21] Generating Diverse QA Benchmarks for RAG Evaluation with DataMorgana

【Quick Read】: This paper addresses the lack of diversity in the questions produced by existing benchmark-generation tools when evaluating Retrieval-Augmented Generation (RAG) systems in domain-specific contexts. Existing general-purpose approaches generate QA pairs with an LLM; although individual questions can be good, they are typically not diverse enough to cover the many ways real users interact with a RAG system. The key to the proposed tool, DataMorgana, is a lightweight two-stage process that generates highly customizable and diverse synthetic QA benchmarks tailored to RAG applications. DataMorgana enables detailed configuration of user and question categories and controls their distribution within the benchmark, producing question sets that reflect the expected traffic with lexical, syntactic, and semantic diversity. Experiments on domain-specific and general-knowledge corpora show that it surpasses existing tools and approaches in generation diversity.

Link: https://arxiv.org/abs/2501.12789
Authors: Simone Filice,Guy Horowitz,David Carmel,Zohar Karnin,Liane Lewin-Eytan,Yoelle Maarek
Affiliations: Technology Innovation Institute
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Evaluating Retrieval-Augmented Generation (RAG) systems, especially in domain-specific contexts, requires benchmarks that address the distinctive requirements of the applicative scenario. Since real data can be hard to obtain, a common strategy is to use LLM-based methods to generate synthetic data. Existing solutions are general purpose: given a document, they generate a question to build a QA pair. However, although the generated questions can be individually good, they are typically not diverse enough to reasonably cover the different ways real end-users can interact with the RAG system. We introduce here DataMorgana, a tool for generating highly customizable and diverse synthetic QA benchmarks tailored to RAG applications. DataMorgana enables detailed configurations of user and question categories and provides control over their distribution within the benchmark. It uses a lightweight two-stage process, ensuring efficiency and fast iterations, while generating benchmarks that reflect the expected traffic. We conduct a thorough line of experiments, showing quantitatively and qualitatively that DataMorgana surpasses existing tools and approaches in producing lexically, syntactically, and semantically diverse question sets across domain-specific and general-knowledge corpora. DataMorgana will be made available to selected teams in the research community, as first beta testers, in the context of the upcoming SIGIR’2025 LiveRAG challenge to be announced in early February 2025.

[NLP-22] Regularization Semi-supervision and Supervision for a Plausible Attention-Based Explanation

【Quick Read】: This paper addresses the plausibility of the attention maps produced by attention mechanisms in NLP tasks, i.e., whether the explanation helps regular people understand and accept the model output. Although attention maps are often offered as explanations of model output, studies show that attention weights in RNN encoders spread across input tokens and are therefore hardly plausible. The paper proposes three additional constraints on the learning objective to improve attention-map plausibility: (1) regularization to increase the sparsity of attention weights; (2) semi-supervision of the attention map by a heuristic; and (3) supervision by human annotation. Results show that all techniques improve plausibility to some degree, and that specific human-annotation instructions can have a negative effect on classification performance. Beyond the attention map, the experiments on text classification also show that, regardless of how the constraints bring gains, the contextualization layer plays a crucial role in finding the right space for plausible tokens.

Link: https://arxiv.org/abs/2501.12775
Authors: Duc Hau Nguyen,Cyrielle Mallart,Guillaume Gravier,Pascale Sébillot
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Attention mechanism is contributing to the majority of recent advances in machine learning for natural language processing. Additionally, it results in an attention map that shows the proportional influence of each input in its decision. Empirical studies postulate that attention maps can be provided as an explanation for model output. However, it is still questionable to ask whether this explanation helps regular people to understand and accept the model output (the plausibility of the explanation). Recent studies show that attention weights in the RNN encoders are hardly plausible because they spread on input tokens. We thus propose 3 additional constraints to the learning objective function to improve the plausibility of the attention map: regularization to increase the attention weight sparsity, semi-supervision to supervise the map by a heuristic and supervision by human annotation. Results show that all techniques can improve the attention map plausibility at some level. We also observe that specific instructions for human annotation might have a negative effect on classification performance. Beyond the attention map, the result of experiments on text classification tasks also shows that no matter how the constraint brings the gain, the contextualization layer plays a crucial role in finding the right space for finding plausible tokens.
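
The first constraint, a sparsity regularizer on attention weights, can be sketched by adding an entropy penalty to the task loss; the coefficient and the use of entropy (rather than another sparsity measure) are illustrative assumptions, not necessarily the paper's exact regularizer.

```python
import torch

def attention_entropy(attn, eps=1e-12):
    """Mean entropy of the attention distribution over input tokens.
    Low entropy = attention mass concentrated on few tokens (sparser map)."""
    return -(attn * (attn + eps).log()).sum(dim=-1).mean()

# toy batch of attention maps: (batch, seq) rows summing to 1
logits = torch.randn(8, 20, requires_grad=True)
attn = torch.softmax(logits, dim=-1)
task_loss = torch.tensor(0.7)           # placeholder classification loss
lam = 0.1                               # illustrative regularization weight

# penalizing entropy pushes the model toward sparse, more plausible maps
loss = task_loss + lam * attention_entropy(attn)
loss.backward()
print(loss.item())
```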

[NLP-23] LLMs as Repositories of Factual Knowledge: Limitations and Solutions

【Quick Read】: This paper studies the appropriateness of large language models (LLMs) as repositories of factual knowledge, particularly their accuracy and consistency on time-sensitive factual questions. LLMs' knowledge comes from data snapshots with different timestamps and media types (wikis, social media, etc.); such unstructured knowledge changes over time and is inconsistent and inaccurate across sources, so a model's knowledge about an entity may be perturbed during training over the sequence of snapshots or at inference time, degrading performance. The paper evaluates twenty-four state-of-the-art LLMs (closed, partially open, and fully open) for reliability and consistency when answering time-sensitive factual questions under prompt perturbations, and assesses the effectiveness of state-of-the-art methods for improving accuracy and consistency. The key solution is "ENtity-Aware Fine-tuning" (ENAF), a soft neurosymbolic approach that provides a structured representation of entities during fine-tuning to improve model performance.

Link: https://arxiv.org/abs/2501.12774
Authors: Seyed Mahed Mousavi,Simone Alghisi,Giuseppe Riccardi
Affiliations: Signals and Interactive Systems Lab, Department of Information Engineering and Computer Science, University of Trento, Italy
Categories: Computation and Language (cs.CL)
Comments:

Abstract:LLMs’ sources of knowledge are data snapshots containing factual information about entities collected at different timestamps and from different media types (e.g. wikis, social media, etc.). Such unstructured knowledge is subject to change due to updates through time from past to present. Equally important are the inconsistencies and inaccuracies occurring in different information sources. Consequently, the model’s knowledge about an entity may be perturbed while training over the sequence of snapshots or at inference time, resulting in inconsistent and inaccurate model performance. In this work, we study the appropriateness of Large Language Models (LLMs) as repositories of factual knowledge. We consider twenty-four state-of-the-art LLMs that are either closed-, partially (weights), or fully (weight and training data) open-source. We evaluate their reliability in responding to time-sensitive factual questions in terms of accuracy and consistency when prompts are perturbed. We further evaluate the effectiveness of state-of-the-art methods to improve LLMs’ accuracy and consistency. We then propose “ENtity-Aware Fine-tuning” (ENAF), a soft neurosymbolic approach aimed at providing a structured representation of entities during fine-tuning to improve the model’s performance.

[NLP-24] NExtLong: Toward Effective Long-Context Training without Long Documents

【Quick Read】: This paper addresses the challenge that large language models (LLMs) with extended context windows face due to the scarcity of long documents, which leaves long-range dependency modeling underdeveloped. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. The paper proposes NExtLong, a framework that synthesizes long-context data through Negative document Extension. The key idea is to decompose a document into multiple meta-chunks and extend the context by interleaving hard negative distractors retrieved from pretraining corpora, which forces the model to distinguish long-range dependent context from distracting content and thereby strengthens its long-range dependency modeling. Experiments show that NExtLong achieves significant gains on the HELMET and RULER benchmarks over existing long-context synthesis approaches and leading models trained on non-synthetic long documents, demonstrating its ability to reduce reliance on non-synthetic long documents.

Link: https://arxiv.org/abs/2501.12766
Authors: Chaochen Gao,Xing Wu,Zijia Lin,Debing Zhang,Songlin Hu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Xiaohongshu Inc; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Corresponding authors: Xing Wu and Songlin Hu

Abstract:Large language models (LLMs) with extended context windows have made significant strides yet remain a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce the long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models, which are trained on non-synthetic long documents. These findings highlight NExtLong’s ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
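
The data-synthesis recipe can be sketched as follows; `retrieve_hard_negatives` is a hypothetical stand-in for retrieval over a pretraining corpus (e.g., dense retrieval of similar-but-irrelevant chunks), and the chunk size and distractor count are illustrative.

```python
def nextlong_extend(document, retrieve_hard_negatives,
                    chunk_words=256, n_distractors=2):
    """Negative document extension sketch: split a document into meta-chunks
    and interleave hard negative distractor chunks between them, so a model
    trained on the result must track dependencies across distracting text.
    `retrieve_hard_negatives(chunk, k)` is a hypothetical retrieval call."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    extended = []
    for chunk in chunks:
        extended.append(chunk)
        # distractors similar to the chunk, but from unrelated documents
        extended.extend(retrieve_hard_negatives(chunk, k=n_distractors))
    return "\n".join(extended)

# usage sketch: fake retriever returning canned distractors
fake_retriever = lambda chunk, k: [f"[distractor for: {chunk[:20]}...]"] * k
print(nextlong_extend("word " * 600, fake_retriever)[:200])
```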

[NLP-25] EvidenceMap: Unleashing the Power of Small Language Models with Evidence Analysis for Biomedical Question Answering

【Quick Read】: This paper addresses the lack of explicit, multifaceted analysis of multi-source evidence in LLM-based question answering for the biomedical domain. Existing methods rely mainly on a model's internal reasoning or on injecting external knowledge, and do not emulate the way humans analyze evidence in depth and build logical connections when solving professional problems. The proposed generative QA framework, EvidenceMap, explicitly learns and incorporates evidence analysis with small language models (SLMs). The framework builds an evidence map for each question and fully utilizes an SLM to derive representations of the supportive evaluation, the logical correlation, and the summarization of the related evidence, which then facilitates an analysis-augmented generation with another SLM in an autoregressive way. Extensive experiments show that introducing this evidence-analysis learning process significantly outperforms larger models and popular LLM reasoning methods.

Link: https://arxiv.org/abs/2501.12746
Authors: Chang Zong,Jian Wan,Lei Zhang
Affiliations: School of Information and Electronic Engineering, Zhejiang University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures

Abstract:Current LLM-based approaches improve question answering performance by leveraging the internal reasoning abilities of models or incorporating external knowledge. However, when humans address professional problems, it is essential to explicitly analyze the multifaceted relationships from multiple pieces and diverse sources of evidence to achieve better answers. In this study, we propose a novel generative question answering framework for the biomedical domain, named EvidenceMap, which explicitly learns and incorporates evidence analysis with small language models (SLMs). The framework describes an evidence map for each question and fully utilizes an SLM to derive the representation of the supportive evaluation, the logical correlation, and the summarization of the related evidence, which facilitates an analysis-augmented generation with another SLM in an autoregressive way. Extensive experiments have shown that introducing an evidence analysis learning process can significantly outperform larger models and popular LLM reasoning methods.

[NLP-26] Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression ICASSP2025

【Quick Read】: This paper addresses how to improve user engagement during conversations with dialogue systems, both by improving individual dialogue responses and by improving dialogue impressions such as consistency, personality, and empathy throughout the entire dialogue. The key is to use reinforcement learning from AI feedback (RLAIF) based on large language models (LLMs) to align LLM-based dialogue models for such dialogue impressions. Specifically, supervised fine-tuning (SFT) of LLMs is used to prepare reward models corresponding to 12 metrics related to the impression of the entire dialogue for evaluating responses. Using the signals from these reward models as feedback, the dialogue model is tuned to improve dialogue impression. Automatic and human evaluations show that tuning with the reward models improves both the evaluation of individual metrics and the naturalness of the dialogue responses.

Link: https://arxiv.org/abs/2501.12698
Authors: Kai Yoshida,Masahiro Mizukami,Seiya Kawano,Canasai Kruengkrai,Hiroaki Sugiyama,Koichiro Yoshino
Affiliations: Nara Institute of Science and Technology, Japan; Guardian Robot Project, RIKEN, Japan; NTT Communication Science Laboratories, Japan; Institute of Science Tokyo, Japan
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICASSP 2025

Abstract:To improve user engagement during conversations with dialogue systems, we must improve individual dialogue responses and dialogue impressions such as consistency, personality, and empathy throughout the entire dialogue. While such dialogue systems have been developing rapidly with the help of large language models (LLMs), reinforcement learning from AI feedback (RLAIF) has attracted attention to align LLM-based dialogue models for such dialogue impressions. In RLAIF, a reward model based on another LLM is used to create a training signal for an LLM-based dialogue model using zero-shot/few-shot prompting techniques. However, evaluating an entire dialogue only by prompting LLMs is challenging. In this study, the supervised fine-tuning (SFT) of LLMs prepared reward models corresponding to 12 metrics related to the impression of the entire dialogue for evaluating dialogue responses. We tuned our dialogue models using the reward model signals as feedback to improve the impression of the system. The results of automatic and human evaluations showed that tuning the dialogue model using our reward model corresponding to dialogue impression improved the evaluation of individual metrics and the naturalness of the dialogue response.

[NLP-27] Extracting General-use Transformers for Low-resource Languages via Knowledge Distillation COLING2025

【Quick Read】: This paper addresses the trade-offs, notably between model size and efficiency, associated with using Massively Multilingual Transformers (MMTs) in low-resource settings. The proposed solution distills smaller, more efficient single-language transformers from MMTs via simple knowledge distillation. Using Tagalog as a case study, the authors show that these single-language models perform on par with strong baselines on a variety of benchmark tasks while being much more efficient. The paper also investigates additional steps during distillation that improve the soft-supervision of the target language, and provides analyses and ablations demonstrating the efficacy of the proposed method.

Link: https://arxiv.org/abs/2501.12660
Authors: Jan Christian Blaise Cruz,Alham Fikri Aji
Affiliations: MBZUAI
Categories: Computation and Language (cs.CL)
Comments: LoResLM Workshop @ COLING 2025

Abstract:In this paper, we propose the use of simple knowledge distillation to produce smaller and more efficient single-language transformers from Massively Multilingual Transformers (MMTs) to alleviate tradeoffs associated with the use of such in low-resource settings. Using Tagalog as a case study, we show that these smaller single-language models perform on-par with strong baselines in a variety of benchmark tasks in a much more efficient manner. Furthermore, we investigate additional steps during the distillation process that improves the soft-supervision of the target language, and provide a number of analyses and ablations to show the efficacy of the proposed method.
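
The core of a simple distillation setup like this can be sketched with the standard soft-label objective: the student matches the teacher's temperature-softened distribution, mixed with the hard-label loss. The temperature and mixing weight are illustrative hyperparameters; the paper's additional target-language supervision steps are not shown.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: KL divergence between
    temperature-softened teacher and student distributions, mixed with
    cross-entropy on the hard labels. T and alpha are illustrative."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# toy batch: 4 examples, 10-class output (e.g., token or label logits)
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels).item())
```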

[NLP-28] The potential – and the pitfalls – of using pre-trained language models as cognitive science theories

【Quick Read】: This paper examines the use of pre-trained language models (PLMs) in cognitive and developmental science, in particular how their performance can be aligned with human cognitive development. Although PLMs show correspondence to adult cognition across several domains, treating them as cognitive science theories faces many challenges, including differences in architectures, diverse training data modalities and scales, and limited model interpretability. The key contribution is to treat PLMs as cognitive science and developmental science models rather than engineering artifacts, and to review the assumptions researchers use to map measures of PLM performance onto measures of human performance. The paper also identifies potential pitfalls of this approach to understanding human thinking and enumerates criteria for using PLMs as credible accounts of cognition and cognitive development.

Link: https://arxiv.org/abs/2501.12651
Authors: Raj Sanjay Shah,Sashank Varma
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Many studies have evaluated the cognitive alignment of Pre-trained Language Models (PLMs), i.e., their correspondence to adult performance across a range of cognitive domains. Recently, the focus has expanded to the developmental alignment of these models: identifying phases during training where improvements in model performance track improvements in children’s thinking over development. However, there are many challenges to the use of PLMs as cognitive science theories, including different architectures, different training data modalities and scales, and limited model interpretability. In this paper, we distill lessons learned from treating PLMs, not as engineering artifacts but as cognitive science and developmental science models. We review assumptions used by researchers to map measures of PLM performance to measures of human performance. We identify potential pitfalls of this approach to understanding human thinking, and we end by enumerating criteria for using PLMs as credible accounts of cognition and cognitive development.

[NLP-29] Dynamics of Toxicity in Political Podcasts

【速读】: 该论文旨在解决数字媒体中日益增长的毒性(toxicity)问题,特别是在快速发展的播客(podcast)领域。通过分析美国30多个热门政治播客的转录数据,论文研究了毒性言论的出现和传播,重点关注播客转录中的对话链(conversation chains)结构。解决方案的关键包括:(1)创建了一个全面的转录和标注的政治播客数据集,并使用Google的Perspective API识别了数千个毒性实例;(2)揭示了大多数播客集数中至少包含一个毒性实例的令人担忧的趋势;(3)引入了毒性对话链的概念,并分析了其结构和语言特性,揭示了与愤怒和烦恼相关的长时间、重复模式、比喻性语言和情感线索;(4)识别了“want”、“like”和“know”等需求相关词汇作为毒性的前兆;(5)开发了基于标注变化点的预测模型,以预测毒性的转变。这些发现为播客毒性的研究提供了关键见解,并为未来实时监控和干预机制的开发奠定了基础,以促进这一有影响力媒体中的健康讨论。

链接: https://arxiv.org/abs/2501.12640
作者: Naquee Rizwan,Nayandeep Deb,Sarthak Roy,Vishwajeet Singh Solanki,Kiran Garimella,Animesh Mukherjee
机构: Indian Institute of Technology Kharagpur (印度理工学院卡拉格普尔); Rutgers School of Communication and Information (罗格斯大学传播与信息学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Toxicity in digital media poses significant challenges, yet little attention has been given to its dynamics within the rapidly growing medium of podcasts. This paper addresses this gap by analyzing political podcast data to study the emergence and propagation of toxicity, focusing on conversation chains, i.e., structured reply patterns within podcast transcripts. Leveraging state-of-the-art transcription models and advanced conversational analysis techniques, we systematically examine toxic discourse in over 30 popular political podcasts in the United States. Our key contributions include: (1) creating a comprehensive dataset of transcribed and diarized political podcasts, identifying thousands of toxic instances using Google’s Perspective API, (2) uncovering concerning trends where a majority of episodes contain at least one toxic instance, (3) introducing toxic conversation chains and analyzing their structural and linguistic properties, revealing characteristics such as longer durations, repetitive patterns, figurative language, and emotional cues tied to anger and annoyance, (4) identifying demand-related words like ‘want’, ‘like’, and ‘know’ as precursors to toxicity, and (5) developing predictive models to anticipate toxicity shifts based on annotated change points. Our findings provide critical insights into podcast toxicity and establish a foundation for future research on real-time monitoring and intervention mechanisms to foster healthier discourse in this influential medium.
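代码补充(示意):摘要第 (1) 点提到使用 Google 的 Perspective API 识别毒性实例。下面是对单条转录文本打分的一个最小调用示意;请求字段基于该 API 的公开接口,API_KEY 为需自行申请的占位符。

```python
import requests

API_KEY = "YOUR_API_KEY"  # 占位符:需自行申请 Perspective API 密钥
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """返回一段转录文本的毒性分数(0~1),分数越高越可能有毒。"""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```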

[NLP-30] Distillation Quantification for Large Language Models

【速读】: 该论文试图解决模型蒸馏(Model Distillation)过程中可能导致模型同质化(homogenization)的问题,以及如何系统量化蒸馏过程及其影响。模型蒸馏是一种将知识从大型语言模型(LLMs)转移到较小模型的技术,旨在创建资源高效且性能优异的模型。然而,过度的蒸馏可能导致模型多样性减少,进而影响其处理复杂或新颖任务的鲁棒性。论文提出的解决方案包括两个关键方面:(1) 通过识别身份认知矛盾(identity cognition contradictions)来评估模型在感知和表示身份相关信息时的差异;(2) 通过分析模型间的多粒度响应相似性(multi-granularity response similarities)来衡量同质化的程度。实验结果表明,知名闭源和开源LLMs通常表现出较高的蒸馏程度,而基础LLMs比对齐LLMs显示出更高的蒸馏程度。该框架旨在提高LLM数据蒸馏的透明度,并呼吁开发更具独立性和透明技术报告的LLMs,以增强其鲁棒性和安全性。

链接: https://arxiv.org/abs/2501.12619
作者: Sunbowen Lee,Junting Zhou,Chang Ao,Kaige Li,Xinrun Du,Sirui He,Jiaheng Liu,Min Yang,Zhoufutu Wen,Shiwen Ni
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Peking University (北京大学); 01.AI; SUSTech (南方科技大学); SUAT; Leibowitz AI
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Model distillation is a technique for transferring knowledge from large language models (LLMs) to smaller ones, aiming to create resource-efficient yet high-performing models. However, excessive distillation can lead to homogenization, reducing diversity among models and impairing their ability to robustly handle complex or novel tasks. These limitations underscore the need to systematically quantify the distillation process and its impact. In this work, we propose a framework to evaluate and quantify model distillation. Our method addresses two key aspects: (1) Identifying identity cognition contradictions to assess discrepancies in how models perceive and represent identity-related information, and (2) Analyzing multi-granularity response similarities across models to measure the extent of homogenization. Experimental results demonstrate two key insights: (1) Well-known closed-source and open-source LLMs usually exhibit high distillation degrees, except for Claude, Doubao, and Gemini. (2) Base LLMs show higher distillation degrees compared to aligned LLMs. By offering a systematic approach to improve the transparency of LLM data distillation, we call for LLMs with more independent development and more transparent technical reports to improve LLMs’ robustness and safety. The code and data are available under this https URL.
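代码补充(示意):论文的第二个关键维度是分析模型间的多粒度响应相似性,但未公开实现细节。下面用句向量余弦相似度给出一个响应级(粗粒度)的替代示意,编码器型号 all-MiniLM-L6-v2 为示意用假设。

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 示意用的通用句向量模型

def response_similarity(responses_a, responses_b):
    """对同一组提示下两个模型各自的回复,计算逐对余弦相似度的均值,
    作为模型同质化(蒸馏)程度的一个粗粒度代理指标。"""
    emb_a = encoder.encode(responses_a, convert_to_tensor=True)
    emb_b = encoder.encode(responses_b, convert_to_tensor=True)
    return util.cos_sim(emb_a, emb_b).diagonal().mean().item()
```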

[NLP-31] T2ISafety: Benchmark for Assessing Fairness, Toxicity and Privacy in Image Generation

【速读】: 该论文旨在解决文本到图像(Text-to-Image, T2I)模型在生成高质量图像时可能产生的安全性问题,包括生成有害、偏见或隐私内容的风险。目前,评估T2I模型安全性的研究仍处于早期阶段,许多关键风险尚未被充分探索。为解决这一问题,作者提出了T2ISafety,一个用于评估T2I模型安全性的基准测试,涵盖毒性(toxicity)、公平性(fairness)和偏见(bias)三个关键领域。解决方案的关键在于构建了一个详细的层次结构,包含12个任务和44个类别,并精心收集了70K个相关提示(prompts)。基于这一分类和提示集,作者创建了一个包含68K张手动标注图像的大规模T2I数据集,并训练了一个评估器,能够检测出以往工作中未能识别的关键风险,包括即使是超大规模专有模型(如GPTs)也无法正确检测的风险。通过这一方法,作者评估了12个著名的扩散模型,揭示了包括种族公平性、生成有害内容的倾向以及隐私保护方面的显著差异等问题。

链接: https://arxiv.org/abs/2501.12612
作者: Lijun Li,Zhelun Shi,Xuhao Hu,Bowen Dong,Yiran Qin,Xihui Liu,Lu Sheng,Jing Shao
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Beihang University (北京航空航天大学); Harbin Institute of Technology (哈尔滨工业大学); Chinese University of Hong Kong (Shenzhen) (香港中文大学(深圳)); Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released under this https URL.

[NLP-32] BLR-MoE: Boosted Language-Routing Mixture of Experts for Domain-Robust Multilingual E2E ASR ICASSP2025

【速读】: 该论文试图解决多语言自动语音识别(MASR)任务中的语言混淆问题,特别是在不匹配领域场景下,现有的混合专家(Mixture of Expert, MoE)架构(如LR-MoE)仍然面临这一问题。论文将语言混淆问题解耦为自注意力机制(self-attention)和路由器(router)中的混淆。为解决自注意力中的语言混淆,论文在LR-MoE基础上提出了注意力-MoE架构,将MoE不仅应用于前馈网络(FFN),还应用于自注意力机制。此外,为提高基于语言识别(LID)的路由器对语言混淆的鲁棒性,论文提出了专家剪枝(expert pruning)和路由器增强(router augmentation)方法。结合这些改进,论文提出了增强型语言路由MoE(BLR-MoE)架构,并在一个10,000小时的MASR数据集上验证了其有效性。

链接: https://arxiv.org/abs/2501.12602
作者: Guodong Ma,Wenxuan Wang,Lifeng Zhou,Yuting Yang,Yuke Li,Binbin Du
机构: Yidun AI Lab; Netease(网易)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to IEEE ICASSP 2025

点击查看摘要

Abstract:Recently, the Mixture of Expert (MoE) architecture, such as LR-MoE, has often been used to alleviate the impact of language confusion on the multilingual ASR (MASR) task. However, it still faces language confusion issues, especially in mismatched domain scenarios. In this paper, we decouple the language confusion in LR-MoE into confusion in self-attention and confusion in the router. To alleviate the language confusion in self-attention, building on LR-MoE, we propose applying an attention-MoE architecture for MASR. In our new architecture, MoE is utilized not only in the feed-forward network (FFN) but also in self-attention. In addition, to improve the robustness of the LID-based router against language confusion, we propose expert pruning and router augmentation methods. Combining the above, we obtain the boosted language-routing MoE (BLR-MoE) architecture. We verify the effectiveness of the proposed BLR-MoE on a 10,000-hour MASR dataset.
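代码补充(示意):为便于理解语言路由 MoE 的基本结构,下面给出一个最小 PyTorch 示意:LID 路由器为每帧特征产生语言后验,并据此加权各语言专家。此处仅示意 FFN 专家,论文的 attention-MoE 将同样机制用于自注意力;维度与专家数均为示意用假设,且未包含论文的专家剪枝与路由器增强。

```python
import torch
import torch.nn as nn

class LanguageRoutedMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, num_langs=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_langs)   # 基于 LID 的路由器
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_langs)                 # 每种语言一个 FFN 专家
        ])

    def forward(self, x):                             # x: (batch, time, d_model)
        lang_prob = self.router(x).softmax(dim=-1)    # 每帧的语言后验
        expert_out = torch.stack([e(x) for e in self.experts],
                                 dim=-2)              # (B, T, 专家数, D)
        # 软路由:按语言后验加权各专家输出;硬路由则改为取 argmax
        return (lang_prob.unsqueeze(-1) * expert_out).sum(dim=-2)
```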

[NLP-33] O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

【速读】: 该论文试图解决长思维推理(long-thought reasoning)大语言模型(LLMs)在推理过程中由于推理步骤冗长导致的推理时间显著增加的问题。尽管这种推理范式显著提升了模型的问题解决能力,但其推理开销较大,尤其是在处理复杂问题时,模型难以根据问题难度和推理冗余有效分配计算资源。为解决这一问题,论文提出了长度协调微调(Length-Harmonizing Fine-Tuning,O1-Pruner)方法,旨在在保持模型准确性的同时最小化推理开销。该方案的关键在于通过预采样估计模型的基线性能,并采用强化学习风格的微调方法,激励模型在准确性约束下生成更短的推理过程,从而实现高效且低冗余的推理。实验结果表明,O1-Pruner不仅显著降低了推理开销,还在多个数学推理基准测试中实现了更高的准确性。

链接: https://arxiv.org/abs/2501.12570
作者: Haotian Luo,Li Shen,Haiying He,Yibo Wang,Shiwei Liu,Wei Li,Naiqiang Tan,Xiaochun Cao,Dacheng Tao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Recently, long-thought reasoning LLMs, such as OpenAI’s O1, adopt extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model’s problem-solving abilities and has achieved promising results. However, the long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), aiming at minimizing reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM’s baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at this https URL
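代码补充(纯示意):O1-Pruner 的核心是"先预采样估计基线,再用 RL 风格微调在准确率约束下奖励更短的推理"。论文未给出奖励的具体形式,下面的奖励函数仅作说明,正确性项与长度项的组合方式及 alpha 权重均为假设。

```python
def length_harmonizing_reward(is_correct: bool, gen_len: int,
                              baseline_len: float, alpha: float = 1.0) -> float:
    """示意性奖励:正确性为先,在此基础上奖励相对基线更短的推理。
    baseline_len 由对原模型预采样估计得到;alpha 为长度项权重(假设值)。"""
    accuracy_term = 1.0 if is_correct else -1.0
    # 生成长度短于基线则为正奖励,长于基线则为负
    length_term = (baseline_len - gen_len) / max(baseline_len, 1.0)
    return accuracy_term + alpha * length_term
```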

[NLP-34] Understanding the LLM-ification of CHI: Unpacking the Impact of LLMs at CHI through a Systematic Literature Review

【速读】: 该论文试图解决大语言模型(LLMs)在人机交互(HCI)领域中的应用现状及其影响的问题。尽管LLMs被认为将彻底改变HCI领域,但目前对其在HCI中的实际应用情况缺乏系统性的理解。为此,论文通过对2020年至2024年间153篇CHI论文的系统性文献综述,填补了这一研究空白。关键解决方案包括对LLMs在HCI中的应用领域、角色、贡献类型以及局限性和风险进行分类和总结。研究发现,LLMs在10个不同的领域中得到了应用,主要通过实证研究和工具开发做出贡献。此外,作者们提出了关于LLMs在研究中有效性和可重复性的担忧,并呼吁未来研究应更多地关注开放模型的使用。论文还提出了改进HCI研究的建议,并为研究人员提供了评估LLMs相关工作的有效性和适用性的指导性问题。

链接: https://arxiv.org/abs/2501.12557
作者: Rock Yuren Pang,Hope Schroeder,Kynnedy Simone Smith,Solon Barocas,Ziang Xiao,Emily Tseng,Danielle Bragg
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: This is a preprint version of the paper conditionally accepted to CHI’25

点击查看摘要

Abstract:Large language models (LLMs) have been positioned to revolutionize HCI, by reshaping not only the interfaces, design patterns, and sociotechnical systems that we study, but also the research practices we use. To date, however, there has been little understanding of LLMs’ uptake in HCI. We address this gap via a systematic literature review of 153 CHI papers from 2020-24 that engage with LLMs. We taxonomize: (1) domains where LLMs are applied; (2) roles of LLMs in HCI projects; (3) contribution types; and (4) acknowledged limitations and risks. We find LLM work in 10 diverse domains, primarily via empirical and artifact contributions. Authors use LLMs in five distinct roles, including as research tools or simulated users. Still, authors often raise validity and reproducibility concerns, and overwhelmingly study closed models. We outline opportunities to improve HCI research with and on LLMs, and provide guiding questions for researchers to consider the validity and appropriateness of LLM-related work.

[NLP-35] Human-like conceptual representations emerge from language prediction

【速读】: 该论文试图解决的核心问题是如何在人类认知中表示和组织概念(concepts),这一问题对于理解人类认知的本质至关重要。论文通过重新定义经典的反向词典任务(reverse dictionary task),模拟了人类在上下文中的概念推理过程,并研究了在大语言模型(LLMs)中人类类似概念表示的涌现。关键解决方案在于利用LLMs从定义性描述中推断概念,并构建出趋近于共享的、上下文无关的表示空间。这些表示不仅有效预测了人类的行为判断,还与人类大脑中的神经活动模式高度一致,从而提供了生物学上的合理性证据。研究结果表明,即使没有现实世界的物理基础,人类类似的概念表示和组织也可以从语言预测中自然涌现。这一发现支持了LLMs作为理解复杂人类认知的有价值工具的观点,并为人工智能与人类智能的更好对齐铺平了道路。

链接: https://arxiv.org/abs/2501.12547
作者: Ningyu Xu,Qi Zhang,Chao Du,Qiang Luo,Xipeng Qiu,Xuanjing Huang,Menghan Zhang
机构: Fudan University(复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) provide a new opportunity to address the long-standing question of how concepts are represented and organized in the mind, which is central to unravelling the nature of human cognition. Here, we reframed the classic reverse dictionary task to simulate human concept inference in context and investigated the emergence of human-like conceptual representations within LLMs. We found that LLMs were able to infer concepts from definitional descriptions and construct representation spaces that converge towards a shared, context-independent structure. These representations effectively predicted human behavioural judgments and aligned well with neural activity patterns in the human brain, offering evidence for biological plausibility. These findings demonstrate that human-like conceptual representations and organization can naturally emerge from language prediction, even without real-world grounding. Our work supports the view that LLMs serve as valuable tools for understanding complex human cognition and paves the way for better alignment between artificial and human intelligence.

[NLP-36] Comparative Approaches to Sentiment Analysis Using Datasets in Major European and Arabic Languages

【速读】: 该论文探讨了基于Transformer的模型(如BERT、mBERT和XLM-R)在多语言情感分析中的应用,特别是在处理具有复杂形态结构的语言时的表现。论文的核心问题是如何提高在多语言环境下,尤其是形态复杂语言中的情感分类准确性。解决方案的关键在于识别出XLM-R模型在形态复杂语言中的优越适应性,并通过微调策略(fine-tuning strategies)显著提升情感分类的准确性,尤其是在资源较少的语言中。研究结果表明,XLM-R在这些语言中的准确率超过了88%,突显了其在多语言情感分析中的潜力。

链接: https://arxiv.org/abs/2501.12540
作者: Mikhail Krasitskii,Olga Kolesnikova,Liliana Chanona Hernandez,Grigori Sidorov,Alexander Gelbukh
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11th International Conference on Advances in Computer Science and Information Technology (ACSTY 2025)

点击查看摘要

Abstract:This study explores transformer-based models such as BERT, mBERT, and XLM-R for multilingual sentiment analysis across diverse linguistic structures. Key contributions include the identification of XLM-R’s superior adaptability to morphologically complex languages, achieving accuracy levels above 88%. The work highlights fine-tuning strategies and emphasizes their significance for improving sentiment classification in underrepresented languages.
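代码补充(示意):作为参考,下面给出用 Hugging Face transformers 加载 XLM-R 做情感分类的最小示意;三分类标签数与多语言示例文本均为假设,分类头需在标注数据上微调后才有意义。

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)  # 假设三分类:消极/中性/积极

texts = ["Das Produkt ist ausgezeichnet!",   # 德语:正面
         "الخدمة سيئة للغاية"]                 # 阿拉伯语:负面
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    pred = model(**batch).logits.argmax(dim=-1)
print(pred)  # 微调前分类头为随机初始化,输出无意义,仅示意调用方式
```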

[NLP-37] Compositional Instruction Following with Language Models and Reinforcement Learning

【速读】: 该论文试图解决在强化学习(Reinforcement Learning)与语言基础(Language Grounding)结合时,智能体(agent)在探索环境的同时需要学习多个语言条件任务(language-conditioned tasks)的挑战。为了解决这一问题,作者提出了一种新方法:组合式强化学习语言智能体(Compositionally-Enabled Reinforcement Learning Language Agent, CERLLA)。该方法的关键在于通过利用组合式策略表示(compositional policy representations)和基于强化学习与上下文学习(in-context learning)训练的语义解析器(semantic parser),显著降低了语言指定任务的样本复杂度(sample complexity)。实验结果表明,CERLLA 在 162 个设计用于测试组合泛化(compositional generalization)的任务中,样本复杂度显著优于非组合基线(non-compositional baseline),并且在更少的步骤内达到更高的成功率,最终达到了 92% 的成功率,接近预言策略(oracle policy)的上限性能,而基线方法在相同环境步骤下仅达到 80% 的成功率。

链接: https://arxiv.org/abs/2501.12539
作者: Vanya Cohen,Geraud Nangue Tasse,Nakul Gopalan,Steven James,Matthew Gombolay,Ray Mooney,Benjamin Rosman
机构: The University of Texas at Austin(德克萨斯大学奥斯汀分校); University of the Witwatersrand(威特沃特斯兰德大学); Arizona State University(亚利桑那州立大学); Georgia Institute of Technology(乔治亚理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: TMLR 2024

点击查看摘要

Abstract:Combining reinforcement learning with language grounding is challenging as the agent needs to explore the environment while simultaneously learning multiple language-conditioned tasks. To address this, we introduce a novel method: the compositionally-enabled reinforcement learning language agent (CERLLA). Our method reduces the sample complexity of tasks specified with language by leveraging compositional policy representations and a semantic parser trained using reinforcement learning and in-context learning. We evaluate our approach in an environment requiring function approximation and demonstrate compositional generalization to novel tasks. Our method significantly outperforms the previous best non-compositional baseline in terms of sample complexity on 162 tasks designed to test compositional generalization. Our model attains a higher success rate and learns in fewer steps than the non-compositional baseline. It reaches a success rate equal to an oracle policy’s upper-bound performance of 92%. With the same number of environment steps, the baseline only reaches a success rate of 80%.

[NLP-38] Academic Case Reports Lack Diversity: Assessing the Presence and Diversity of Sociodemographic and Behavioral Factors related with Post COVID-19 Condition

【速读】: 该论文旨在解决COVID-19后遗症(Post COVID-19 Condition, PCC)在脆弱人群中的流行率、差异性和症状变化问题,特别是如何将健康的社会决定因素(Social Determinants of Health, SDOH)整合到PCC研究中,以改善护理并解决交叉不平等问题。解决方案的关键在于利用自然语言处理(NLP)技术,通过构建PCC病例报告语料库(PCC Case Report Corpus),并结合命名实体识别(NER)、自然语言推理(NLI)、三元组分析和频率分析等技术,提取和分析SDOH相关实体。研究采用了预训练的NER模型、人工审查和数据增强来提高实体类型的质量、多样性和代表性。实验表明,经过微调的BERT模型在处理不同句子结构和稀疏类别时优于传统的基于RNN的模型。通过探索性分析,研究揭示了实体丰富度的变异性,并识别了高频共现的实体组合,如年龄、性别和病情。NLI分析进一步揭示了某些属性(如“经历暴力或虐待”和“有医疗保险”)的高蕴含率,以及其他属性(如“女性身份”、“已婚”和“患有绝症”)的高矛盾率。

链接: https://arxiv.org/abs/2501.12538
作者: Juan Andres Medina Florez,Shaina Raza,Rashida Lynn,Zahra Shakeri,Brendan T. Smith,Elham Dolatabadi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding the prevalence, disparities, and symptom variations of Post COVID-19 Condition (PCC) for vulnerable populations is crucial to improving care and addressing intersecting inequities. This study aims to develop a comprehensive framework for integrating social determinants of health (SDOH) into PCC research by leveraging NLP techniques to analyze disparities and variations in SDOH representation within PCC case reports. Following construction of a PCC Case Report Corpus, comprising over 7,000 case reports from the LitCOVID repository, a subset of 709 reports were annotated with 26 core SDOH-related entity types using pre-trained named entity recognition (NER) models, human review, and data augmentation to improve quality, diversity and representation of entity types. An NLP pipeline integrating NER, natural language inference (NLI), trigram and frequency analyses was developed to extract and analyze these entities. Both encoder-only transformer models and RNN-based models were assessed for the NER objective. Fine-tuned encoder-only BERT models outperformed traditional RNN-based models in generalizability to distinct sentence structures and greater class sparsity. Exploratory analysis revealed variability in entity richness, with prevalent entities like condition, age, and access to care, and underrepresentation of sensitive categories like race and housing status. Trigram analysis highlighted frequent co-occurrences among entities, including age, gender, and condition. The NLI objective (entailment and contradiction analysis) showed attributes like “Experienced violence or abuse” and “Has medical insurance” had high entailment rates (82.4%-80.3%), while attributes such as “Is female-identifying,” “Is married,” and “Has a terminal condition” exhibited high contradiction rates (70.8%-98.5%).
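代码补充(示意):摘要中的 NER 环节可用现成的 token 分类管线快速搭出雏形。下面的模型名 dslim/bert-base-NER 仅为通用公开模型示例(并非论文所用的 26 类 SDOH 实体模型),示例句子亦为虚构;实际 SDOH 实体类型需在标注语料上微调。

```python
from transformers import pipeline

ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # 将子词合并为完整实体
print(ner("A 45-year-old female patient without medical insurance "
          "was admitted with persistent fatigue."))
```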

[NLP-39] Enhancing Privacy in the Early Detection of Sexual Predators Through Federated Learning and Differential Privacy AAAI

【速读】: 该论文旨在解决由COVID-19大流行导致的屏幕时间增加和隔离引发的在线诱骗(online grooming)问题,即捕食者通过策略引诱儿童进行性剥削的行为。传统的检测方法通常涉及通过集中训练的模型或向全球服务器发送私人对话来监控私人对话,这引发了隐私问题。本文提出了一种隐私保护的管道,用于早期检测性捕食者。其解决方案的关键在于利用联邦学习(federated learning)和差分隐私(differential privacy)技术,以在保护儿童隐私的同时创建更安全的在线环境。通过广泛的真实数据评估,论文证明了隐私和实用性可以共存,尽管实用性略有下降。

链接: https://arxiv.org/abs/2501.12537
作者: Khaoula Chehbouni,Martine De Cock,Gilles Caporossi,Afaf Taik,Reihaneh Rabbany,Golnoosh Farnadi
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to AAAI-Social Impact Track - Oral

点击查看摘要

Abstract:The increased screen time and isolation caused by the COVID-19 pandemic have led to a significant surge in cases of online grooming, which is the use of strategies by predators to lure children into sexual exploitation. Previous efforts to detect grooming in industry and academia have involved accessing and monitoring private conversations through centrally-trained models or sending private conversations to a global server. In this work, we implement a privacy-preserving pipeline for the early detection of sexual predators. We leverage federated learning and differential privacy in order to create safer online spaces for children while respecting their privacy. We investigate various privacy-preserving implementations and discuss their benefits and shortcomings. Our extensive evaluation using real-world data proves that privacy and utility can coexist with only a slight reduction in utility.
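代码补充(背景示意):论文结合联邦学习与差分隐私保护儿童隐私。下面给出差分隐私训练中常见的 DP-SGD 核心步骤("逐样本梯度裁剪 + 高斯噪声聚合")的最小实现;这是对差分隐私机制的一种常见实例化,超参数均为假设,未必与论文所用实现一致。

```python
import torch

def dp_sgd_aggregate(per_sample_grads, clip_norm=1.0, noise_mult=1.0):
    """DP-SGD 核心步骤示意:先裁剪每个样本的梯度范数,再在聚合时加噪。"""
    clipped = []
    for g in per_sample_grads:                        # g: 单个样本的梯度张量
        factor = (clip_norm / (g.norm() + 1e-6)).clamp(max=1.0)
        clipped.append(g * factor)
    agg = torch.stack(clipped).mean(dim=0)
    noise = torch.randn_like(agg) * noise_mult * clip_norm / len(clipped)
    return agg + noise                                # 带噪的聚合梯度
```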

[NLP-40] The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

【速读】: 该论文旨在解决大型语言模型(LLMs)在训练和推理过程中计算资源需求过高的问题。通过稀疏预训练(sparse pre-training),即在预训练阶段结合剪枝(pruning)和训练,论文提出了一种简化且高效的解决方案。关键发现包括:在总训练计算量的25%时开始剪枝,并在75%时结束,能够实现接近最优的最终评估损失(evaluation loss)。此外,论文提出了一种新的缩放定律(scaling law),该定律基于预训练期间的平均参数数量,能够准确建模稀疏和密集预训练LLMs的评估损失。研究结果表明,稀疏预训练在相同计算预算下能够达到与密集预训练相同的模型质量,同时显著减少模型大小,从而在推理阶段节省大量计算资源。

链接: https://arxiv.org/abs/2501.12486
作者: Tian Jin,Ahmed Imtiaz Humayun,Utku Evci,Suvinay Subramanian,Amir Yazdanbakhsh,Dan Alistarh,Gintare Karolina Dziugaite
机构: MIT CSAIL(麻省理工学院计算机科学与人工智能实验室); Rice University(莱斯大学); Google Research(谷歌研究院); Google DeepMind(谷歌DeepMind); Google(谷歌); IST Austria(奥地利科学技术研究所)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 17 pages

点击查看摘要

Abstract:Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training–which combines pruning and pre-training into a single phase–provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.
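代码补充(示意):论文的修正缩放定律以"预训练期间的平均参数量"替代固定参数量。下面按摘要给出的近优调度(总训练计算量 25% 处开始剪枝、75% 处结束)估算平均参数量;其中稀疏度线性爬升为示意用假设,论文可能采用其他调度形状。

```python
import numpy as np

def avg_param_count(n_dense, final_sparsity, steps=10_000,
                    start_frac=0.25, end_frac=0.75):
    """估算稀疏预训练期间的平均参数量,用于代入修正后的缩放定律。"""
    t = np.linspace(0, 1, steps)                 # 归一化训练进度
    ramp = np.clip((t - start_frac) / (end_frac - start_frac), 0, 1)
    sparsity = ramp * final_sparsity             # 假设稀疏度线性爬升
    return float(np.mean(n_dense * (1 - sparsity)))

# 例:10 亿参数模型、最终稀疏度 50%,平均参数量约为 7.5 亿
print(avg_param_count(1e9, 0.5))
```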

[NLP-41] Owls are wise and foxes are unfaithful: Uncovering animal stereotypes in vision-language models

【速读】: 该论文旨在探讨动物刻板印象(animal stereotypes)在视觉-语言模型(vision-language models)中的表现,特别是在图像生成任务中是否延续了这些刻板印象。研究通过特定的提示词(prompts)来测试DALL-E模型是否生成了与“猫头鹰象征智慧”、“狐狸象征不忠”等文化偏见一致的图像。研究结果表明,模型在生成图像时确实存在显著的刻板印象,反映了文化偏见。该研究首次系统性地考察了视觉-语言模型中的动物刻板印象问题,揭示了AI生成视觉内容中一个关键但尚未充分探索的偏见维度。解决方案的关键在于通过系统性的实验设计和分析,揭示模型在生成图像时如何受到文化刻板印象的影响,并为进一步减少AI模型中的偏见提供了实证基础。

链接: https://arxiv.org/abs/2501.12433
作者: Tabinda Aman,Mohammad Nadeem,Shahab Saquib Sohail,Mohammad Anas,Erik Cambria
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Animal stereotypes are deeply embedded in human culture and language. They often shape our perceptions and expectations of various species. Our study investigates how animal stereotypes manifest in vision-language models during the task of image generation. Through targeted prompts, we explore whether DALL-E perpetuates stereotypical representations of animals, such as “owls as wise,” “foxes as unfaithful,” etc. Our findings reveal significant stereotyped instances where the model consistently generates images aligned with cultural biases. The current work is the first of its kind to examine animal stereotyping in vision-language models systematically and to highlight a critical yet underexplored dimension of bias in AI-generated visual content.

[NLP-42] Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation

【速读】: 该论文旨在解决当前大型语言模型(LLMs)在执行复杂现实任务时存在的感知范围有限和任务规划能力不足的问题。尽管现有的主流方法(如CoT/ReAct)通过逐步调用工具来与外部环境交互,但这些方法在任务规划和并行处理方面存在局限性。为解决这些问题,论文提出了一种新颖的并行工具调用范式——DTA-Llama(Divide-Then-Aggregate Llama)。其关键解决方案包括:首先,将传统的树状工具搜索路径转化为有向无环图(DAG)结构,生成高质量的并行工具调用数据集;其次,训练DTA-Llama模型以学习将当前任务迭代划分为多个并行工具调用子任务,并聚合调用结果以决定下一步行动;最后,引入基于进程/线程机制的高效推理框架,以在实际任务中应用DTA-Llama。实验结果表明,该方法显著提升了任务性能,同时减少了令牌消耗和推理时间。

链接: https://arxiv.org/abs/2501.12432
作者: Dongsheng Zhu,Weixian Shi,Zhengliang Shi,Zhaochun Ren,Shuaiqiang Wang,Lingyong Yan,Dawei Yin
机构: Baidu Inc.(百度), Beijing, China; Shandong University(山东大学), Qingdao, China; Leiden University(莱顿大学), Leiden, The Netherlands
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although current Large Language Models (LLMs) exhibit impressive capabilities, performing complex real-world tasks still requires tool learning. Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to interact with external environments, but they are limited in perceptual scope and lack adequate task-planning capability. To address these limitations, other studies introduce the Depth-First Search-based Decision Tree (DFSDT), which still suffers from high computational cost. In this paper, we introduce a novel parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama). First, we transform traditional tree-based tool search paths into a Directed Acyclic Graph (DAG) structure, generating a high-quality parallel tool invocation dataset. The DTA-Llama is then trained on the dataset to learn to iteratively divide the current task into several parallel tool invocation sub-tasks and aggregate the invocation results to decide the next actions. Furthermore, we introduce an efficient inference framework inspired by the Process/Threads mechanism when applying the DTA-Llama to practical tasks. Experimental results show that our approach substantially enhances task performance while reducing token consumption and inference time. Llama2-7B, using our method, is comparable to the official parallel function calling method of GPT-3.5. The relevant code, dataset, and model weights are available at this https URL

[NLP-43] Modality Interactive Mixture-of-Experts for Fake News Detection

【速读】: 该论文试图解决多模态(multimodal)环境下虚假新闻检测的挑战,特别是在文本和图像结合的情况下,现有方法往往忽视模态之间复杂的交互作用。这些交互可能表现为互补、矛盾或独立影响,从而增加了检测的难度。为解决这一问题,论文提出了一种名为“Modality Interactive Mixture-of-Experts for Fake News Detection (MIMoE-FND)”的新型分层混合专家框架。该框架通过显式建模模态交互,利用交互门控机制(interaction gating mechanism)来评估单模态预测一致性(unimodal prediction agreement)和语义对齐(semantic alignment)两个关键方面。MIMoE-FND的分层结构允许针对不同的融合场景设计独特的学习路径,从而适应每种模态交互的独特特性。通过定制化的融合策略,MIMoE-FND在多模态虚假新闻检测中提供了更鲁棒和细致的方法,显著提升了检测的准确性和可解释性。

链接: https://arxiv.org/abs/2501.12431
作者: Yifan Liu,Yaokun Liu,Zelin Li,Ruichen Yao,Yang Zhang,Dong Wang
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by the Proceedings of the ACM Web Conference 2025

点击查看摘要

Abstract:The proliferation of fake news on social media platforms disproportionately impacts vulnerable populations, eroding trust, exacerbating inequality, and amplifying harmful narratives. Detecting fake news in multimodal contexts – where deceptive content combines text and images – is particularly challenging due to the nuanced interplay between modalities. Existing multimodal fake news detection methods often emphasize cross-modal consistency but ignore the complex interactions between text and visual elements, which may complement, contradict, or independently influence the predicted veracity of a post. To address these challenges, we present Modality Interactive Mixture-of-Experts for Fake News Detection (MIMoE-FND), a novel hierarchical Mixture-of-Experts framework designed to enhance multimodal fake news detection by explicitly modeling modality interactions through an interaction gating mechanism. Our approach models modality interactions by evaluating two key aspects of modality interactions: unimodal prediction agreement and semantic alignment. The hierarchical structure of MIMoE-FND allows for distinct learning pathways tailored to different fusion scenarios, adapting to the unique characteristics of each modality interaction. By tailoring fusion strategies to diverse modality interaction scenarios, MIMoE-FND provides a more robust and nuanced approach to multimodal fake news detection. We evaluate our approach on three real-world benchmarks spanning two languages, demonstrating its superior performance compared to state-of-the-art methods. By enhancing the accuracy and interpretability of fake news detection, MIMoE-FND offers a promising tool to mitigate the spread of misinformation, with the potential to better safeguard vulnerable communities against its harmful effects.
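代码补充(示意):MIMoE-FND 的交互门控依据两个信号分配融合专家:单模态预测一致性与图文语义对齐。下面是该门控思路的一个最小 PyTorch 示意,网络结构、维度与专家数均为示意用假设,并非论文的确切实现。

```python
import torch
import torch.nn as nn

class InteractionGate(nn.Module):
    def __init__(self, num_experts=3):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2, 16), nn.ReLU(),
                                  nn.Linear(16, num_experts))

    def forward(self, p_text, p_image, emb_text, emb_image):
        # p_text / p_image: (batch, 1) 的单模态"虚假"概率
        agreement = 1.0 - (p_text - p_image).abs()         # 单模态预测一致性
        alignment = nn.functional.cosine_similarity(
            emb_text, emb_image, dim=-1).unsqueeze(-1)     # 图文语义对齐
        feats = torch.cat([agreement, alignment], dim=-1)  # (batch, 2)
        return self.gate(feats).softmax(dim=-1)            # 各融合专家的权重
```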

[NLP-44] Scopes of Alignment AAAI2025

【速读】: 该论文试图解决当前人工智能对齐(AI alignment)研究中过于局限于通用价值观(如帮助性、无害性和诚实性)的问题。作者认为,这种对齐方式过于狭隘,无法充分满足不同应用场景的需求。为此,论文提出了三个关键维度来扩展对齐的范畴:能力(competence)、时效性(transience)和受众(audience)。能力指模型必须具备的知识、技能或行为,以满足其预期用途;时效性涉及模型在不同使用场景下的语义或情景适应性;受众则关注模型服务的对象范围,包括大众、公众、小群体或个体。通过这些维度,论文旨在为超越现有对齐概念的技术和工作流程提供框架。

链接: https://arxiv.org/abs/2501.12405
作者: Kush R. Varshney,Zahra Ashktorab,Djallel Bouneffouf,Matthew Riemer,Justin D. Weisz
机构: IBM Research (IBM研究院)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: The 2nd International Workshop on AI Governance (AIGOV) held in conjunction with AAAI 2025

点击查看摘要

Abstract:Much of the research focus on AI alignment seeks to align large language models and other foundation models to the context-less and generic values of helpfulness, harmlessness, and honesty. Frontier model providers also strive to align their models with these values. In this paper, we motivate why we need to move beyond such a limited conception and propose three dimensions for doing so. The first scope of alignment is competence: knowledge, skills, or behaviors the model must possess to be useful for its intended purpose. The second scope of alignment is transience: either semantic or episodic depending on the context of use. The third scope of alignment is audience: either mass, public, small-group, or dyadic. At the end of the paper, we use the proposed framework to position some technologies and workflows that go beyond prevailing notions of alignment.

[NLP-45] FinSphere: A Conversational Stock Analysis Agent Equipped with Quantitative Tools based on Real-Time Database

【速读】: 该论文试图解决当前金融领域大语言模型(LLMs)在股票分析中的两个关键局限性:一是缺乏深度分析能力,导致无法生成专业级的洞察;二是缺乏客观的评估指标来衡量股票分析报告的质量。为解决这些问题,论文提出了FinSphere,一个对话式股票分析代理,并引入了三个主要贡献:(1)Stocksis,一个由行业专家策划的数据集,用于增强LLMs的股票分析能力;(2)AnalyScore,一个系统化的评估框架,用于评估股票分析报告的质量;(3)FinSphere,一个能够根据用户查询生成高质量股票分析报告的AI代理。实验表明,FinSphere在分析质量和实际应用性方面均优于通用和特定领域的LLMs,以及现有的基于代理的系统,即使这些系统增强了实时数据访问和少样本指导。通过整合实时数据流、量化工具和指令调优的LLM,该框架显著提升了股票分析的专业性和实用性。

链接: https://arxiv.org/abs/2501.12399
作者: Shijie Han,Changhai Zhou,Yiqing Shen,Tianning Sun,Yuhua Zhou,Xiaoxia Wang,Zhixiao Yang,Jingshu Zhang,Hongguang Li
机构: JF SmartInvest Holdings Ltd; Columbia University(哥伦比亚大学); Fudan University(复旦大学); Johns Hopkins University(约翰霍普金斯大学); University of Nottingham-Ningbo(宁波诺丁汉大学); Zhejiang University(浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Computational Finance (q-fin.CP)
备注:

点击查看摘要

Abstract:Current financial Large Language Models (LLMs) struggle with two critical limitations: a lack of depth in stock analysis, which impedes their ability to generate professional-grade insights, and the absence of objective evaluation metrics to assess the quality of stock analysis reports. To address these challenges, this paper introduces FinSphere, a conversational stock analysis agent, along with three major contributions: (1) Stocksis, a dataset curated by industry experts to enhance LLMs’ stock analysis capabilities, (2) AnalyScore, a systematic evaluation framework for assessing stock analysis quality, and (3) FinSphere, an AI agent that can generate high-quality stock analysis reports in response to user queries. Experiments demonstrate that FinSphere achieves superior performance compared to both general and domain-specific LLMs, as well as existing agent-based systems, even when they are enhanced with real-time data access and few-shot guidance. The integrated framework, which combines real-time data feeds, quantitative tools, and an instruction-tuned LLM, yields substantial improvements in both analytical quality and practical applicability for real-world stock analysis.

[NLP-46] Explainable Lane Change Prediction for Near-Crash Scenarios Using Knowledge Graph Embeddings and Retrieval Augmented Generation

【速读】: 该论文旨在解决预测高风险车道变换(risky lane changes)的问题,特别是在突发或危险情况下,这类行为是导致交通事故的重要原因。现有的研究主要集中于预测安全车道变换,且事故数据集通常仅基于图像,缺乏全面的传感器数据。为此,作者提出了一个基于CRASH数据集(专门用于高风险车道变换的自有数据集)和HighD数据集(用于安全车道变换)的解决方案。关键方法包括利用知识图谱(KG, Knowledge Graph)和贝叶斯推断(Bayesian inference)结合语言上下文信息来预测车道变换行为,从而增强模型的解释性和透明度。该模型在高风险车道变换预测中达到了91.5%的F1分数,并在CARLA模拟器中验证了其有效性,能够提前四秒预测突发车道变换,为自动驾驶车辆提供更多时间规划安全反应。此外,作者还使用检索增强生成(RAG, Retrieval-Augmented Generation)技术为预测结果提供清晰的自然语言解释,进一步提升了模型的可解释性。

链接: https://arxiv.org/abs/2501.11560
作者: M. Manzour,A. Ballardini,R. Izquierdo,M. Á. Sotelo
机构: Department of Computer Engineering, University of Alcalá, Madrid, Spain (计算机工程系,阿尔卡拉大学,马德里,西班牙)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Lane-changing maneuvers, particularly those executed abruptly or in risky situations, are a significant cause of road traffic accidents. However, current research mainly focuses on predicting safe lane changes. Furthermore, existing accident datasets are often based on images only and lack comprehensive sensory data. In this work, we focus on predicting risky lane changes using the CRASH dataset (our own collected dataset specifically for risky lane changes), and safe lane changes (using the HighD dataset). Then, we leverage knowledge graphs (KG) and Bayesian inference to predict these maneuvers using linguistic contextual information, enhancing the model’s interpretability and transparency. The model achieved a 91.5% F1-score with anticipation time extending to four seconds for risky lane changes, and a 90.0% F1-score for predicting safe lane changes with the same anticipation time. We validate our model by integrating it into a vehicle within the CARLA simulator in scenarios that involve risky lane changes. The model managed to anticipate sudden lane changes, thus providing automated vehicles with further time to plan and execute appropriate safe reactions. Finally, to enhance the explainability of our model, we utilize retrieval-augmented generation (RAG) to provide clear and natural language explanations for the given prediction.
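代码补充(数值示意):摘要提到用贝叶斯推断结合语言化上下文证据预测变道风险。下面是朴素贝叶斯式后验更新的一个示意;先验、似然数值与"证据条件独立"均为示意用假设,实际似然由知识图谱嵌入学习得到。

```python
def posterior_risky(prior_risky, likelihoods):
    """给定先验与各证据在(高风险, 安全)两类假设下的似然,返回后验概率。"""
    p_risky, p_safe = prior_risky, 1.0 - prior_risky
    for l_risky, l_safe in likelihoods:   # 假设各证据条件独立
        p_risky *= l_risky
        p_safe *= l_safe
    return p_risky / (p_risky + p_safe)

# 例:先验 0.2,两条证据(如"急减速"、"近距离跟车")均更支持高风险假设
print(posterior_risky(0.2, [(0.8, 0.3), (0.7, 0.4)]))  # ≈ 0.54
```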

[NLP-47] LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration

【速读】: 该论文旨在解决GraphRAG(图增强生成式问答系统)在现有研究中缺乏模块化工作流分析、系统性解决方案框架和深入实证研究的问题。GraphRAG通过将知识图谱(knowledge graphs)与大型语言模型(LLMs)结合,提升了推理准确性和上下文相关性,但其应用仍面临上述挑战。为此,作者提出了LEGO-GraphRAG,一个模块化框架,其关键解决方案包括:1)对GraphRAG工作流进行细粒度分解,2)系统分类现有技术和已实现的GraphRAG实例,3)支持创建新的GraphRAG实例。该框架通过在大规模真实世界图谱和多样化查询集上进行全面实证研究,揭示了在推理质量、运行时效率以及计算资源(如token或GPU成本)之间取得平衡的关键见解,为构建更先进的GraphRAG系统提供了重要支持。

链接: https://arxiv.org/abs/2411.05844
作者: Yukun Cao,Zengyi Gao,Zhiyang Li,Xike Xie,Kevin Zhou,Jianliang Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:GraphRAG integrates (knowledge) graphs with large language models (LLMs) to improve reasoning accuracy and contextual relevance. Despite its promising applications and strong relevance to multiple research communities, such as databases and natural language processing, GraphRAG currently lacks modular workflow analysis, systematic solution frameworks, and insightful empirical studies. To bridge these gaps, we propose LEGO-GraphRAG, a modular framework that enables: 1) fine-grained decomposition of the GraphRAG workflow, 2) systematic classification of existing techniques and implemented GraphRAG instances, and 3) creation of new GraphRAG instances. Our framework facilitates comprehensive empirical studies of GraphRAG on large-scale real-world graphs and diverse query sets, revealing insights into balancing reasoning quality, runtime efficiency, and token or GPU cost, that are essential for building advanced GraphRAG systems.

计算机视觉

[CV-0] Accelerate High-Quality Diffusion Models with Inner Loop Feedback

【速读】:该论文旨在解决扩散模型(diffusion models)推理过程中计算效率低下的问题。现有的优化方法通常关注在极少数步骤(1-4步)内生成可接受的图像质量,而本文的重点是在显著减少运行时间的同时,匹配通常在20步内实现的最佳结果。为此,作者提出了一种称为“内环反馈”(Inner Loop Feedback, ILF)的新方法。ILF的核心思想是通过训练一个轻量级模块来预测去噪过程中的未来特征,利用扩散模型主干(diffusion backbone)在给定时间步的输出。该方法基于两个关键直觉:(1) 相邻时间步的同一模块输出相似,(2) 部分计算比完全跳过某一步骤对模型的负担更小。ILF的灵活性体现在反馈模块可以直接使用扩散主干的某个模块,并通过可学习的缩放因子(scaling factor)调节其对扩散过程的影响。训练过程中,扩散主干被冻结,仅训练反馈模块,使用蒸馏损失(distillation losses)进行优化。实验表明,ILF在扩散变换器(DiT)和基于DiT的PixArt-alpha、PixArt-sigma模型中均表现出色,实现了1.7x-1.8x的加速,并通过FID、CLIP分数、CLIP图像质量评估、ImageReward等指标验证了其生成质量。

链接: https://arxiv.org/abs/2501.13107
作者: Matthew Gwilliam,Han Cai,Di Wu,Abhinav Shrivastava,Zhiyu Cheng
机构: University of Maryland, College Park (马里兰大学帕克分校); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submission currently under review; 20 pages, 17 figures, 6 tables

点击查看摘要

Abstract:We propose Inner Loop Feedback (ILF), a novel approach to accelerate diffusion models’ inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs from a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions: (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward pass can be tempered with a learnable scaling factor from zero initialization. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the student, our model freezes the backbone, training only the feedback module. While many efforts to optimize diffusion models focus on achieving acceptable image quality in extremely few steps (1-4 steps), our emphasis is on matching best-case results (typically achieved in 20 steps) while significantly reducing runtime. ILF achieves this balance effectively, demonstrating strong performance for both class-to-image generation with diffusion transformer (DiT) and text-to-image generation with DiT-based PixArt-alpha and PixArt-sigma. The quality of ILF’s 1.7x-1.8x speedups is confirmed by FID, CLIP score, CLIP Image Quality Assessment, ImageReward, and qualitative comparisons.
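代码补充(示意):ILF 的要点可以概括为:反馈模块可直接复制主干中的某个 block,其影响由零初始化的可学习缩放因子控制;训练时冻结主干、仅以蒸馏损失训练反馈模块。下面给出一个最小 PyTorch 示意,接口与接入位置均为示意用假设。

```python
import torch
import torch.nn as nn

class InnerLoopFeedback(nn.Module):
    """用轻量模块基于当前步的 block 输出预测未来特征,按可学习系数混入。"""
    def __init__(self, feedback_block: nn.Module):
        super().__init__()
        self.feedback = feedback_block             # 可直接复制主干中的一个 block
        self.scale = nn.Parameter(torch.zeros(1))  # 零初始化:起始时不改变主干行为

    def forward(self, block_out):
        return block_out + self.scale * self.feedback(block_out)
```

训练时将主干参数冻结(requires_grad_(False)),仅更新 feedback 与 scale,即摘要所述"冻结主干、只训反馈模块"的设置。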

[CV-1] VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

【速读】:该论文旨在解决图像和视频理解中的多模态基础模型(multimodal foundation model)的优化问题,特别是通过提出一种更先进的模型VideoLLaMA3来实现这一目标。解决方案的关键在于其“视觉中心”(vision-centric)的设计理念,具体体现在两个方面:视觉中心的训练范式(vision-centric training paradigm)和视觉中心的框架设计(vision-centric framework design)。在训练范式上,论文强调高质量图像-文本数据对图像和视频理解的重要性,并通过四个训练阶段(视觉中心对齐、视觉-语言预训练、多任务微调和视频中心微调)来逐步提升模型性能。在框架设计上,模型通过调整预训练的视觉编码器(vision encoder)来适应不同尺寸的图像输入,并通过减少相似视频帧的视觉标记(vision tokens)数量来提高视频表示的精确性和紧凑性。这些设计使得VideoLLaMA3在图像和视频理解基准测试中表现出色。

链接: https://arxiv.org/abs/2501.13106
作者: Boqiang Zhang,Kehan Li,Zesen Cheng,Zhiqiang Hu,Yuqian Yuan,Guanzheng Chen,Sicong Leng,Yuming Jiang,Hang Zhang,Xin Li,Peng Jin,Wenqi Zhang,Fan Wang,Lidong Bing,Deli Zhao
机构: DAMO Academy, Alibaba Group (达摩院, 阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: BZ, KL, ZC, ZH, YY, GC, SL, YJ, HZ, and XL contributed equally to this project. Code: this https URL

点击查看摘要

Abstract:In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of “vision-centric” is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) the vision-centric alignment stage, which warms up the vision encoder and projector; 2) the vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data; 3) the multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) video-centric fine-tuning, which further improves the model’s capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into vision tokens with corresponding numbers, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos will be more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance in both image and video understanding benchmarks.
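代码补充(示意):摘要提到按相似度缩减视频的视觉 token。下面是一个贪心阈值裁剪的最小示意:与上一保留帧过于相似的帧 token 被丢弃。阈值与逐帧粒度均为示意用假设,并非论文的确切算法。

```python
import torch
import torch.nn.functional as F

def prune_similar_frames(frame_tokens, threshold=0.9):
    """frame_tokens: (帧数, token 数, 维度)。返回去冗余后的帧 token。"""
    keep = [frame_tokens[0]]
    for t in frame_tokens[1:]:
        sim = F.cosine_similarity(keep[-1], t, dim=-1).mean()
        if sim < threshold:          # 与上一保留帧差异足够大才保留
            keep.append(t)
    return torch.stack(keep)
```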

[CV-2] Neural Radiance Fields for the Real World: A Survey

【速读】:该论文旨在解决Neural Radiance Fields (NeRFs) 领域缺乏全面综述的问题,特别是在理论进展、替代表示方法、新兴挑战及其在重建、计算机视觉和机器人等领域的应用方面。论文通过系统性地梳理和总结NeRFs的关键理论进展、替代表示方法以及现有挑战,填补了文献中的空白。解决方案的关键在于对NeRFs的最新创新、应用场景和挑战进行全面的综述,并探讨其在重建、计算机视觉和机器人等领域的影响。此外,论文还通过识别文献中的研究缺口,提出了未来研究的方向。

链接: https://arxiv.org/abs/2501.13104
作者: Wenhui Xiao,Remi Chierchia,Rodrigo Santa Cruz,Xuesong Li,David Ahmedt-Aristizabal,Olivier Salvado,Clinton Fookes,Leo Lebrat
机构: Queensland University of Technology(昆士兰科技大学); CSIRO Data61(CSIRO Data61); CSIRO Agriculture & Food(CSIRO农业与食品); Australian National University(澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Neural Radiance Fields (NeRFs) have remodeled 3D scene representation since their release. NeRFs can effectively reconstruct complex 3D scenes from 2D images, advancing different fields and applications such as scene understanding, 3D content generation, and robotics. Despite significant research progress, a thorough review of recent innovations, applications, and challenges is lacking. This survey compiles key theoretical advancements and alternative representations and investigates emerging challenges. It further explores applications in reconstruction, highlights NeRFs’ impact on computer vision and robotics, and reviews essential datasets and toolkits. By identifying gaps in the literature, this survey discusses open challenges and offers directions for future research.

[CV-3] Robust Representation Consistency Model via Contrastive Denoising

【速读】:该论文旨在解决深度神经网络在对抗性扰动下的鲁棒性(robustness)问题,特别是在大扰动半径(perturbation radii)情况下,现有基于扩散模型(diffusion models)的随机平滑(randomized smoothing)方法表现不佳且计算开销较大的问题。论文的关键解决方案是将生成式建模任务重新定义为潜在空间(latent space)中的判别式任务,通过实例判别(instance discrimination)来对齐时间上相邻的点,从而在扩散轨迹上获得一致的表示。基于这些表示进行微调后,模型能够通过单次预测实现隐式的去噪-分类(denoising-then-classification),显著降低了推理成本。实验结果表明,该方法在多个数据集上实现了最先进的性能,特别是在大扰动半径下,其认证准确率(certified accuracy)比现有方法平均提高了5.3%,最高可达11.6%,同时推理成本平均降低了85倍。

链接: https://arxiv.org/abs/2501.13094
作者: Jiachen Lei,Julius Berner,Jiongxiao Wang,Zhongzhu Chen,Zhongjia Ba,Kui Ren,Jun Zhu,Anima Anandkumar
机构: Zhejiang University(浙江大学); Caltech(加州理工学院); UW–Madison(威斯康星大学麦迪逊分校); Amazon(亚马逊); Shengshu(生数科技); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples before making predictions with a standard classifier. While these methods excel at small perturbation radii, they struggle with larger perturbations and incur a significant computational overhead during inference compared to classical methods. To address this, we reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space. Specifically, we use instance discrimination to achieve consistent representations along the trajectories by aligning temporally adjacent points. After fine-tuning based on the learned representations, our model enables implicit denoising-then-classification via a single prediction, substantially reducing inference costs. We conduct extensive experiments on various datasets and achieve state-of-the-art performance with minimal computation budget during inference. For example, our method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii by 5.3% on average, with up to 11.6% at larger radii, while reducing inference costs by 85× on average. Codes are available at: this https URL.
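代码补充(示意):论文用实例判别对齐扩散轨迹上时间相邻的点。这类对比目标的一种常见写法是 InfoNCE:同一轨迹相邻时间步的表示互为正样本、批内其余为负样本。下面的实现形式与温度系数为示意用假设,未必与论文的具体目标函数一致。

```python
import torch
import torch.nn.functional as F

def temporal_infonce(z_t, z_t_next, temperature=0.1):
    """z_t / z_t_next: (batch, dim),同一行来自同一轨迹的相邻时间步。"""
    z_t = F.normalize(z_t, dim=-1)
    z_t_next = F.normalize(z_t_next, dim=-1)
    logits = z_t @ z_t_next.t() / temperature           # (batch, batch)
    labels = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, labels)              # 对角线为正样本对
```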

[CV-4] Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation

【速读】:该论文试图解决现有扩散模型(Diffusion Models)在图像生成任务中通常针对单一任务(如颜色、深度或法线预测)进行训练,而未能充分利用外观(appearance)与几何(geometry)之间的内在关联,导致预测结果不一致的问题。为了解决这一问题,论文提出了一种新颖的图像扩散先验模型Orchid,该模型通过变分自编码器(VAE)将颜色、深度和表面法线编码到潜在空间,并利用潜在扩散模型(LDM)生成这些联合潜在表示。Orchid能够直接从用户提供的文本生成逼真的彩色图像、相对深度和表面法线,并可用于创建图像对齐的部分3D场景。此外,该模型还能够执行图像条件任务(如联合单目深度和法线预测),并在准确性上与专门为这些任务设计的最先进方法相媲美。Orchid的关键在于其学习了一个联合先验,该先验可以零样本(zero-shot)方式作为许多涉及外观与几何纠缠的逆问题的正则化器,例如在稀疏视图下的3D生成任务中展示了其有效性。

链接: https://arxiv.org/abs/2501.13087
作者: Akshay Krishnan,Xinchen Yan,Vincent Casser,Abhijit Kundu
机构: Google DeepMind; Georgia Institute of Technology (佐治亚理工学院); Waymo
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project webpage: this https URL

点击查看摘要

Abstract:Diffusion models are state-of-the-art for image generation. Trained on large datasets, they capture expressive image priors that have been used for tasks like inpainting, depth, and (surface) normal prediction. However, these models are typically trained for one specific task, e.g., a separate model for each of color, depth, and normal prediction. Such models do not leverage the intrinsic correlation between appearance and geometry, often leading to inconsistent predictions. In this paper, we propose using a novel image diffusion prior that jointly encodes appearance and geometry. We introduce a diffusion model Orchid, comprising a Variational Autoencoder (VAE) to encode color, depth, and surface normals to a latent space, and a Latent Diffusion Model (LDM) for generating these joint latents. Orchid directly generates photo-realistic color images, relative depth, and surface normals from user-provided text, and can be used to create image-aligned partial 3D scenes seamlessly. It can also perform image-conditioned tasks like joint monocular depth and normal prediction and is competitive in accuracy to state-of-the-art methods designed for those tasks alone. Lastly, our model learns a joint prior that can be used zero-shot as a regularizer for many inverse problems that entangle appearance and geometry. For example, we demonstrate its effectiveness in color-depth-normal inpainting, showcasing its applicability to problems in 3D generation from sparse views.

[CV-5] CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization

【速读】:该论文旨在解决在3D牙科模型(3D Intraoral Scans, IOS)中自动识别解剖标志点(anatomical landmarks)的问题。传统方法通常需要先进行牙齿分割(segmentation),再进行标志点检测,这一过程复杂、耗时且依赖专家知识。论文提出了一种名为CHaRNet(Conditioned Heatmap Regression Network)的端到端深度学习模型,直接对输入的点云数据进行标志点检测,避免了传统两阶段方法的局限性。CHaRNet的关键创新在于其四个核心模块:(1)点云编码器(point cloud encoder),(2)带有热图回归头(heatmap regression head)的点云解码器,(3)牙齿存在分类头(teeth presence classification head),以及(4)创新的条件热图回归(Conditioned Heatmap Regression, CHaR)模块。CHaR模块通过结合牙齿存在分类信息,动态调整标志点回归,从而在处理缺失牙齿的复杂牙科模型时提高了检测精度。实验结果表明,CHaRNet在1,214个标注的3D牙科模型数据集上表现出色,平均欧几里得距离误差(Mean Euclidean Distance Error, MEDE)为1.28 mm,平均成功率(Mean Success Ratio, MSR)为82.40%,尤其在处理不规则牙科几何形状时表现优异。该研究不仅提升了3D IOS分析的精度,还简化了正畸治疗的工作流程,推动了计算机辅助治疗规划的发展。

链接: https://arxiv.org/abs/2501.13073
作者: José Rodríguez-Ortega,Siham Tabik
机构: Nemotec; Dept. of Computer Science and Artificial Intelligence, University of Granada (格拉纳达大学计算机科学与人工智能系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Identifying anatomical landmarks in 3D dental models is crucial for orthodontic treatment. Manually placing these key points is complex, time-consuming, and requires expert knowledge. While some machine learning methods have been proposed for automatic tooth landmark detection in 3D Intraoral Scans (IOS), research remains limited, with no fully end-to-end approaches that avoid teeth segmentation. We propose CHaRNet (Conditioned Heatmap Regression Network), the first end-to-end deep learning method for tooth landmark detection in 3D IOS. Unlike traditional two-stage methods that segment teeth before detecting landmarks, CHaRNet directly detects landmarks on the input point cloud. It consists of four key modules: (1) a point cloud encoder, (2) a point cloud decoder with a heatmap regression head, (3) a teeth presence classification head, and (4) the innovative Conditioned Heatmap Regression (CHaR) module. The CHaR module refines landmark regression by leveraging teeth presence classification, enabling dynamic adaptation to cases with missing teeth and improving accuracy in complex dental models. We evaluate CHaRNet using five point cloud learning algorithms to validate the effectiveness of the CHaR module and test it on a clinical dataset of 1,214 annotated 3D dental models. Both the dataset and code will be publicly released to address the lack of open datasets in orthodontics, promote benchmarking, and inspire new research. CHaRNet achieves a Mean Euclidean Distance Error (MEDE) of 1.28 mm and a Mean Success Ratio (MSR) of 82.40%, demonstrating robust performance. Notably, it excels in handling irregular dental geometries, such as models with missing teeth. This end-to-end approach streamlines orthodontic workflows, improves 3D IOS analysis precision, and facilitates efficient computer-assisted treatment planning.
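代码补充(示意):CHaRNet 以热图回归头在点云上定位标志点。下面给出标志点高斯热图回归目标的一个最小示意,说明"峰值处即标志点"的监督形式;sigma 取值与点云数值均为示意用假设,并非论文的确切目标构造。

```python
import torch

def landmark_heatmap(points, landmark, sigma=2.0):
    """为点云每个点生成以标志点为中心的高斯热图目标,形状 (点数,)。"""
    d2 = ((points - landmark) ** 2).sum(dim=-1)   # 各点到标志点的平方距离
    return torch.exp(-d2 / (2 * sigma ** 2))

points = torch.rand(1000, 3) * 50                 # 假想的口扫点云坐标(mm)
target = landmark_heatmap(points, torch.tensor([25.0, 25.0, 25.0]))
```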

[CV-6] Robust Body Composition Analysis by Generating 3D CT Volumes from Limited 2D Slices

【速读】:该论文旨在解决使用二维(2D)单层计算机断层扫描(CT)成像进行身体成分分析时,由于空间变异性导致的准确性和鲁棒性不足的问题。为了解决这一问题,论文提出了一种基于潜在扩散模型(Latent Diffusion Model, LDM)的新方法,通过从有限的2D切片生成三维(3D)CT体积来增强身体成分分析的准确性。解决方案的关键在于:首先,使用变分自编码器(Variational Autoencoder)将2D切片映射到潜在表示空间;其次,训练LDM以捕捉这些潜在表示的3D上下文;最后,通过身体部位回归(Body Part Regression)确定获取切片之间的空间位置和距离,从而准确插值中间切片并构建完整的3D体积。实验结果表明,该方法显著降低了误差率,从23.3%降至15.2%,优于传统的2D分析方法。

链接: https://arxiv.org/abs/2501.13071
作者: Lianrui Zuo,Xin Yu,Dingjie Su,Kaiwen Xu,Aravind R. Krishnan,Yihao Liu,Shunxing Bao,Fabien Maldonado,Luigi Ferrucci,Bennett A. Landman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Body composition analysis provides valuable insights into aging, disease progression, and overall health conditions. Due to concerns of radiation exposure, two-dimensional (2D) single-slice computed tomography (CT) imaging has been used repeatedly for body composition analysis. However, this approach introduces significant spatial variability that can impact the accuracy and robustness of the analysis. To mitigate this issue and facilitate body composition analysis, this paper presents a novel method to generate 3D CT volumes from a limited number of 2D slices using a latent diffusion model (LDM). Our approach first maps 2D slices into a latent representation space using a variational autoencoder. An LDM is then trained to capture the 3D context of a stack of these latent representations. To accurately interpolate intermediate slices and construct a full 3D volume, we utilize body part regression to determine the spatial location and distance between the acquired slices. Experiments on both in-house and public 3D abdominal CT datasets demonstrate that the proposed method significantly enhances body composition analysis compared to traditional 2D-based analysis, reducing the error rate from 23.3% to 15.2%.
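代码补充(示意):该方法先用 VAE 逐张编码 2D 轴向切片,再把潜在表示堆叠成 3D 上下文供 LDM 学习。下面仅示意堆叠这一步;vae_encoder 为假设接口(输入单张切片、输出其潜在特征图),基于身体部位回归的切片空间定位未包含在内。

```python
import torch

def stack_slice_latents(vae_encoder, slices):
    """slices: (切片数, 1, H, W) 的 2D CT 切片。返回 3D 潜在体。"""
    with torch.no_grad():
        latents = [vae_encoder(s.unsqueeze(0)) for s in slices]  # 各 (1, C, h, w)
    # 在深度维上拼成 (1, C, 切片数, h, w) 的 3D 上下文
    return torch.stack(latents, dim=2)
```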

[CV-7] Beyond the Lungs: Extending the Field of View in Chest CT with Latent Diffusion Models

【速读】:该论文试图解决胸部CT成像中由于成本和辐射剂量考虑而导致的视野受限(FOV)问题,这种限制使得无法全面分析肺部疾病对其他器官(如肝脏和肾脏)的影响。为了解决这一问题,论文提出了一种名为SCOPE(Spatial Coverage Optimization with Prior Encoding)的新方法,旨在通过生成新的轴向切片来扩展胸部CT图像的视野。该方法的关键在于首先训练一个变分自编码器(VAE)来单独编码2D轴向CT切片,然后将VAE的潜在表示堆叠起来形成3D上下文,用于训练潜在扩散模型。一旦模型训练完成,SCOPE能够在零样本情况下生成新的轴向切片,从而在z方向上扩展CT图像的视野。实验结果表明,该方法能够有效扩展视野,涵盖原始NLST数据采集未完全覆盖的肝脏和肾脏区域,并且在生成切片的高保真度方面表现出色,SSIM达到0.81。

链接: https://arxiv.org/abs/2501.13068
作者: Lianrui Zuo,Kaiwen Xu,Dingjie Su,Xin Yu,Aravind R. Krishnan,Yihao Liu,Shunxing Bao,Thomas Li,Kim L. Sandler,Fabien Maldonado,Bennett A. Landman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The interconnection between the human lungs and other organs, such as the liver and kidneys, is crucial for understanding the underlying risks and effects of lung diseases and improving patient care. However, most research chest CT imaging is focused solely on the lungs due to considerations of cost and radiation dose. This restricted field of view (FOV) in the acquired images poses challenges to comprehensive analysis and hinders the ability to gain insights into the impact of lung diseases on other organs. To address this, we propose SCOPE (Spatial Coverage Optimization with Prior Encoding), a novel approach to capture the inter-organ relationships from CT images and extend the FOV of chest CT images. Our approach first trains a variational autoencoder (VAE) to encode 2D axial CT slices individually, then stacks the latent representations of the VAE to form a 3D context for training a latent diffusion model. Once trained, our approach extends the FOV of CT images in the z-direction by generating new axial slices in a zero-shot manner. We evaluated our approach on the National Lung Screening Trial (NLST) dataset, and results suggest that it effectively extends the FOV to include the liver and kidneys, which are not completely covered in the original NLST data acquisition. Quantitative results on a held-out whole-body dataset demonstrate that the generated slices exhibit high fidelity with acquired data, achieving an SSIM of 0.81.
zh
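
下面用占位模块给出 SCOPE 核心数据流的极简示意:逐张编码 2D 轴向切片,再沿 z 方向堆叠潜变量构成 3D 上下文;其中编码器以普通卷积代替真实的 VAE,通道数与切片数均为假设值。

```python
import torch

# 极简示意:2D 切片逐张编码后沿 z 方向堆叠,得到潜扩散模型的 3D 上下文
encoder = torch.nn.Conv2d(1, 4, 3, padding=1)               # 占位的"VAE 编码器"
slices = [torch.rand(1, 1, 64, 64) for _ in range(8)]       # 8 张轴向切片(示意)
latents = torch.stack([encoder(s) for s in slices], dim=2)  # (B, C, Z, H, W)
print(latents.shape)  # torch.Size([1, 4, 8, 64, 64])
# 潜扩散模型在该 3D 潜空间上训练;推理时向 z 方向外推生成新切片以扩展 FOV
```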

[CV-8] SMART-Vision: Survey of Modern Action Recognition Techniques in Vision

【速读】:该论文旨在解决人类动作识别(Human Action Recognition, HAR)领域中现有分类方法(taxonomies)的不足,特别是这些方法未能充分涵盖混合方法(hybrid methodologies)以及未能展示不同模型如何整合多种架构和模态的问题。为此,论文提出了一种新的分类方法——SMART-Vision taxonomy,该分类法系统地展示了深度学习在HAR领域的创新如何相互补充,从而推动超越传统分类的混合方法的发展。关键解决方案在于通过这一分类法,为从基础HAR研究到当前最先进系统的发展提供清晰的路线图,同时突出新兴研究方向并讨论HAR领域内架构的未解决挑战。此外,论文还探讨了开放HAR系统(Open-HAR systems)这一新兴领域,该系统通过在测试时引入未知类别样本,进一步挑战现有HAR系统的能力。

链接: https://arxiv.org/abs/2501.13066
作者: Ali K. AlShami,Ryan Rabinowitz,Khang Lam,Yousra Shleibik,Melkamu Mersha,Terrance Boult,Jugal Kalita
机构: Computer Science Department, University of Colorado, Colorado Springs (科罗拉多大学科罗拉多斯普林斯分校计算机科学系); Information Technology Department, Can Tho University (芹苴大学信息技术系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human Action Recognition (HAR) is a challenging domain in computer vision, involving recognizing complex patterns by analyzing the spatiotemporal dynamics of individuals’ movements in videos. These patterns arise in sequential data, such as video frames, which are often essential to accurately distinguish actions that would be ambiguous in a single image. HAR has garnered considerable interest due to its broad applicability, ranging from robotics and surveillance systems to sports motion analysis, healthcare, and the burgeoning field of autonomous vehicles. While several taxonomies have been proposed to categorize HAR approaches in surveys, they often overlook hybrid methodologies and fail to demonstrate how different models incorporate various architectures and modalities. In this comprehensive survey, we present the novel SMART-Vision taxonomy, which illustrates how innovations in deep learning for HAR complement one another, leading to hybrid approaches beyond traditional categories. Our survey provides a clear roadmap from foundational HAR works to current state-of-the-art systems, highlighting emerging research directions and addressing unresolved challenges in discussion sections for architectures within the HAR domain. We provide details of the research datasets that various approaches used to measure and compare the goodness of HAR approaches. We also explore the rapidly emerging field of Open-HAR systems, which challenges HAR systems by presenting samples from unknown, novel classes during test time.
zh

[CV-9] STMDNet: A Lightweight Directional Framework for Motion Pattern Recognition of Tiny Targets

【速读】:该论文试图解决在复杂背景中识别仅有几十个像素大小的微小目标运动的问题,尤其是在标准特征提取方法或深度学习方法在视觉线索稀缺的情况下失效的场景。解决方案的关键在于提出了STMDNet,一种基于模型的计算框架,通过设计一种新颖的双动力学和相关机制(dual-dynamics-and-correlation mechanism),利用同侧兴奋(ipsilateral excitation)整合目标线索,并通过泄漏增强型对侧抑制(leakage-enhancing-type contralateral inhibition)抑制大目标和背景运动干扰。此外,STMDNet开发了首个协作方向编码-解码策略(collaborative directional encoding-decoding strategy),仅通过每个空间位置的一个相关性确定运动方向,将计算成本降低至先前方法的八分之一。这些创新使得STMDNet在低采样频率场景下表现出色,显著提升了微小目标运动模式的识别性能。

链接: https://arxiv.org/abs/2501.13054
作者: Mingshuo Xu,Hao Luan,Zhou Daniel Hao,Jigen Peng,Shigang Yue
机构: School of Mathematics and Computing Science, University of Leicester, Leicester LE1 7RH, UK(莱斯特大学数学与计算科学学院); Tianjin Key Laboratory of Information Sensing and Intelligent Control, School of Automation and Electrical Engineering, Tianjin University of Technology and Education, Tianjin 300222, China(天津职业技术师范大学自动化与电气工程学院信息感知与智能控制天津市重点实验室); Machine Life and Intelligence Research Center, Guangzhou University, Guangzhou 510006, China(广州大学生命与智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures

点击查看摘要

Abstract:Recognizing motions of tiny targets - only a few dozen pixels - in cluttered backgrounds remains a fundamental challenge when standard feature-based or deep learning methods fail under scarce visual cues. We propose STMDNet, a model-based computational framework to recognize motions of tiny targets at variable velocities under low-sampling frequency scenarios. STMDNet designs a novel dual-dynamics-and-correlation mechanism, harnessing ipsilateral excitation to integrate target cues and leakage-enhancing-type contralateral inhibition to suppress large-object and background motion interference. Moreover, we develop the first collaborative directional encoding-decoding strategy that determines the motion direction from only one correlation per spatial location, cutting computational costs to one-eighth of prior methods. Further, simply substituting the backbone of a strong STMD model with STMDNet raises AUC by 24%, yielding an enhanced STMDNet-F. Evaluations on real-world low sampling frequency datasets show state-of-the-art results, surpassing the deep learning baseline. Across diverse speeds, STMDNet-F improves mF1 by 19%, 16%, and 8% at 240Hz, 120Hz, and 60Hz, respectively, while STMDNet achieves 87 FPS on a single CPU thread. These advances highlight STMDNet as a next-generation backbone for tiny target motion pattern recognition and underscore its broader potential to revitalize model-based visual approaches in motion detection.
zh
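
STMDNet 的双动力学-相关机制建立在经典的时延-相关(Reichardt/STMD 类)运动检测思想之上。下面是该类检测器的一个极简 numpy 示意(仅含单一空间方向与单一时延,参数均为假设),并非论文模型本身:

```python
import numpy as np

def delayed_correlation(frames, delay=1, shift=1):
    """极简的时延-相关运动检测示意:将每个像素的延迟信号与其空间相邻
    像素的当前信号相乘,再对两个方向做差,得到带方向性的运动响应。"""
    f = frames.astype(np.float32)
    cur, past = f[delay:], f[:-delay]
    # 向右运动:过去的左邻像素 × 当前像素;向左运动则相反
    right = past[:, :, :-shift] * cur[:, :, shift:]
    left = past[:, :, shift:] * cur[:, :, :-shift]
    return right - left   # 正值表示偏向"向右"的运动能量

frames = np.random.rand(8, 32, 32)   # T×H×W 的示意输入序列
resp = delayed_correlation(frames)
print(resp.shape)                    # (7, 32, 31)
```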

[CV-10] Sketch and Patch: Efficient 3D Gaussian Representation for Man-Made Scenes

【速读】:该论文试图解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在真实感渲染中存储需求过高的问题。3DGS虽然能够实现高质量的3D场景渲染,但其高存储需求限制了实际应用。论文的关键解决方案是基于高斯在场景中扮演的不同角色和特性,提出了一种新颖的混合表示方法。具体而言,高斯被分为两类:(i) 草图高斯(Sketch Gaussians),用于定义场景边界,捕捉高频特征如边缘和轮廓;(ii) 块高斯(Patch Gaussians),用于表示平滑区域,类似于绘画中的大面积笔触。草图高斯通过参数化模型高效编码,利用其几何一致性;而块高斯则通过优化的剪枝、重新训练和矢量量化来保持体积一致性和存储效率。实验结果表明,该方法在相同模型大小下,PSNR提高了32.62%,SSIM提高了19.12%,LPIPS提高了45.41%,并且在室内场景中,模型大小仅为原始模型的2.3%时仍能保持视觉质量。

链接: https://arxiv.org/abs/2501.13045
作者: Yuang Shi,Simone Gasparini,Géraldine Morin,Chenggang Yang,Wei Tsang Ooi
机构: National University of Singapore(新加坡国立大学); IRIT - University of Toulouse(图卢兹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a promising representation for photorealistic rendering of 3D scenes. However, its high storage requirements pose significant challenges for practical applications. We observe that Gaussians exhibit distinct roles and characteristics that are analogous to traditional artistic techniques: just as artists first sketch outlines before filling in broader areas with color, some Gaussians capture high-frequency features like edges and contours, while other Gaussians represent broader, smoother regions, analogous to the broad brush strokes that add volume and depth to a painting. Based on this observation, we propose a novel hybrid representation that categorizes Gaussians into (i) Sketch Gaussians, which define scene boundaries, and (ii) Patch Gaussians, which cover smooth regions. Sketch Gaussians are efficiently encoded using parametric models, leveraging their geometric coherence, while Patch Gaussians undergo optimized pruning, retraining, and vector quantization to maintain volumetric consistency and storage efficiency. Our comprehensive evaluation across diverse indoor and outdoor scenes demonstrates that this structure-aware approach achieves up to 32.62% improvement in PSNR, 19.12% in SSIM, and 45.41% in LPIPS at equivalent model sizes, and correspondingly, for an indoor scene, our model maintains the visual quality with 2.3% of the original model size.
zh
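
下面给出一个玩具式的划分示意:按高斯的尺度与各向异性把点集粗分为 Sketch/Patch 两类。阈值与判据均为本文读者为说明思路而作的假设,论文的实际划分标准以原文为准。

```python
import numpy as np

# 假设:每个高斯有缩放参数 scales (N,3) 与不透明度 opacity (N,),数值为随机示意
N = 10000
scales = np.random.lognormal(mean=-3.0, sigma=0.8, size=(N, 3))
opacity = np.random.rand(N)

# 玩具式判据:尺度小且各向异性强的高斯更可能承担边缘/轮廓(Sketch),
# 其余视为覆盖平滑区域的 Patch;阈值 0.05 与 3.0 均为示意性假设
aniso = scales.max(axis=1) / (scales.min(axis=1) + 1e-8)
is_sketch = (scales.mean(axis=1) < 0.05) & (aniso > 3.0)

sketch_idx, patch_idx = np.where(is_sketch)[0], np.where(~is_sketch)[0]
print(len(sketch_idx), len(patch_idx))
# Sketch 高斯随后用参数化模型编码;Patch 高斯走剪枝/重训练/矢量量化流程
```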

[CV-11] Deep Learning-Based Image Recovery and Pose Estimation for Resident Space Objects

【速读】:该论文旨在解决地球轨道上航天器(Resident Space Object, RSO)识别、姿态和轨迹识别中的挑战,特别是由于缺乏可用于模型训练的真实图像数据而导致的训练困难。为了解决这一问题,论文提出了一种创新的框架,用于生成逼真的合成数据集,并以国际空间站(ISS)为测试案例,结合图像回归(image regression)和图像恢复(image restoration)方法来从模糊图像中估计姿态。解决方案的关键在于首先使用有效的点扩散函数(point spread function)进行图像去卷积,然后通过U-Net进行细节对象提取。研究表明,仅使用U-Net进行图像重建时,姿态估计性能最佳,图像恢复的平均均方误差(Mean Squared Error)减少了97.28%,平均角度误差减少了71.9%。通过结合U-Net图像恢复和Resnet50回归网络,成功实现了对国际空间站的姿态估计,展示了多样化的评估工具在解决实际问题(如地球轨道上远距离物体分析)中的价值。

链接: https://arxiv.org/abs/2501.13009
作者: Louis Aberdeen,Mark Hansen,Melvyn L. Smith,Lyndon Smith
机构: University of the West of England, Centre for Machine Vision (西英格兰大学, 机器视觉中心); Metrea Mission Data (Metrea Mission Data); Innovate UK (Innovate UK)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 10 pages, 13 figures

点击查看摘要

Abstract:As the density of spacecraft in Earth’s orbit increases, their recognition, pose and trajectory identification becomes crucial for averting potential collisions and executing debris removal operations. However, training models able to identify a spacecraft and its pose presents a significant challenge due to a lack of available image data for model training. This paper puts forth an innovative framework for generating realistic synthetic datasets of Resident Space Object (RSO) imagery. Using the International Space Station (ISS) as a test case, it goes on to combine image regression with image restoration methodologies to estimate pose from blurred images. An analysis of the proposed image recovery and regression techniques was undertaken, providing insights into the performance, potential enhancements and limitations when applied to real imagery of RSOs. The image recovery approach investigated involves first applying image deconvolution using an effective point spread function, followed by detail object extraction with a U-Net. Interestingly, the best pose performance was attained using only U-Net for image reconstruction, reducing the average Mean Squared Error in image recovery by 97.28% and the average angular error by 71.9%. The successful application of U-Net image restoration combined with the Resnet50 regression network for pose estimation of the International Space Station demonstrates the value of a diverse set of evaluation tools for effective solutions to real-world problems such as the analysis of distant objects in Earth’s orbit.
zh
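
摘要中"使用有效点扩散函数进行图像去卷积"这一步,可以用经典的频域维纳去卷积来直观理解。下面是一个极简示意,其中的 PSF、图像尺寸与噪声功率比 k 均为假设值,并非论文的具体配置:

```python
import numpy as np

def wiener_deconv(blurred, psf, k=0.01):
    """频域维纳去卷积的极简示意:psf 为有效点扩散函数,k 为噪声功率比(假设值)。"""
    H = np.fft.fft2(psf, s=blurred.shape)          # PSF 的频域响应
    G = np.fft.fft2(blurred)
    F_hat = np.conj(H) / (np.abs(H) ** 2 + k) * G  # 维纳滤波
    return np.real(np.fft.ifft2(F_hat))

img = np.random.rand(64, 64)                       # 示意的模糊观测
psf = np.zeros((64, 64)); psf[:3, :3] = 1 / 9.0    # 3×3 均匀模糊核(假设)
restored = wiener_deconv(img, psf)
print(restored.shape)
# 论文流程中,去卷积结果随后送入 U-Net 做细节目标提取
```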

[CV-12] UniUIR: Considering Underwater Image Restoration as An All-in-One Learner

【速读】:该论文旨在解决现有水下图像恢复(Underwater Image Restoration, UIR)方法在处理复杂水下场景退化时的局限性。现有方法通常仅处理颜色失真或同时处理颜色和雾霾问题,但往往忽略了水下场景中可能出现的更复杂的退化现象。为此,论文提出了一种称为UniUIR的通用水下图像恢复方法,以一体化方式处理真实水下场景中的混合退化问题。解决方案的关键在于设计了Mamba Mixture-of-Experts模块,该模块通过多个专家分别识别不同类型的退化,并协作提取任务特定的先验信息,同时基于线性复杂度保持全局特征表示。此外,论文还引入了空间-频率先验生成器(spatial-frequency prior generator),该模块在空间和频率域中提取退化先验信息,并根据图像内容自适应选择最合适的任务特定提示,从而提高图像恢复的准确性。最后,为了更有效地处理UIR任务中复杂且区域依赖的退化问题,论文结合了从大规模预训练深度预测模型中获取的深度信息,使网络能够感知并利用不同图像区域的深度变化来处理局部退化。实验结果表明,UniUIR在定性和定量比较中均能产生更具吸引力的结果,并展现出比现有最先进方法更强的泛化能力。

链接: https://arxiv.org/abs/2501.12981
作者: Xu Zhang,Huan Zhang,Guoli Wang,Qian Zhang,Lefei Zhang,Bo Du
机构: Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan 430072, China(武汉大学计算机科学学院人工智能研究所); Hubei Luojia Laboratory, Wuhan, China(湖北珞珈实验室); National Engineering Research Center for Multimedia Software, Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China(武汉大学多媒体软件国家工程研究中心, 湖北省多媒体与网络通信工程重点实验室); School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China(广东工业大学信息工程学院); Horizon Robotics, Beijing 100083, China(地平线机器人)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures

点击查看摘要

Abstract:Existing underwater image restoration (UIR) methods generally only handle color distortion or jointly address color and haze issues, but they often overlook the more complex degradations that can occur in underwater scenes. To address this limitation, we propose a Universal Underwater Image Restoration method, termed as UniUIR, considering the complex scenario of real-world underwater mixed distortions as an all-in-one manner. To decouple degradation-specific issues and explore the inter-correlations among various degradations in UIR task, we designed the Mamba Mixture-of-Experts module. This module enables each expert to identify distinct types of degradation and collaboratively extract task-specific priors while maintaining global feature representation based on linear complexity. Building upon this foundation, to enhance degradation representation and address the task conflicts that arise when handling multiple types of degradation, we introduce the spatial-frequency prior generator. This module extracts degradation prior information in both spatial and frequency domains, and adaptively selects the most appropriate task-specific prompts based on image content, thereby improving the accuracy of image restoration. Finally, to more effectively address complex, region-dependent distortions in UIR task, we incorporate depth information derived from a large-scale pre-trained depth prediction model, thereby enabling the network to perceive and leverage depth variations across different image regions to handle localized degradation. Extensive experiments demonstrate that UniUIR can produce more attractive results across qualitative and quantitative comparisons, and shows strong generalization than state-of-the-art methods.
zh

[CV-13] LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation

【速读】:该论文旨在解决线性注意力(Linear Attention)在图像合成任务中的架构设计和学习策略尚未充分探索的问题。论文提出了一套高效的线性扩散Transformer(Linear Diffusion Transformer, LiT)解决方案,其核心贡献包括:1)简化线性注意力机制,使用少量注意力头(few heads),在不增加延迟的情况下实现性能提升;2)通过权重继承(Weight Inheritance)从预训练的扩散Transformer初始化线性Transformer,并加载除线性注意力相关参数外的所有参数;3)采用混合知识蒸馏目标(Hybrid Knowledge Distillation Objective),利用预训练的扩散Transformer辅助学生线性Transformer的训练,不仅监督预测噪声,还监督反向扩散过程的方差。这些方法使得LiT在类条件ImageNet基准测试中,相比DiT减少了80%和77%的训练步骤,同时保持了竞争力的FID分数,并在文本到图像生成任务中能够快速合成高达1K分辨率的逼真图像。

链接: https://arxiv.org/abs/2501.12976
作者: Jiahao Wang,Ning Kang,Lewei Yao,Mengzhao Chen,Chengyue Wu,Songyang Zhang,Shuchen Xue,Yong Liu,Taiqiang Wu,Xihui Liu,Kaipeng Zhang,Shifeng Zhang,Wenqi Shao,Zhenguo Li,Ping Luo
机构: HKU(香港大学); Shanghai AI Lab(上海人工智能实验室); Huawei Noah’s Ark Lab(华为诺亚方舟实验室); UCAS(中国科学院大学); THUsz(清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 12 figures

点击查看摘要

Abstract:In commonly used sub-quadratic complexity modules, linear attention benefits from simplicity and high parallelism, making it promising for image synthesis tasks. However, the architectural design and learning strategy for linear attention remain underexplored in this field. In this paper, we offer a suite of ready-to-use solutions for efficient linear diffusion Transformers. Our core contributions include: (1) Simplified Linear Attention using few heads, observing the free-lunch effect of performance without latency increase. (2) Weight inheritance from a fully pre-trained diffusion Transformer: initializing linear Transformer using pre-trained diffusion Transformer and loading all parameters except for those related to linear attention. (3) Hybrid knowledge distillation objective: using a pre-trained diffusion Transformer to help the training of the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), an efficient text-to-image Transformer that can be deployed offline on a laptop. Experiments show that on the class-conditional 256×256 and 512×512 ImageNet benchmarks, LiT achieves highly competitive FID while reducing training steps by 80% and 77% compared to DiT. LiT also rivals methods based on Mamba or Gated Linear Attention. Besides, for text-to-image generation, LiT allows for the rapid synthesis of up to 1K resolution photorealistic images. Project page: this https URL.
zh
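
LiT 的核心模块之一是少头线性注意力。下面是线性注意力的一个常见极简实现:核函数取 elu+1 是社区常用选择,未必与论文设定一致;其复杂度对序列长度 N 为线性,因为先计算 K^T V 再与 Q 相乘。

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """线性注意力的极简示意(核函数取 elu+1,为常见选择,非论文原文设定)。
    先算 (B,H,D,D) 的 K^T V,再与 Q 相乘,对序列长度 N 为线性复杂度。"""
    q = torch.nn.functional.elu(q) + 1            # (B, H, N, D),保证非负
    k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum('bhnd,bhne->bhde', k, v)    # K^T V,形状 (B, H, D, D)
    z = 1 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps)  # 归一化项
    return torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)

B, H, N, D = 2, 2, 256, 32                        # LiT 的"少头"思路:H 取小值
out = linear_attention(torch.randn(B, H, N, D), torch.randn(B, H, N, D),
                       torch.randn(B, H, N, D))
print(out.shape)  # torch.Size([2, 2, 256, 32])
```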

[CV-14] MorphoSkel3D: Morphological Skeletonization of 3D Point Clouds for Informed Sampling in Object Classification and Retrieval

【速读】:该论文旨在解决点云(point clouds)处理中的一个关键问题,即如何有效地从点云数据中识别并采样出能够准确表示物体三维几何形状的子集。传统采样方法往往忽略了几何信息的整合,导致采样结果在保留物体结构方面表现不佳。论文提出的解决方案是引入一种基于形态学的新技术——MorphoSkel3D,该技术通过结合几何先验(geometrical priors)来增强采样过程中对物体底层结构的学习和保留能力。MorphoSkel3D的核心在于其低计算成本的规则化算法,能够生成定性的骨架(qualitative skeleton),从而指导局部和全局几何形状的采样。通过在ModelNet和ShapeNet两个大型数据集上的实验,论文证明了MorphoSkel3D在物体分类和点云检索等实际应用中的高效性和准确性。

链接: https://arxiv.org/abs/2501.12974
作者: Pierre Onghena,Santiago Velasco-Forero,Beatriz Marcotegui
机构: Mines Paris, PSL University, Center for Mathematical Morphology (CMM) (巴黎高等矿业学院, PSL大学, 数学形态学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Point clouds are a set of data points in space to represent the 3D geometry of objects. A fundamental step in the processing is to identify a subset of points to represent the shape. While traditional sampling methods often fail to incorporate geometrical information, recent developments in learning-based sampling models have achieved significant levels of performance. With the integration of geometrical priors, the ability to learn and preserve the underlying structure can be enhanced when sampling. To shed light into the shape, a qualitative skeleton serves as an effective descriptor to guide sampling for both local and global geometries. In this paper, we introduce MorphoSkel3D as a new technique based on morphology to facilitate an efficient skeletonization of shapes. With its low computational cost, MorphoSkel3D is a unique, rule-based algorithm, and we benchmark its quality and performance on two large datasets, ModelNet and ShapeNet, under different sampling ratios. The results show that training with MorphoSkel3D leads to an informed and more accurate sampling in the practical application of object classification and point cloud retrieval.
zh

[CV-15] A Novel Tracking Framework for Devices in X-ray Leveraging Supplementary Cue-Driven Self-Supervised Features

【速读】:该论文旨在解决在介入性X射线序列中准确检测和跟踪冠状动脉介入手术中使用的设备(如导管、球囊和支架)的挑战。这些挑战主要源于对比血管和其他设备的遮挡、周围结构的干扰,以及小物体跟踪的困难。现有的跟踪方法通常依赖于过去和当前外观的空间相关性,但缺乏对运动的深刻理解,难以在复杂条件下有效检测多个设备实例。

解决方案的关键在于提出了一种自监督学习(self-supervised learning)方法,通过结合辅助线索和在多个表示空间中进行学习,增强了时空理解能力。此外,论文引入了一个通用的实时跟踪框架,该框架有效利用了预训练的时空网络,并考虑了历史外观和轨迹数据,从而显著提高了设备标志物的定位精度。该方法在球囊标志物和导管尖端的检测中,分别实现了87%和61%的最大误差减少,显著优于现有技术。

链接: https://arxiv.org/abs/2501.12958
作者: Saahil Islam,Venkatesh N. Murthy,Dominik Neumann,Serkan Cimen,Puneet Sharma,Andreas Maier,Dorin Comaniciu,Florin C. Ghesu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To restore proper blood flow in blocked coronary arteries via angioplasty procedure, accurate placement of devices such as catheters, balloons, and stents under live fluoroscopy or diagnostic angiography is crucial. Identified balloon markers help in enhancing stent visibility in X-ray sequences, while the catheter tip aids in precise navigation and co-registering vessel structures, reducing the need for contrast in angiography. However, accurate detection of these devices in interventional X-ray sequences faces significant challenges, particularly due to occlusions from contrasted vessels and other devices and distractions from surrounding structures, resulting in the failure to track such small objects. While most tracking methods rely on spatial correlation of past and current appearance, they often lack strong motion comprehension essential for navigating through these challenging conditions, and fail to effectively detect multiple instances in the scene. To overcome these limitations, we propose a self-supervised learning approach that enhances its spatio-temporal understanding by incorporating supplementary cues and learning across multiple representation spaces on a large dataset. Followed by that, we introduce a generic real-time tracking framework that effectively leverages the pretrained spatio-temporal network and also takes the historical appearance and trajectory data into account. This results in enhanced localization of multiple instances of device landmarks. Our method outperforms state-of-the-art methods in interventional X-ray device tracking, especially stability and robustness, achieving an 87% reduction in max error for balloon marker detection and a 61% reduction in max error for catheter tip detection.
zh

[CV-16] 3D Object Manipulation in a Single Image using Generative Models

【速读】:该论文旨在解决图像中物体操作(Object manipulation)的两个主要挑战:静态编辑(static editing)和动态生成(dynamic generation)的同步处理,以及物体外观和场景光照的真实性(fidelity)问题。为了解决这些问题,论文提出了OMG3D框架,该框架结合了精确的几何控制(precise geometric control)和扩散模型(diffusion models)的生成能力,显著提升了视觉表现。关键解决方案包括:1) 将2D物体转换为3D模型,使用户能够在几何层面上进行修改并生成逼真的运动;2) 引入CustomRefiner模块,通过预训练定制化的扩散模型来细化纹理,使3D粗糙模型的渲染细节和风格与原始图像对齐;3) 提出IllumiCombiner模块,估计并校正背景光照,以匹配人类视觉感知,从而生成更真实的阴影效果。这些步骤均可在单个NVIDIA 3090 GPU上完成。

链接: https://arxiv.org/abs/2501.12935
作者: Ruisi Zhao,Zechuan Zhang,Zongxin Yang,Yi Yang
机构: ReLER, CCAI, Zhejiang University (浙江大学); DBMI, HMS, Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object manipulation in images aims to not only edit the object’s presentation but also gift objects with motion. Previous methods encountered challenges in concurrently handling static editing and dynamic generation, while also struggling to achieve fidelity in object appearance and scene lighting. In this work, we introduce \textbfOMG3D, a novel framework that integrates the precise geometric control with the generative power of diffusion models, thus achieving significant enhancements in visual performance. Our framework first converts 2D objects into 3D, enabling user-directed modifications and lifelike motions at the geometric level. To address texture realism, we propose CustomRefiner, a texture refinement module that pre-trains a customized diffusion model to align the details and style of coarse renderings of the 3D rough model with the original image, further refining the texture. Additionally, we introduce IllumiCombiner, a lighting processing module that estimates and corrects background lighting to match human visual perception, resulting in more realistic shadow effects. Extensive experiments demonstrate the outstanding visual performance of our approach in both static and dynamic scenarios. Remarkably, all these steps can be done using one NVIDIA 3090. Project page is at this https URL
zh

[CV-17] DynamicEarth: How Far are We from Open-Vocabulary Change Detection?

【速读】:该论文旨在解决地球地表覆盖变化监测中现有方法依赖于预定义类别的问题,限制了其在开放世界应用中的有效性。为此,作者提出了一种新的任务——开放词汇变化检测(Open-Vocabulary Change Detection, OVCD),通过结合视觉和语言技术来检测任意类别的地表变化。为了解决高质量数据和标注的缺乏问题,作者提出了两种无需训练的框架:M-C-I和I-M-C。M-C-I框架的核心思想是先发现所有潜在的变化,然后对这些变化进行分类;而I-M-C框架则先识别所有感兴趣的目标,再判断其状态是否发生变化。基于这两种框架,作者实例化了多种方法,如SAM-DINOv2-SegEarth-OV和Grounding-DINO-SAM2-DINO等。通过在5个基准数据集上的广泛评估,证明了这些OVCD方法在泛化性和鲁棒性上优于现有的监督和无监督方法。此外,作者还发布了DynamicEarth代码库,以支持OVCD的进一步研究和应用。

链接: https://arxiv.org/abs/2501.12931
作者: Kaiyu Li,Xiangyong Cao,Yupeng Deng,Chao Pang,Zepeng Xin,Deyu Meng,Zhi Wang
机构: Xi’an Jiaotong University (西安交通大学); Chinese Academy of Sciences (中国科学院); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monitoring Earth’s evolving land covers requires methods capable of detecting changes across a wide range of categories and contexts. Existing change detection methods are hindered by their dependency on predefined classes, reducing their effectiveness in open-world applications. To address this issue, we introduce open-vocabulary change detection (OVCD), a novel task that bridges vision and language to detect changes across any category. Considering the lack of high-quality data and annotation, we propose two training-free frameworks, M-C-I and I-M-C, which leverage and integrate off-the-shelf foundation models for the OVCD task. The insight behind the M-C-I framework is to discover all potential changes and then classify these changes, while the insight of the I-M-C framework is to identify all targets of interest and then determine whether their states have changed. Based on these two frameworks, we instantiate several methods, e.g., SAM-DINOv2-SegEarth-OV, Grounding-DINO-SAM2-DINO, etc. Extensive evaluations on 5 benchmark datasets demonstrate the superior generalization and robustness of our OVCD methods over existing supervised and unsupervised methods. To support continued exploration, we release DynamicEarth, a dedicated codebase designed to advance research and application of OVCD. this https URL
zh

[CV-18] PreciseCam: Precise Camera Control for Text-to-Image Generation

【速读】:该论文试图解决当前文本到图像生成模型(text-to-image models)在生成图像时缺乏对相机角度和镜头畸变等精确控制的问题。现有的方法通常依赖于预定义的拍摄角度或复杂的几何信息,限制了生成图像的灵活性和艺术表现力。论文提出了一种高效且通用的解决方案,通过仅使用四个简单的外参(extrinsic)和内参(intrinsic)相机参数,实现了对生成图像中相机视角的精确控制。这种方法无需依赖预定义的几何结构、参考3D对象或多视角数据,显著简化了生成过程。此外,论文还引入了一个包含超过57,000张图像及其对应文本提示和真实相机参数的新数据集,用于模型训练和评估。实验结果表明,该方法在文本到图像生成中的相机控制精度优于传统的提示工程(prompt engineering)方法。

链接: https://arxiv.org/abs/2501.12910
作者: Edurne Bernal-Berdun,Ana Serrano,Belen Masia,Matheus Gadelha,Yannick Hold-Geoffroy,Xin Sun,Diego Gutierrez
机构: Universidad de Zaragoza, I3A(萨拉戈萨大学, I3A); Adobe Research(Adobe研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Images as an artistic medium often rely on specific camera angles and lens distortions to convey ideas or emotions; however, such precise control is missing in current text-to-image models. We propose an efficient and general solution that allows precise control over the camera when generating both photographic and artistic images. Unlike prior methods that rely on predefined shots, we rely solely on four simple extrinsic and intrinsic camera parameters, removing the need for pre-existing geometry, reference 3D objects, and multi-view data. We also present a novel dataset with more than 57,000 images, along with their text prompts and ground-truth camera parameters. Our evaluation shows precise camera control in text-to-image generation, surpassing traditional prompt engineering approaches. Our data, model, and code are publicly available at this https URL.
zh

[CV-19] DocTTT: Test-Time Training for Handwritten Document Recognition Using Meta-Auxiliary Learning DATE WACV2025

【速读】:该论文试图解决手写文档识别(Handwritten Document Recognition, HDR)中在复杂背景、多样手写风格和不同文档布局下高效且准确识别文本的挑战,尤其是在标注数据稀缺的情况下。解决方案的关键在于提出了DocTTT框架,该框架通过测试时训练(test-time training)在测试阶段自适应每个特定输入。具体而言,论文提出了一种结合元学习(Meta-learning)和自监督掩码自编码器(Masked Autoencoder, MAE)的元辅助学习方法。在测试阶段,使用自监督MAE损失来调整视觉表示参数;在训练阶段,通过元学习框架学习模型参数,使模型能够有效适应新输入。实验结果表明,该方法在基准数据集上显著优于现有的最先进方法。

链接: https://arxiv.org/abs/2501.12898
作者: Wenhao Gu,Li Gu,Ziqiang Wang,Ching Yee Suen,Yang Wang
机构: Department of Computer Science and Software Engineering, Concordia University (康考迪亚大学计算机科学与软件工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV2025, camera ready with updated reference

点击查看摘要

Abstract:Despite recent significant advancements in Handwritten Document Recognition (HDR), the efficient and accurate recognition of text against complex backgrounds, diverse handwriting styles, and varying document layouts remains a practical challenge. Moreover, this issue is seldom addressed in academic research, particularly in scenarios with minimal annotated data available. In this paper, we introduce the DocTTT framework to address these challenges. The key innovation of our approach is that it uses test-time training to adapt the model to each specific input during testing. We propose a novel Meta-Auxiliary learning approach that combines Meta-learning and self-supervised Masked Autoencoder (MAE). During testing, we adapt the visual representation parameters using a self-supervised MAE loss. During training, we learn the model parameters using a meta-learning framework, so that the model parameters are learned to adapt to a new input effectively. Experimental results show that our proposed method significantly outperforms existing state-of-the-art approaches on benchmark datasets.
zh
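
下面用占位模块给出测试时训练的极简流程示意:以 MAE 风格的掩码重建损失在测试阶段只更新视觉表示(编码器)参数,再做识别。逐像素掩码、步数、学习率与各模块结构均为示意性假设;论文实际从元学习得到的初始化出发进行适应。

```python
import torch

def test_time_adapt(encoder, decoder, recog_head, image, steps=3, lr=1e-4,
                    mask_ratio=0.75):
    """测试时训练的极简示意:用自监督掩码重建损失只更新编码器,再做识别。"""
    opt = torch.optim.SGD(encoder.parameters(), lr=lr)
    for _ in range(steps):
        mask = (torch.rand_like(image) > mask_ratio).float()   # 1 = 可见像素
        recon = decoder(encoder(image * mask))
        loss = ((recon - image) ** 2 * (1 - mask)).mean()       # 只在被掩位置计损失
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return recog_head(encoder(image))                       # 适应后的识别输出

# 用法示意(各模块均为占位):
enc = torch.nn.Conv2d(1, 8, 3, padding=1)
dec = torch.nn.Conv2d(8, 1, 3, padding=1)
head = torch.nn.Flatten()
pred = test_time_adapt(enc, dec, head, torch.rand(1, 1, 32, 32))
print(pred.shape)
```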

[CV-20] CrossDiff: Diffusion Probabilistic Model With Cross-conditional Encoder-Decoder for Crack Segmentation

【速读】:该论文试图解决工业混凝土表面裂缝分割(Crack Segmentation)中的挑战性问题,尤其是针对形态复杂且细长的裂缝。传统分割方法在处理此类裂缝时往往难以精确定位,导致维护和修复过程的效率低下。论文提出的解决方案是CrossDiff模型,这是一种基于扩散概率模型(diffusion probabilistic model)的新型方法,首次将扩散模型引入裂缝分割任务。CrossDiff的关键创新在于其交叉条件编码器-解码器(cross-conditional encoder-decoder)结构,通过交叉编码器(cross-encoder)增强对裂缝细节的保留能力,并通过交叉解码器(cross-decoder)提取裂缝的语义特征,从而更好地处理细长裂缝。实验结果表明,CrossDiff在多个具有挑战性的裂缝数据集上表现优异,Dice分数和IoU指标均优于现有最先进方法8.0%。

链接: https://arxiv.org/abs/2501.12860
作者: Xianglong Shi,Yunhan Jiang,Xiaoheng Jiang,Mingling Xu,Yang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Crack Segmentation in industrial concrete surfaces is a challenging task because cracks usually exhibit intricate morphology with slender appearances. Traditional segmentation methods often struggle to accurately locate such cracks, leading to inefficiencies in maintenance and repair processes. In this paper, we propose a novel diffusion-based model with a cross-conditional encoder-decoder, named CrossDiff, which is the first to introduce the diffusion probabilistic model for the crack segmentation task. Specifically, CrossDiff integrates a cross-encoder and a cross-decoder into the diffusion model to constitute a cross-shaped diffusion model structure. The cross-encoder enhances the ability to retain crack details and the cross-decoder helps extract the semantic features of cracks. As a result, CrossDiff can better handle slender cracks. Extensive experiments were conducted on five challenging crack datasets including CFD, CrackTree200, DeepCrack, GAPs384, and Rissbilder. The results demonstrate that the proposed CrossDiff model achieves impressive performance, outperforming other state-of-the-art methods by 8.0% in terms of both Dice score and IoU. The code will be open-source soon.
zh

[CV-21] GAMED-Snake: Gradient-aware Adaptive Momentum Evolution Deep Snake Model for Multi-organ Segmentation

【速读】:该论文试图解决多器官分割(multi-organ segmentation)中的挑战,包括复杂的解剖背景、模糊的边界以及多样的形态学特征。为了解决这些问题,论文提出了Gradient-aware Adaptive Momentum Evolution Deep Snake (GAMED-Snake)模型,其关键创新点在于:首先,Distance Energy Map Prior (DEMP)通过生成像素级的力场,有效引导轮廓点向真实边界靠近,即使在复杂背景和模糊边缘的情况下也能实现精确分割。其次,Differential Convolution Inception Module (DCIM)能够精确提取全面的能量梯度,显著提升分割精度。最后,Adaptive Momentum Evolution Mechanism (AMEM)利用交叉注意力机制在不同迭代过程中建立动态特征,从而实现对多样化形态的精确边界对齐。实验结果表明,GAMED-Snake在四个具有挑战性的多器官分割数据集上,相较于现有最先进方法,mDice指标提升了约2%。

链接: https://arxiv.org/abs/2501.12844
作者: Ruicheng Zhang,Haowei Guo,Zeyu Zhang,Puxin Yan,Shen Zhao
机构: Sun Yat-sen University(中山大学); The Australian National University(澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-organ segmentation is a critical yet challenging task due to complex anatomical backgrounds, blurred boundaries, and diverse morphologies. This study introduces the Gradient-aware Adaptive Momentum Evolution Deep Snake (GAMED-Snake) model, which establishes a novel paradigm for contour-based segmentation by integrating gradient-based learning with adaptive momentum evolution mechanisms. The GAMED-Snake model incorporates three major innovations: First, the Distance Energy Map Prior (DEMP) generates a pixel-level force field that effectively attracts contour points towards the true boundaries, even in scenarios with complex backgrounds and blurred edges. Second, the Differential Convolution Inception Module (DCIM) precisely extracts comprehensive energy gradients, significantly enhancing segmentation accuracy. Third, the Adaptive Momentum Evolution Mechanism (AMEM) employs cross-attention to establish dynamic features across different iterations of evolution, enabling precise boundary alignment for diverse morphologies. Experimental results on four challenging multi-organ segmentation datasets demonstrate that GAMED-Snake improves the mDice metric by approximately 2% compared to state-of-the-art methods. Code will be available at this https URL.
zh

[CV-22] AMM-Diff: Adaptive Multi-Modality Diffusion Network for Missing Modality Imputation

【速读】:该论文试图解决在临床实践中由于复杂的采集协议、严格的隐私法规或特定的临床需求,导致无法获取完整的成像数据(如多模态磁共振成像,MR)的问题。这一问题在脑肿瘤分割等任务中尤为突出,因为每种模态都提供了互补的信息,对提高分割精度至关重要。论文提出的解决方案是缺失数据填补(missing data imputation),即从可用的模态中生成缺失的模态。其关键在于提出了一种基于扩散模型的自适应多模态生成网络(Adaptive Multi-Modality Diffusion Network, AMM-Diff),该网络能够处理任意数量的输入模态并生成缺失的模态。具体而言,论文设计了一个图像-频率融合网络(Image-Frequency Fusion Network, IFFN),通过自监督预训练任务学习全输入模态及其高频傅里叶分量的统一特征表示。扩散模型利用这一表示,结合自适应重建策略,实现了缺失模态的填补。实验结果表明,该方法在BraTS 2021数据集上表现优异。

链接: https://arxiv.org/abs/2501.12840
作者: Aghiles Kebaili,Jérôme Lapuyade-Lahorgue,Pierre Vera,Su Ruan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In clinical practice, full imaging is not always feasible, often due to complex acquisition protocols, stringent privacy regulations, or specific clinical needs. However, missing MR modalities pose significant challenges for tasks like brain tumor segmentation, especially in deep learning-based segmentation, as each modality provides complementary information crucial for improving accuracy. A promising solution is missing data imputation, where absent modalities are generated from available ones. While generative models have been widely used for this purpose, most state-of-the-art approaches are limited to single or dual target translations, lacking the adaptability to generate missing modalities based on varying input configurations. To address this, we propose an Adaptive Multi-Modality Diffusion Network (AMM-Diff), a novel diffusion-based generative model capable of handling any number of input modalities and generating the missing ones. We designed an Image-Frequency Fusion Network (IFFN) that learns a unified feature representation through a self-supervised pretext task across the full input modalities and their selected high-frequency Fourier components. The proposed diffusion model leverages this representation, encapsulating prior knowledge of the complete modalities, and combines it with an adaptive reconstruction strategy to achieve missing modality completion. Experimental results on the BraTS 2021 dataset demonstrate the effectiveness of our approach.
zh

[CV-23] Enhancing Monocular Depth Estimation with Multi-Source Auxiliary Tasks WACV2025

【速读】:该论文试图解决单目深度估计(Monocular Depth Estimation, MDE)任务中高质量标注数据稀缺且成本高昂的问题。解决方案的关键在于利用来自相关视觉任务的辅助数据集,通过交替训练方案(alternating training scheme)进行训练,并在预训练的视觉基础模型(vision foundation model)上构建共享解码器(shared decoder),同时给予MDE任务更高的权重。实验表明,通过引入多种领域内辅助数据集和任务,MDE质量平均提升了约11%。此外,研究还发现,使用语义分割数据集作为多标签密集分类(Multi-Label Dense Classification, MLDC)任务通常能带来额外的质量提升。该方法显著提高了MDE数据集的数据效率,在减少数据集规模至少80%的同时提升了其质量,为在高质量标注数据有限的情况下利用相关任务的辅助数据改进MDE质量提供了新的途径。

链接: https://arxiv.org/abs/2501.12824
作者: Alessio Quercia,Erenus Yildiz,Zhuo Cao,Kai Krajsek,Abigail Morrison,Ira Assent,Hanno Scharr
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted at WACV 2025

点击查看摘要

Abstract:Monocular depth estimation (MDE) is a challenging task in computer vision, often hindered by the cost and scarcity of high-quality labeled datasets. We tackle this challenge using auxiliary datasets from related vision tasks for an alternating training scheme with a shared decoder built on top of a pre-trained vision foundation model, while giving a higher weight to MDE. Through extensive experiments, we demonstrate the benefits of incorporating various in-domain auxiliary datasets and tasks to improve MDE quality on average by ~11%. Our experimental analysis shows that auxiliary tasks have different impacts, confirming the importance of task selection and highlighting that quality gains are not achieved by merely adding data. Remarkably, our study reveals that using semantic segmentation datasets as Multi-Label Dense Classification (MLDC) often results in additional quality gains. Lastly, our method significantly improves the data efficiency for the considered MDE datasets, enhancing their quality while reducing their size by at least 80%. This paves the way for using auxiliary data from related tasks to improve MDE quality despite limited availability of high-quality labeled data. Code is available at this https URL.
zh
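
交替训练方案的骨架可以写成:共享骨干 + 各任务头,按任务轮流取 batch,并用任务权重缩放损失(MDE 赋更高权重)。下面是一个极简 PyTorch 示意,任务名、权重数值与损失形式均为假设:

```python
import torch

def alternating_step(backbone, heads, batches, weights, opt):
    """交替训练的极简示意:每步轮流处理一个任务的 batch,
    按任务权重缩放损失;任务与权重均为示意性假设。"""
    for task, (x, y) in batches.items():
        pred = heads[task](backbone(x))
        loss = weights[task] * torch.nn.functional.mse_loss(pred, y)
        opt.zero_grad(); loss.backward(); opt.step()

backbone = torch.nn.Linear(16, 32)                  # 占位的共享骨干/解码器
heads = {'mde': torch.nn.Linear(32, 1), 'seg': torch.nn.Linear(32, 1)}
weights = {'mde': 1.0, 'seg': 0.3}                  # MDE 权重更高(数值为假设)
params = list(backbone.parameters()) + [p for h in heads.values()
                                        for p in h.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)
batches = {t: (torch.randn(4, 16), torch.randn(4, 1)) for t in heads}
alternating_step(backbone, heads, batches, weights, opt)
```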

[CV-24] Machine Learning Modeling for Multi-order Human Visual Motion Processing

【速读】:该论文试图解决计算机视觉(CV)模型与生物视觉系统在感知视觉运动方面的差距问题。尽管基于深度神经网络(DNN)的模型在自然图像中的光流估计方面取得了显著进展,但CV模型与生物视觉系统在架构和行为上仍存在显著差异,特别是在感知高阶图像特征(如二阶运动)方面。许多CV模型由于依赖强度守恒定律而无法捕捉二阶运动。论文提出的解决方案关键是通过模仿大脑皮层V1-MT运动处理通路,设计了一个双通路模型架构。该架构包括一个可训练的运动能量传感器库和一个循环图网络,用于模拟一阶(基于亮度)运动感知。对于二阶运动,模型引入了一个额外的感知通路,采用非线性预处理和简单的多层3D卷积神经网络(CNN)块来实现。通过在具有不同材料属性的运动物体数据集上进行训练,模型能够自然地获得感知二阶运动的能力,从而与生物系统保持一致,并能够泛化到自然场景中的一阶和二阶运动现象。

链接: https://arxiv.org/abs/2501.12810
作者: Zitang Sun,Yen-Ju Chen,Yung-Hao Yang,Yuan Li,Shin’ya Nishida
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Our research aims to develop machines that learn to perceive visual motion as do humans. While recent advances in computer vision (CV) have enabled DNN-based models to accurately estimate optical flow in naturalistic images, a significant disparity remains between CV models and the biological visual system in both architecture and behavior. This disparity includes humans’ ability to perceive the motion of higher-order image features (second-order motion), which many CV models fail to capture because of their reliance on the intensity conservation law. Our model architecture mimics the cortical V1-MT motion processing pathway, utilizing a trainable motion energy sensor bank and a recurrent graph network. Supervised learning employing diverse naturalistic videos allows the model to replicate psychophysical and physiological findings about first-order (luminance-based) motion perception. For second-order motion, inspired by neuroscientific findings, the model includes an additional sensing pathway with nonlinear preprocessing before motion energy sensing, implemented using a simple multilayer 3D CNN block. When exploring how the brain acquired the ability to perceive second-order motion in natural environments, in which pure second-order signals are rare, we hypothesized that second-order mechanisms were critical when estimating robust object motion amidst optical fluctuations, such as highlights on glossy surfaces. We trained our dual-pathway model on novel motion datasets with varying material properties of moving objects. We found that training to estimate object motion from non-Lambertian materials naturally endowed the model with the capacity to perceive second-order motion, as can humans. The resulting model effectively aligns with biological systems while generalizing to both first- and second-order motion phenomena in natural scenes.
zh

[CV-25] Modality Unified Attack for Omni-Modality Person Re-Identification

【速读】:该论文旨在解决基于深度学习的行人重识别(re-id)模型在面对对抗样本(AEs)时的脆弱性问题,特别是针对单模态、跨模态和多模态模型的统一攻击问题。由于在实际的黑箱监控系统中,攻击者无法预知目标系统部署的具体模型类型,因此论文提出了一种新颖的模态统一攻击方法(Modality Unified Attack, MUA),通过训练模态特定的对抗生成器来生成能够有效攻击不同模态模型的对抗样本。解决方案的关键在于采用多模态模型作为替代模型,并在特征融合前通过度量破坏损失(metric disruption loss)对每个模态的特征进行扰动。此外,论文引入了跨模态模拟破坏(Cross Modality Simulated Disruption)方法,通过故意将图像输入到非对应的模态特定子网络中来模拟跨模态特征嵌入,以及多模态协作破坏(Multi Modality Collaborative Disruption)策略,利用多模态特征协作度量破坏损失来全面破坏行人图像的信息内容。实验结果表明,该方法能够有效攻击全模态行人重识别模型,显著降低了模型的性能。

链接: https://arxiv.org/abs/2501.12761
作者: Yuan Bian,Min Liu,Yunqi Yi,Xueping Wang,Yunfeng Ma,Yaonan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages,3 figures

点击查看摘要

Abstract:Deep learning based person re-identification (re-id) models have been widely employed in surveillance systems. Recent studies have demonstrated that black-box single-modality and cross-modality re-id models are vulnerable to adversarial examples (AEs), leaving the robustness of multi-modality re-id models unexplored. Due to the lack of knowledge about the specific type of model deployed in the target black-box surveillance system, we aim to generate modality unified AEs for omni-modality (single-, cross- and multi-modality) re-id models. Specifically, we propose a novel Modality Unified Attack method to train modality-specific adversarial generators to generate AEs that effectively attack different omni-modality models. A multi-modality model is adopted as the surrogate model, wherein the features of each modality are perturbed by metric disruption loss before fusion. To collapse the common features of omni-modality models, Cross Modality Simulated Disruption approach is introduced to mimic the cross-modality feature embeddings by intentionally feeding images to non-corresponding modality-specific subnetworks of the surrogate model. Moreover, Multi Modality Collaborative Disruption strategy is devised to facilitate the attacker to comprehensively corrupt the informative content of person images by leveraging a multi modality feature collaborative metric disruption loss. Extensive experiments show that our MUA method can effectively attack the omni-modality re-id models, achieving 55.9%, 24.4%, 49.0% and 62.7% mean mAP Drop Rate, respectively.
zh
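
摘要中的度量破坏损失(metric disruption loss)大意是把扰动后特征推离原始特征。下面以负余弦相似度给出一个极简示意;具体的度量形式与特征维度均为假设,并非论文原始定义:

```python
import torch

def metric_disruption_loss(adv_feats, clean_feats):
    """度量破坏损失的极简示意:最小化余弦相似度,
    即把对抗样本特征推离原始特征(度量形式为示意性假设)。"""
    a = torch.nn.functional.normalize(adv_feats, dim=-1)
    c = torch.nn.functional.normalize(clean_feats, dim=-1)
    return (a * c).sum(dim=-1).mean()   # 最小化该项 = 推远两种特征

loss = metric_disruption_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
# 论文中该损失作用在替代模型各模态融合前的特征上,由对抗生成器最小化
```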

[CV-26] Patent Figure Classification using Large Vision-language Models

【速读】:该论文旨在解决专利图像分类(patent figure classification)中的多维度分类问题,特别是在零样本(zero-shot)和少样本(few-shot)学习场景下的挑战。现有的方法通常只针对单一维度或有限数量的概念进行分类,而本文则探索了大规模视觉-语言模型(LVLMs)在专利图像视觉问答(VQA)和分类任务中的有效性。为了支持这一研究,作者引入了两个新的数据集:PatFigVQA和PatFigCLS,用于对专利图像的多个维度(如类型、投影、专利类别和对象)进行微调和评估。关键解决方案包括提出了一种新颖的锦标赛式分类策略(tournament-style classification strategy),通过一系列多选题来高效处理大量类别。实验结果表明,基于LVLMs和卷积神经网络(CNNs)的少样本分类方法具有可行性。

链接: https://arxiv.org/abs/2501.12751
作者: Sushil Awale,Eric Müller-Budack,Ralph Ewerth
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Patent figure classification facilitates faceted search in patent retrieval systems, enabling efficient prior art search. Existing approaches have explored patent figure classification for only a single aspect and for aspects with a limited number of concepts. In recent years, large vision-language models (LVLMs) have shown tremendous performance across numerous computer vision downstream tasks, however, they remain unexplored for patent figure classification. Our work explores the efficacy of LVLMs in patent figure visual question answering (VQA) and classification, focusing on zero-shot and few-shot learning scenarios. For this purpose, we introduce new datasets, PatFigVQA and PatFigCLS, for fine-tuning and evaluation regarding multiple aspects of patent figures (i.e., type, projection, patent class, and objects). For a computational-effective handling of a large number of classes using LVLM, we propose a novel tournament-style classification strategy that leverages a series of multiple-choice questions. Experimental results and comparisons of multiple classification approaches based on LVLMs and Convolutional Neural Networks (CNNs) in few-shot settings show the feasibility of the proposed approaches.
zh
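
锦标赛式分类策略可以用几行纯 Python 表达:每轮把候选类别分组,每组问一道多选题,胜者晋级。下面的 ask 只是占位接口,用一个玩具打分函数模拟一次 LVLM 多选问答;类别名与分组大小均为假设:

```python
def tournament_classify(classes, ask, group_size=4):
    """锦标赛式分类的极简示意:每轮把候选类别分成小组,
    对每组提一道多选题(ask 模拟一次 LVLM 问答,返回组内胜出类别),
    胜者晋级,直到只剩一个类别。"""
    candidates = list(classes)
    while len(candidates) > 1:
        winners = []
        for i in range(0, len(candidates), group_size):
            group = candidates[i:i + group_size]
            winners.append(group[0] if len(group) == 1 else ask(group))
        candidates = winners
    return candidates[0]

# 用法示意:用玩具打分函数代替真实的 LVLM 多选问答
scores = {f'class_{i}': i * 7 % 101 for i in range(37)}
best = tournament_classify(scores, ask=lambda g: max(g, key=scores.get))
print(best)   # 每题淘汰 group_size-1 个候选,总问答数约 (N-1)/(group_size-1)
```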

[CV-27] Bad-PFL: Exploring Backdoor Attacks against Personalized Federated Learning ICLR2025

【速读】:该论文试图解决个性化联邦学习(Personalized Federated Learning, PFL)中的后门攻击(backdoor attacks)问题。尽管PFL在面对传统后门攻击时表现出一定的免疫性,但现有的后门攻击方法在PFL中失效,主要是因为手动设计的触发器(triggers)难以在个性化模型中持续存在。为了解决这一问题,论文提出了Bad-PFL,其关键创新在于利用自然数据的特征作为触发器。由于模型在训练过程中不可避免地会接触到自然数据,因此这种触发器能够长期嵌入到个性化模型中。此外,Bad-PFL通过触发器与模型的相互强化训练,进一步增强了后门的持久性和攻击效果。实验结果表明,该方法在多个基准数据集上对多种PFL方法均表现出优越的攻击性能,即使面对最先进的防御机制也能有效实施攻击。

链接: https://arxiv.org/abs/2501.12736
作者: Mingyuan Fan,Zhanyi Hu,Fuyi Wang,Cen Chen
机构: East China Normal University(华东师范大学); Deakin University(迪肯大学)
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2025

点击查看摘要

Abstract:Data heterogeneity and backdoor attacks rank among the most significant challenges facing federated learning (FL). For data heterogeneity, personalized federated learning (PFL) enables each client to maintain a private personalized model to cater to client-specific knowledge. Meanwhile, vanilla FL has proven vulnerable to backdoor attacks. However, recent advancements in the PFL community have demonstrated a potential immunity against such attacks. This paper explores this intersection further, revealing that existing federated backdoor attacks fail in PFL because backdoors based on manually designed triggers struggle to survive in personalized models. To tackle this, we design Bad-PFL, which employs features from natural data as our trigger. As long as the model is trained on natural data, it inevitably embeds the backdoor associated with our trigger, ensuring its longevity in personalized models. Moreover, our trigger undergoes mutual reinforcement training with the model, further solidifying the backdoor’s durability and enhancing attack effectiveness. The large-scale experiments across three benchmark datasets demonstrate the superior performance of our attack against various PFL methods, even when equipped with state-of-the-art defense mechanisms.
zh

[CV-28] Combining Knowledge Graph and LLM s for Enhanced Zero-shot Visual Question Answering

【速读】:该论文试图解决零样本视觉问答(Zero-shot Visual Question Answering, ZS-VQA)中的关键问题,即在没有提供训练样本的情况下,如何准确回答视觉问题。现有研究分别利用知识图谱(knowledge graph)或大语言模型(Large Language Models, LLMs)作为外部信息源来帮助模型理解图像和问题,但LLMs在准确解释特定问题含义方面存在困难,而知识图谱虽然具有丰富的实体关系,却难以有效将实体与图像内容连接起来。论文提出了一种新颖的设计,结合知识图谱和LLMs的优势,通过LLMs的强大理解能力准确解释图像内容,并利用知识图谱扩展和连接用户查询与图像内容,从而实现更好的视觉问答。此外,论文还引入了一种优化算法,用于确定来自不同信息源的损失函数的最优权重,以获得全局最优的候选答案集。实验结果表明,该模型在两个基准数据集上达到了最先进的性能(state-of-the-art, SOTA)。

链接: https://arxiv.org/abs/2501.12697
作者: Qian Tao,Xiaoyang Fan,Yong Xu,Xingquan Zhu,Yufei Tang
机构: South China University of Technology (华南理工大学); Florida Atlantic University (佛罗里达大西洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot visual question answering (ZS-VQA), an emerging critical research area, intends to answer visual questions without providing training samples. Existing research in ZS-VQA has proposed to leverage knowledge graphs or large language models (LLMs), respectively, as external information sources to help VQA models comprehend images and questions. However, LLMs often struggle in accurately interpreting specific question meanings. Meanwhile, although knowledge graphs have rich entity relationships, it is challenging to effectively connect entities to individual image content for visual question answering. In this paper, we propose a novel design to combine knowledge graphs and LLMs for zero-shot visual question answering. Our approach uses LLMs’ powerful understanding capabilities to accurately interpret image content through a strategic question search mechanism. Meanwhile, the knowledge graph is used to expand and connect users’ queries to the image content for better visual question answering. An optimization algorithm is further used to determine the optimal weights for the loss functions derived from different information sources, towards a globally optimal set of candidate answers. Experimental results on two benchmark datasets demonstrate that our model achieves state-of-the-art (SOTA) performance. Both source code and benchmark data will be released for public access.
zh

[CV-29] Can masking background and object reduce static bias for zero-shot action recognition?

【速读】:该论文试图解决零样本动作识别(zero-shot action recognition)中的静态偏差(static bias)问题。静态偏差指的是模型在识别动作时过度依赖静态外观特征(如背景和物体),而不是人类动作本身。尽管基于CLIP(Contrastive Language–Image Pretraining)的零样本模型已经广泛应用,但它们是否足够关注人类动作仍不明确,因为CLIP主要捕捉与语言相关的外观特征。论文通过在不同训练和验证阶段对背景、物体和人物进行不同程度的掩码(masking)来研究静态偏差的影响。实验结果表明,掩码背景或物体可以有效减少模型对静态偏差的依赖,使其更专注于人类动作。具体而言,掩码背景在Kinetics400数据集上降低了模型性能,而在Mimetics数据集上则提升了性能;同时,对SSv2数据集中的背景和物体进行不同颜色的掩码也显著提高了模型性能。这些发现表明,掩码策略是解决静态偏差问题的关键。

链接: https://arxiv.org/abs/2501.12681
作者: Takumi Fukuzawa,Kensho Hara,Hirokatsu Kataoka,Toru Tamaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: In proc. of MMM2025

点击查看摘要

Abstract:In this paper, we address the issue of static bias in zero-shot action recognition. Action recognition models need to represent the action itself, not the appearance. However, some fully-supervised works show that models often rely on static appearances, such as the background and objects, rather than human actions. This issue, known as static bias, has not been investigated in the zero-shot setting. Although CLIP-based zero-shot models are now common, it remains unclear if they sufficiently focus on human actions, as CLIP primarily captures appearance features related to languages. In this paper, we investigate the influence of static bias in zero-shot action recognition with CLIP-based models. Our approach involves masking backgrounds, objects, and people differently during training and validation. Experiments with masking the background show that models depend on background bias, as their performance decreases for Kinetics400. However, for Mimetics, which has a weak background bias, masking the background leads to improved performance even if the background is masked during validation. Furthermore, masking both the background and objects in different colors improves performance for SSv2, which has a strong object bias. These results suggest that masking the background or objects during training prevents models from overly depending on static bias and makes them focus more on human action.
zh
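
论文的掩码协议大意是在训练/验证时分别遮蔽背景、物体或人物。下面给出"保留人物、背景填充纯色"的极简张量操作示意;掩码来源(此处为随机占位)与填充色均为假设:

```python
import torch

def mask_background(video, person_mask, fill=0.5):
    """遮蔽背景的极简示意:video 形状 (T,C,H,W),person_mask 形状 (T,1,H,W),
    取值 {0,1};人物区域保留,背景替换为纯色 fill(颜色选择为示意性假设)。"""
    return video * person_mask + fill * (1 - person_mask)

video = torch.rand(16, 3, 112, 112)
mask = (torch.rand(16, 1, 112, 112) > 0.7).float()   # 占位的人物分割掩码
masked = mask_background(video, mask)
print(masked.shape)  # torch.Size([16, 3, 112, 112])
# 交换 mask 与 1-mask 即可改为"遮蔽人物、保留背景"等其他协议
```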

[CV-30] Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization

【速读】:该论文试图解决Sharpness-Aware Minimization (SAM)在训练过程中对泛化能力提升的有效性机制不明确的问题。尽管SAM在各种任务中表现出显著的泛化能力提升,但其背后的原理尚未得到充分理解。论文通过分析SAM的训练动态,使用Hessian矩阵的最大特征值作为锐度(sharpness)的度量,提出了一个三阶随机微分方程(SDE),揭示了训练动态由二阶和三阶项的复杂混合驱动。研究发现,扰动向量与Hessian矩阵的顶部特征向量的对齐对SAM在锐度正则化中的有效性至关重要,但在实践中这种对齐往往不足,限制了SAM的效率。基于这些发现,论文提出了Eigen-SAM算法,该算法通过显式地将扰动向量与顶部特征向量对齐,旨在正则化Hessian矩阵的顶部特征值,从而提升SAM的效率。

链接: https://arxiv.org/abs/2501.12666
作者: Haocheng Luo,Tuan Truong,Tung Pham,Mehrtash Harandi,Dinh Phung,Trung Le
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) has attracted significant attention for its effectiveness in improving generalization across various tasks. However, its underlying principles remain poorly understood. In this work, we analyze SAM’s training dynamics using the maximum eigenvalue of the Hessian as a measure of sharpness, and propose a third-order stochastic differential equation (SDE), which reveals that the dynamics are driven by a complex mixture of second- and third-order terms. We show that alignment between the perturbation vector and the top eigenvector is crucial for SAM’s effectiveness in regularizing sharpness, but find that this alignment is often inadequate in practice, limiting SAM’s efficiency. Building on these insights, we introduce Eigen-SAM, an algorithm that explicitly aims to regularize the top Hessian eigenvalue by aligning the perturbation vector with the leading eigenvector. We validate the effectiveness of our theory and the practical advantages of our proposed approach through comprehensive experiments. Code is available at this https URL.
zh
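
Eigen-SAM 需要 Hessian 的顶特征向量来对齐扰动方向。顶特征向量通常可用 Hessian-向量积加幂迭代近似,无需显式构造 Hessian;下面是一个极简 PyTorch 示意(迭代次数等超参为假设,且仅为该子步骤的草图,非完整算法):

```python
import torch

def top_hessian_eigvec(loss, params, iters=10):
    """用幂迭代近似 Hessian 顶特征向量的极简示意:
    通过两次反向传播得到 Hessian-向量积,避免显式构造 Hessian。"""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)   # Hv
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv)) + 1e-12
        v = [h / norm for h in hv]
    return v   # 单位化的近似顶特征向量,用于与 SAM 扰动方向对齐

# 用法示意(小模型):
model = torch.nn.Linear(8, 1)
x, y = torch.randn(16, 8), torch.randn(16, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
v = top_hessian_eigvec(loss, list(model.parameters()))
print([t.shape for t in v])
```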

[CV-31] DWTNeRF: Boosting Few-shot Neural Radiance Fields via Discrete Wavelet Transform

【速读】:该论文旨在解决Neural Radiance Fields (NeRF) 在新视角合成和3D场景表示中的两个主要问题:收敛速度慢和对密集训练视图的依赖。为此,作者提出了DWTNeRF,这是一个基于Instant-NGP快速训练哈希编码的统一框架。该框架结合了专门为少样本NeRF设计的正则化项,能够在稀疏训练视图下有效工作。DWTNeRF的关键创新在于引入了一种新颖的离散小波损失(Discrete Wavelet loss),该损失允许在训练目标中直接优先处理低频信息,从而减少少样本NeRF在早期训练阶段对高频信息的过拟合。此外,作者还提出了一种基于多头注意力机制(multi-head attention)的模型方法,该方法与对架构变化敏感的INGP模型兼容。实验结果表明,在3-shot LLFF基准测试中,DWTNeRF在PSNR、SSIM和LPIPS指标上分别比Vanilla NeRF提升了15.07%、24.45%和36.30%。这一方法促使对当前基于INGP模型的少样本方法进行重新思考。

链接: https://arxiv.org/abs/2501.12637
作者: Hung Nguyen,Blark Runfa Li,Truong Nguyen
机构: Video Processing Lab, UC San Diego (视频处理实验室, 加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) has achieved superior performance in novel view synthesis and 3D scene representation, but its practical applications are hindered by slow convergence and reliance on dense training views. To this end, we present DWTNeRF, a unified framework based on Instant-NGP’s fast-training hash encoding. It is coupled with regularization terms designed for few-shot NeRF, which operates on sparse training views. Our DWTNeRF includes a novel Discrete Wavelet loss that allows explicit prioritization of low frequencies directly in the training objective, reducing few-shot NeRF’s overfitting on high frequencies in earlier training stages. We additionally introduce a model-based approach, based on multi-head attention, that is compatible with INGP-based models, which are sensitive to architectural changes. On the 3-shot LLFF benchmark, DWTNeRF outperforms Vanilla NeRF by 15.07% in PSNR, 24.45% in SSIM and 36.30% in LPIPS. Our approach encourages a re-thinking of current few-shot approaches for INGP-based models.
zh
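
离散小波损失的直观做法是:对预测与目标分别取低频(近似)系数,并在训练目标中加大其权重。下面用"Haar 低频系数 = 2×2 平均池化 ×2"这一恒等式给出单层示意;低频权重与只取单层均为假设,非论文原始设定:

```python
import torch
import torch.nn.functional as F

def haar_lowfreq(x):
    """单层 Haar 小波的近似(低频)系数:数值上等于 2×2 平均池化再乘 2。"""
    return 2.0 * F.avg_pool2d(x, kernel_size=2)

def dwt_low_loss(pred, target, w_low=2.0):
    """离散小波损失的极简示意:显式加大低频项权重,
    让训练早期优先拟合低频结构;w_low 为示意性假设值。"""
    pixel = F.mse_loss(pred, target)
    low = F.mse_loss(haar_lowfreq(pred), haar_lowfreq(target))
    return pixel + w_low * low

pred, target = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(dwt_low_loss(pred, target).item())
```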

[CV-32] Multiple Queries with Multiple Keys: A Precise Prompt Matching Paradigm for Prompt-based Continual Learning

【速读】:该论文旨在解决持续学习(Continual Learning)中提示选择(Prompt Selection)准确率低的问题,这一问题可能导致模型接收有偏知识并做出有偏预测。现有的基于提示的持续学习方法通过提示扩展和选择来有效缓解灾难性遗忘(Catastrophic Forgetting),但在提示选择过程中往往存在精度不足的缺陷。为此,论文提出了多查询多键(Multiple Queries with Multiple Keys, MQMK)提示匹配范式,其核心在于选择与测试样本数据分布最接近的训练数据提示。具体而言,多查询通过引入任务特定知识实现精确的广度搜索,而多键则通过细粒度表示训练样本的特征分布进行深度搜索。实验表明,MQMK在具有挑战性的场景中将提示匹配率提升了30%以上,并在三个广泛采用的持续学习基准上达到了最先进的性能。

链接: https://arxiv.org/abs/2501.12635
作者: Dunwei Tu,Huiyu Yi,Yuchi Wang,Baile Xu,Jian Zhao,Furao Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual learning requires machine learning models to continuously acquire new knowledge in dynamic environments while avoiding the forgetting of previous knowledge. Prompt-based continual learning methods effectively address the issue of catastrophic forgetting through prompt expansion and selection. However, existing approaches often suffer from low accuracy in prompt selection, which can result in the model receiving biased knowledge and making biased predictions. To address this issue, we propose the Multiple Queries with Multiple Keys (MQMK) prompt matching paradigm for precise prompt selection. The goal of MQMK is to select the prompts whose training data distribution most closely matches that of the test sample. Specifically, Multiple Queries enable precise breadth search by introducing task-specific knowledge, while Multiple Keys perform deep search by representing the feature distribution of training samples at a fine-grained level. Experiments show that MQMK enhances the prompt matching rate by over 30% in challenging scenarios and achieves state-of-the-art performance on three widely adopted continual learning benchmarks. Once this paper is accepted, we will release the code.
zh

[CV-33] TeD-Loc: Text Distillation for Weakly Supervised Object Localization

【速读】:该论文试图解决弱监督目标定位(Weakly Supervised Object Localization, WSOL)中的两个主要问题:一是传统WSOL方法(如类激活映射)依赖于分类目标,通常只关注最具区分性的物体部分,而忽略了物体的完整空间范围;二是基于视觉-语言模型(如CLIP)的WSOL方法需要真实类别标签或外部分类器来生成定位图,限制了其在下游任务中的部署。此外,现有方法(如GenPromp)虽然尝试解决这些问题,但由于其依赖于条件去噪过程和复杂的提示学习,引入了较高的复杂性。

论文提出的解决方案是Text Distillation for Localization (TeD-Loc),其关键点在于直接从CLIP的文本嵌入(text embeddings)中蒸馏知识到模型骨干网络中,并生成图像块级别的定位。通过多实例学习(Multiple Instance Learning)这些图像块,TeD-Loc能够在不依赖外部分类器的情况下,使用单一模型实现准确的定位和分类。这种文本和视觉模态的集成解决了WSOL方法在文献中通常在不同训练周期收敛的问题,实现了定位和分类的同步优化。实验表明,TeD-Loc在CUB和ILSVRC数据集上的Top-1 LOC准确率比现有最先进模型提高了约5%,同时显著降低了计算复杂度。

链接: https://arxiv.org/abs/2501.12632
作者: Shakeeb Murtaza,Soufiane Belharbi,Marco Pedersoli,Eric Granger
机构: LIVIA, ILLS, Dept. of Systems Engineering, ETS Montreal, Canada
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Weakly supervised object localization (WSOL) using classification models trained with only image-class labels remains an important challenge in computer vision. Given their reliance on classification objectives, traditional WSOL methods like class activation mapping focus on the most discriminative object parts, often missing the full spatial extent. In contrast, recent WSOL methods based on vision-language models like CLIP require ground truth classes or external classifiers to produce a localization map, limiting their deployment in downstream tasks. Moreover, methods like GenPromp attempt to address these issues but introduce considerable complexity due to their reliance on conditional denoising processes and intricate prompt learning. This paper introduces Text Distillation for Localization (TeD-Loc), an approach that directly distills knowledge from CLIP text embeddings into the model backbone and produces patch-level localization. Multiple instance learning of these image patches allows for accurate localization and classification using one model without requiring external classifiers. Such integration of textual and visual modalities addresses the longstanding challenge of achieving accurate localization and classification concurrently, as WSOL methods in the literature typically converge at different epochs. Extensive experiments show that leveraging text embeddings and localization cues provides a cost-effective WSOL model. TeD-Loc improves Top-1 LOC accuracy over state-of-the-art models by about 5% on both CUB and ILSVRC datasets, while significantly reducing computational complexity compared to GenPromp.
zh

[CV-34] Adapting OpenAI’s CLIP Model for Few-Shot Image Inspection in Manufacturing Quality Control: An Expository Case Study with Multiple Application Examples

【速读】:该论文试图解决在制造业中基于图像的质检问题,特别是在小样本学习(few-shot learning)场景下的应用。解决方案的关键在于利用OpenAI的CLIP(Contrastive Language-Image Pretraining)模型,通过对比学习(contrastive learning)方法,将图像和文本的表示进行对齐,从而在少量标注数据的情况下实现高效的图像分类和质量检测。论文通过五个案例研究(包括金属表面检测、3D打印挤出轮廓分析、随机纹理表面评估、汽车装配检测和微观结构图像分类)评估了CLIP在制造业质检中的有效性。结果表明,CLIP在单一组件和基于纹理的应用中能够以较小的学习集(每类50-100个样本)实现高分类精度,但在复杂多组件场景中性能有所下降。论文还提供了一个实用的实现框架,帮助质量工程师快速评估CLIP是否适用于其特定应用场景,从而在追求更复杂解决方案之前做出决策。

链接: https://arxiv.org/abs/2501.12596
作者: Fadel M. Megahed,Ying-Ju Chen,Bianca Maria Colosimo,Marco Luigi Giuseppe Grasso,L. Allison Jones-Farmer,Sven Knoth,Hongyue Sun,Inez Zwetsloot
机构: Farmer School of Business, Miami University(迈阿密大学); College of Arts and Sciences, University of Dayton(代顿大学); Department of Mechanical Engineering, Politecnico di Milano(米兰理工大学); Mathematics & Statistics, Helmut Schmidt University(赫尔穆特·施密特大学); College of Engineering, University of Georgia(乔治亚大学); Amsterdam Business School, University of Amsterdam(阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Other Statistics (stat.OT)
备注: 31 pages, 13 figures

点击查看摘要

Abstract:This expository paper introduces a simplified approach to image-based quality inspection in manufacturing using OpenAI’s CLIP (Contrastive Language-Image Pretraining) model adapted for few-shot learning. While CLIP has demonstrated impressive capabilities in general computer vision tasks, its direct application to manufacturing inspection presents challenges due to the domain gap between its training data and industrial applications. We evaluate CLIP’s effectiveness through five case studies: metallic pan surface inspection, 3D printing extrusion profile analysis, stochastic textured surface evaluation, automotive assembly inspection, and microstructure image classification. Our results show that CLIP can achieve high classification accuracy with relatively small learning sets (50-100 examples per class) for single-component and texture-based applications. However, the performance degrades with complex multi-component scenes. We provide a practical implementation framework that enables quality engineers to quickly assess CLIP’s suitability for their specific applications before pursuing more complex solutions. This work establishes CLIP-based few-shot learning as an effective baseline approach that balances implementation simplicity with robust performance, demonstrated in several manufacturing quality control applications.
zh
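以下给出一个极简的小样本分类示意(非论文原实现):用 Hugging Face transformers 的公开 CLIP 接口提取图像嵌入,以每类少量标注样本的嵌入均值作为类原型,再按余弦相似度做最近原型分类。其中模型名为公开权重,类别名与图片路径均为假设示例。

```python
# 基于 CLIP 图像嵌入的小样本(few-shot)质检分类示意
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# 每类 50-100 张标注样本 -> 类原型(嵌入均值);路径为假设示例
support = {"ok": ["ok_001.png", "ok_002.png"],
           "defect": ["def_001.png", "def_002.png"]}
prototypes = {c: embed(ps).mean(dim=0) for c, ps in support.items()}

def classify(path):
    q = embed([path])[0]
    scores = {c: float(q @ p) for c, p in prototypes.items()}  # 余弦相似度
    return max(scores, key=scores.get)
```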

[CV-35] ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality

【速读】:该论文旨在解决增强现实(AR)环境中虚拟内容可能对任务性能产生负面影响的问题,特别是两类任务有害的虚拟内容:遮挡攻击(obstruction attacks)和信息操纵攻击(information manipulation attacks)。遮挡攻击指虚拟内容遮挡了用户对真实世界物体的视线,而信息操纵攻击则指虚拟内容干扰了用户对真实世界信息的准确解读。为了解决这些问题,论文提出了ViDDAR(Vision language model-based Task-Detrimental content Detector for Augmented Reality),这是一个基于视觉语言模型(VLMs)和深度学习技术的全参考系统,能够在AR环境中监控和评估虚拟内容。ViDDAR采用用户-边缘-云架构,以在性能和低延迟之间取得平衡。该系统的关键创新在于首次将VLMs应用于AR环境中任务有害内容的检测,并通过实验验证了其有效性:遮挡攻击检测准确率最高达92.15%(延迟533毫秒),信息操纵攻击检测准确率为82.46%(延迟9.62秒)。

链接: https://arxiv.org/abs/2501.12553
作者: Yanming Xiu,Tim Scargill,Maria Gorlatova
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:In Augmented Reality (AR), virtual content enhances user experience by providing additional information. However, improperly positioned or designed virtual content can be detrimental to task performance, as it can impair users’ ability to accurately interpret real-world information. In this paper we examine two types of task-detrimental virtual content: obstruction attacks, in which virtual content prevents users from seeing real-world objects, and information manipulation attacks, in which virtual content interferes with users’ ability to accurately interpret real-world information. We provide a mathematical framework to characterize these attacks and create a custom open-source dataset for attack evaluation. To address these attacks, we introduce ViDDAR (Vision language model-based Task-Detrimental content Detector for Augmented Reality), a comprehensive full-reference system that leverages Vision Language Models (VLMs) and advanced deep learning techniques to monitor and evaluate virtual content in AR environments, employing a user-edge-cloud architecture to balance performance with low latency. To the best of our knowledge, ViDDAR is the first system to employ VLMs for detecting task-detrimental content in AR settings. Our evaluation results demonstrate that ViDDAR effectively understands complex scenes and detects task-detrimental content, achieving up to 92.15% obstruction detection accuracy with a detection latency of 533 ms, and an 82.46% information manipulation content detection accuracy with a latency of 9.62 s.
zh

[CV-36] How Does the Spatial Distribution of Pre-training Data Affect Geospatial Foundation Models? AAAI2025

【速读】:该论文试图解决地理空间基础模型(Geospatial Foundation Models, GFMs)在预训练数据选择上对模型性能的影响问题。具体而言,研究探讨了预训练数据的地理分布如何影响GFMs在下游任务中的表现。解决方案的关键在于通过从全球数据池中采样不同的数据组合,评估多种预训练数据分布对模型性能的影响。实验结果表明,平衡且具有全球代表性的数据组合通常优于特定区域的数据采样,强调了预训练数据的多样性和全球覆盖的重要性。此外,研究还指出,最合适的数据采样技术可能取决于具体的GFM架构。这些发现将为开发更稳健的GFMs提供支持,通过纳入高质量的预训练数据分布,最终提升地球观测领域的机器学习解决方案。

链接: https://arxiv.org/abs/2501.12535
作者: Mirali Purohit,Gedeon Muhawenayo,Esther Rolf,Hannah Kerner
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Good Data for Generative AI @ AAAI 2025

点击查看摘要

Abstract:Foundation models have made rapid advances in many domains including Earth observation, where Geospatial Foundation Models (GFMs) can help address global challenges such as climate change, agriculture, and disaster response. Previous work on GFMs focused on tailoring model architecture and pre-text tasks, and did not investigate the impact of pre-training data selection on model performance. However, recent works from other domains show that the pre-training data distribution is an important factor influencing the performance of the foundation models. With this motivation, our research explores how the geographic distribution of pre-training data affects the performance of GFMs. We evaluated several pre-training data distributions by sampling different compositions from a global data pool. Our experiments with two GFMs on downstream tasks indicate that balanced and globally representative data compositions often outperform region-specific sampling, highlighting the importance of diversity and global coverage in pre-training data. Our results suggest that the most appropriate data sampling technique may depend on the specific GFM architecture. These findings will support the development of robust GFMs by incorporating quality pre-training data distributions, ultimately improving machine learning solutions for Earth observation.
zh

[CV-37] Large-image Object Detection for Fine-grained Recognition of Punches Patterns in Medieval Panel Painting

【速读】:该论文旨在解决艺术品作者归属问题,传统上这一过程依赖于专家的主观评估,耗时且缺乏客观性。论文提出了一种基于机器学习(ML)技术的自动化解决方案,通过提取艺术品中的定量特征来支持作者归属的判定。具体而言,研究聚焦于13至14世纪托斯卡纳地区木板画中常见的重复机械压印图案(punches),这些图案的形状与特定艺术家或工作室之间存在强关联性。研究的关键在于使用YOLOv10这一先进的目标检测模型,结合滑动窗口方法和自定义的非极大值抑制(non-maximal suppression)算法,对大尺寸图像中的punches进行检测和提取。该方法为艺术史学家提供了一种可靠的工具,用于识别和提取punches,从而辅助作者归属的判定。

链接: https://arxiv.org/abs/2501.12489
作者: Josh Bruegger,Diana Ioana Catana,Vanja Macovaz,Matias Valdenegro-Toro,Matthia Sabatelli,Marco Zullich
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The attribution of the author of an art piece is typically a laborious manual process, usually relying on subjective evaluations of expert figures. However, there are some situations in which quantitative features of the artwork can support these evaluations. The extraction of these features can sometimes be automated, for instance, with the use of Machine Learning (ML) techniques. An example of these features is represented by repeated, mechanically impressed patterns, called punches, present chiefly in 13th and 14th-century panel paintings from Tuscany. Previous research in art history showcased a strong connection between the shapes of punches and specific artists or workshops, suggesting the possibility of using these quantitative cues to support the attribution. In the present work, we first collect a dataset of large-scale images of these panel paintings. Then, using YOLOv10, a recent and popular object detection model, we train a ML pipeline to perform object detection on the punches contained in the images. Due to the large size of the images, the detection procedure is split across multiple frames by adopting a sliding-window approach with overlaps, after which the predictions are combined for the whole image using a custom non-maximal suppression routine. Our results indicate how art historians working in the field can reliably use our method for the identification and extraction of punches.
zh
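摘要中“滑动窗口切块 + 重叠 + 自定义非极大值抑制”的大图检测流程可以示意如下(非论文原代码;为便于演示,这里用 torchvision 自带的 NMS 代替论文中的自定义合并例程,detector 的返回格式为假设接口)。

```python
# 滑动窗口大图检测示意:切块 -> 逐块检测 -> 坐标平移回原图 -> NMS 合并
import torch
from torchvision.ops import nms

def detect_large_image(image, detector, win=1024, overlap=256, iou_thr=0.5):
    """image: (C,H,W) 张量;detector(tile) 返回 (boxes[N,4], scores[N])(假设接口)"""
    _, H, W = image.shape
    stride = win - overlap
    all_boxes, all_scores = [], []
    for y in range(0, max(H - overlap, 1), stride):
        for x in range(0, max(W - overlap, 1), stride):
            boxes, scores = detector(image[:, y:y + win, x:x + win])
            if boxes.numel() == 0:
                continue
            # 把窗口内坐标平移回全图坐标系
            all_boxes.append(boxes + torch.tensor([x, y, x, y], dtype=boxes.dtype))
            all_scores.append(scores)
    if not all_boxes:
        return torch.empty(0, 4), torch.empty(0)
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)  # 合并重叠窗口中的重复检测
    return boxes[keep], scores[keep]
```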

[CV-38] fabSAM: A Farmland Boundary Delineation Method Based on the Segment Anything Model

【速读】:该论文旨在解决农田边界划分(farmland boundary delineation)问题,这对于农业管理中的作物监测和农业普查至关重要。传统的遥感影像方法虽然高效,但在泛化能力上存在局限。论文提出了一种基于Segment Anything Model (SAM)的农田边界划分框架“fabSAM”,该框架结合了基于Deeplabv3+的Prompter和SAM,并通过微调策略优化了SAM解码器对提示信息(prompt information)的利用。实验结果表明,fabSAM在AI4Boundaries和AI4SmallFarms数据集上的农田区域识别和边界划分性能显著提升,相较于零样本SAM和Deeplabv3+,fabSAM在mIOU(平均交并比)上分别提高了23.5%、15.1%和4.9%、12.5%。这一解决方案的关键在于通过提示学习和微调策略增强了SAM在遥感任务中的适应性和精度,从而能够更高效地从开源卫星影像数据(如Sentinel2)中获取全球农田区域和边界地图。

链接: https://arxiv.org/abs/2501.12487
作者: Yufeng Xie,Hanzhi Wu,Hongxiang Tong,Lei Xiao,Wenwen Zhou,Ling Li,Thomas Cherico Wanger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Delineating farmland boundaries is essential for agricultural management such as crop monitoring and agricultural census. Traditional methods using remote sensing imagery have been efficient but limited in generalisation. The Segment Anything Model (SAM), known for its impressive zero shot performance, has been adapted for remote sensing tasks through prompt learning and fine tuning. Here, we propose a SAM based farmland boundary delineation framework ‘fabSAM’ that combines a Deeplabv3+ based Prompter and SAM. Also, a fine tuning strategy was introduced to enable SAM’s decoder to improve the use of prompt information. Experimental results on the AI4Boundaries and AI4SmallFarms datasets have shown that fabSAM has a significant improvement in farmland region identification and boundary delineation. Compared to zero shot SAM, fabSAM surpassed it by 23.5% and 15.1% in mIOU on the AI4Boundaries and AI4SmallFarms datasets, respectively. For Deeplabv3+, fabSAM outperformed it by 4.9% and 12.5% in mIOU, respectively. These results highlight the effectiveness of fabSAM, which also means that we can more easily obtain the global farmland region and boundary maps from open source satellite image datasets like Sentinel2.
zh

[CV-39] TOFFE – Temporally-binned Object Flow from Events for High-speed and Energy-Efficient Object Detection and Tracking

【速读】:该论文旨在解决在边缘机器人系统(如小型无人机)中,高速运动场景下的物体检测与跟踪问题。传统基于帧的相机(frame-based cameras)虽然提供了丰富的空间信息,但其高能耗和低时间分辨率使其在高速运动场景中表现不佳。事件相机(event-based cameras)通过捕捉强度变化,提供了高时间分辨率和低功耗的优势,但其异步和稀疏的输出与传统深度学习方法不兼容。为此,论文提出了TOFFE,一种轻量级混合框架,结合了生物启发的脉冲神经网络(Spiking Neural Networks, SNNs)和传统的模拟神经网络(Analog Neural Networks, ANNs),以高效处理事件数据并实现高速运动场景下的物体运动估计(包括姿态、方向和速度估计)。TOFFE的关键在于其能够在高时间分辨率下处理事件数据,同时降低能耗和延迟,并通过新的事件合成数据集进行训练,显著优于现有的事件检测基线方法。

链接: https://arxiv.org/abs/2501.12482
作者: Adarsh Kumar Kosta,Amogh Joshi,Arjun Roy,Rohan Kumar Manna,Manish Nagaraj,Kaushik Roy
机构: Elmore Family School of Electrical and Computer Engineering, Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
备注: 8 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Object detection and tracking is an essential perception task for enabling fully autonomous navigation in robotic systems. Edge robot systems such as small drones need to execute complex maneuvers at high-speeds with limited resources, which places strict constraints on the underlying algorithms and hardware. Traditionally, frame-based cameras are used for vision-based perception due to their rich spatial information and simplified synchronous sensing capabilities. However, obtaining detailed information across frames incurs high energy consumption and may not even be required. In addition, their low temporal resolution renders them ineffective in high-speed motion scenarios. Event-based cameras offer a biologically-inspired solution to this by capturing only changes in intensity levels at exceptionally high temporal resolution and low power consumption, making them ideal for high-speed motion scenarios. However, their asynchronous and sparse outputs are not natively suitable with conventional deep learning methods. In this work, we propose TOFFE, a lightweight hybrid framework for performing event-based object motion estimation (including pose, direction, and speed estimation), referred to as Object Flow. TOFFE integrates bio-inspired Spiking Neural Networks (SNNs) and conventional Analog Neural Networks (ANNs), to efficiently process events at high temporal resolutions while being simple to train. Additionally, we present a novel event-based synthetic dataset involving high-speed object motion to train TOFFE. Our experimental results show that TOFFE achieves 5.7x/8.3x reduction in energy consumption and 4.6x/5.8x reduction in latency on edge GPU(Jetson TX2)/hybrid hardware(Loihi-2 and Jetson TX2), compared to previous event-based object detection baselines.
zh

[CV-40] CroMe: Multimodal Fake News Detection using Cross-Modal Tri-Transformer and Metric Learning

【速读】:该论文旨在解决多模态假新闻检测(Multimodal Fake News Detection)中的两个主要问题:现有方法通常独立编码单模态数据,未能充分利用模态内关系(intra-modality relationships)和模态间相似性(inter-modal similarities)的优势。为解决这些问题,论文提出了CroMe模型,其关键解决方案包括:1)使用BLIP2(Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models)作为编码器,以捕捉文本、图像及图文结合的详细表示;2)通过代理锚点方法(proxy anchor method)的度量学习模块捕获模态内关系;3)利用跨模态三重Transformer(Cross-Modal and Tri-Transformer)进行特征融合,有效整合多模态信息。最终,融合后的特征通过分类器预测内容的真实性。实验表明,CroMe在多模态假新闻检测任务中表现优异。

链接: https://arxiv.org/abs/2501.12422
作者: Eunjee Choi,Junhyun Ahn,XinYu Piao,Jong-Kook Kim
机构: Department of Electrical and Computer Engineering, Korea University, Republic of Korea (韩国高丽大学电气与计算机工程系)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Fake News Detection has received increasing attention recently. Existing methods rely on independently encoded unimodal data and overlook the advantages of capturing intra-modality relationships and integrating inter-modal similarities using advanced techniques. To address these issues, Cross-Modal Tri-Transformer and Metric Learning for Multimodal Fake News Detection (CroMe) is proposed. CroMe utilizes Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (BLIP2) as encoders to capture detailed text, image and combined image-text representations. The metric learning module employs a proxy anchor method to capture intra-modality relationships while the feature fusion module uses a Cross-Modal and Tri-Transformer for effective integration. The final fake news detector processes the fused features through a classifier to predict the authenticity of the content. Experiments on datasets show that CroMe excels in multimodal fake news detection.
zh
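CroMe 用代理锚点方法捕获模态内关系;下面按 Kim 等人(2020)提出的 Proxy-Anchor 损失给出一个常见的 PyTorch 实现示意,超参数取常用默认值,并非论文原代码。

```python
# Proxy-Anchor 度量学习损失的常见实现示意
import torch
import torch.nn.functional as F

class ProxyAnchorLoss(torch.nn.Module):
    def __init__(self, num_classes, dim, margin=0.1, alpha=32.0):
        super().__init__()
        self.proxies = torch.nn.Parameter(torch.randn(num_classes, dim))
        torch.nn.init.kaiming_normal_(self.proxies)
        self.margin, self.alpha = margin, alpha

    def forward(self, embeddings, labels):
        x = F.normalize(embeddings, dim=-1)
        p = F.normalize(self.proxies, dim=-1)
        cos = x @ p.t()                                  # (B, C) 余弦相似度
        pos_mask = F.one_hot(labels, p.size(0)).bool()   # 样本-代理的正对掩码
        pos_exp = torch.exp(-self.alpha * (cos - self.margin)) * pos_mask
        neg_exp = torch.exp(self.alpha * (cos + self.margin)) * (~pos_mask)
        with_pos = pos_mask.any(dim=0)                   # batch 中出现过的类
        pos_term = torch.log1p(pos_exp.sum(dim=0))[with_pos].sum() / with_pos.sum()
        neg_term = torch.log1p(neg_exp.sum(dim=0)).sum() / p.size(0)
        return pos_term + neg_term
```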

[CV-41] ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models

【速读】:该论文旨在解决当前基于视觉-语言模型(Vision-Language Models, VLMs)的聊天机器人在对话过程中引用上下文相关图像的能力不足的问题。具体来说,现有的VLM驱动的聊天机器人虽然能够提供文本来源的引用,但在对话中引用与上下文相关的图像方面存在显著局限性。为此,论文提出了“上下文图像引用”(Contextual Image Reference)的概念,即根据对话上下文从检索文档中适当引用相关图像的能力,并系统地研究了VLMs在这一方面的能力。

解决方案的关键在于提出了ImageRef-VL方法,该方法通过对大规模手动整理的多模态对话数据集进行指令微调(instruction fine-tuning),显著提升了开源VLMs的图像引用能力。实验结果表明,ImageRef-VL不仅在上下文图像引用任务中优于专有模型,而且相较于最先进的开源VLMs,性能提升了88%。

链接: https://arxiv.org/abs/2501.12418
作者: Jingwei Yi,Junhao Yin,Ju Xu,Peng Bao,Yongliang Wang,Wei Fan,Hao Wang
机构: University of Science and Technology of China(中国科学技术大学); ByteDance(字节跳动); Peking University(北京大学); University of Oxford(牛津大学); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference – the ability to appropriately reference relevant images from retrieval documents based on conversation context – and systematically investigate VLMs’ capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs’ image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at this https URL.
zh

[CV-42] A polynomial formula for the perspective four points problem

【速读】:该论文旨在解决透视n点问题(Perspective n-Points Problem),特别是针对n=4的情况提出了一种快速且准确的解决方案。其核心创新在于通过变量分离的方法,将问题简化为绝对定向问题(Absolute Orientation Problem)。具体而言,给定四个3D点和相机画布上对应的四个2D点,首先找到另一组位于相机到2D点射线上的3D点,使得这些3D点之间的六对距离尽可能接近原始3D点之间的距离。这一步骤将透视问题转化为绝对定向问题,后者可以通过显式公式求解。为了进一步简化问题,作者采用了尽可能不受方向影响的坐标系:在3D点一侧使用点之间的平方距离作为坐标,而在2D画布点一侧则通过旋转使其中一个点位于光轴上,并使用点积作为坐标。最终,借助计算机代数系统推导出解决方案。

链接: https://arxiv.org/abs/2501.13058
作者: David Lehavi,Brian Osserman
机构: 未知
类目: Algebraic Geometry (math.AG); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:We present a fast and accurate solution to the perspective n-points problem, by way of a new approach to the n=4 case. Our solution hinges on a novel separation of variables: given four 3D points and four corresponding 2D points on the camera canvas, we start by finding another set of 3D points, sitting on the rays connecting the camera to the 2D canvas points, so that the six pair-wise distances between these 3D points are as close as possible to the six distances between the original 3D points. This step reduces the perspective problem to an absolute orientation problem (which has a solution via explicit formula). To solve the first problem we set coordinates which are as orientation-free as possible: on the 3D points side our coordinates are the squared distances between the points. On the 2D canvas-points side our coordinates are the dot products of the points after rotating one of them to sit on the optical axis. We then derive the solution with the help of a computer algebra system.
zh
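按摘要的思路,第一步可以写成如下优化问题(记法为本文示意,非原文符号):设相机位于原点,第 i 个画布点对应单位射线方向 u_i,求标量 s_i > 0,使射线上的点 s_i u_i 之间的两两平方距离尽量接近原 3D 点 P_i 之间的两两平方距离;解出 s_i 后,剩下的就是可用显式公式求解的绝对定向问题。

```latex
% 变量分离后的射线标量优化(示意性记法)
\min_{s_1,\dots,s_4 > 0} \;
\sum_{1 \le i < j \le 4}
\Bigl( \lVert s_i u_i - s_j u_j \rVert^{2} - \lVert P_i - P_j \rVert^{2} \Bigr)^{2}
```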

[CV-43] Learning accurate rigid registration for longitudinal brain MRI from synthetic data

【速读】:该论文试图解决在纵向(within-subject)图像配准中,现有机器学习方法在实现精确对齐方面的局限性问题。具体而言,现有的跨受试者(cross-subject)配准方法在处理纵向配准时表现不佳,而纵向配准在医学影像分析中尤为重要。论文提出了一种针对纵向刚性脑部配准优化的模型,该模型基于现有的解剖学感知、采集无关的仿射配准框架。解决方案的关键在于通过使用经过刚性和细微非线性变换增强的合成纵向图像对来训练模型,从而使其能够估计出比现有跨受试者网络更准确的刚性变换,并在不同磁共振成像(MRI)对比度下的纵向配准对中表现出鲁棒性。

链接: https://arxiv.org/abs/2501.13010
作者: Jingru Fu,Adrian V. Dalca,Bruce Fischl,Rodrigo Moreno,Malte Hoffmann
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures, 1 table, rigid image registration, deep learning, longitudinal analysis, neuroimaging, accepted by the IEEE International Symposium on Biomedical Imaging

点击查看摘要

Abstract:Rigid registration aims to determine the translations and rotations necessary to align features in a pair of images. While recent machine learning methods have become state-of-the-art for linear and deformable registration across subjects, they have demonstrated limitations when applied to longitudinal (within-subject) registration, where achieving precise alignment is critical. Building on an existing framework for anatomy-aware, acquisition-agnostic affine registration, we propose a model optimized for longitudinal, rigid brain registration. By training the model with synthetic within-subject pairs augmented with rigid and subtle nonlinear transforms, the model estimates more accurate rigid transforms than previous cross-subject networks and performs robustly on longitudinal registration pairs within and across magnetic resonance imaging (MRI) contrasts.
zh
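论文用“刚性 + 细微非线性”变换增强的合成同受试者图像对来训练模型;下面用 scipy 给出生成随机小幅刚体变换并作用于 3D 体数据的示意,旋转/平移幅度等参数均为假设值,仅说明数据合成的思路。

```python
# 合成同受试者训练对示意:对体数据施加随机小幅刚体变换
import numpy as np
from scipy.ndimage import affine_transform
from scipy.spatial.transform import Rotation

def random_rigid_pair(volume, max_rot_deg=10.0, max_trans_vox=5.0, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    angles = rng.uniform(-max_rot_deg, max_rot_deg, size=3)
    R = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
    t = rng.uniform(-max_trans_vox, max_trans_vox, size=3)
    center = (np.asarray(volume.shape) - 1) / 2.0
    # scipy 的 affine_transform 做逆映射:output(x) = input(R @ x + offset)
    offset = center - R @ center + t
    moved = affine_transform(volume, R, offset=offset, order=1)
    return volume, moved, (R, t)  # 固定图像、浮动图像与采样变换参数
```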

[CV-44] FDG-Diff: Frequency-Domain-Guided Diffusion Framework for Compressed Hazy Image Restoration

【速读】:该论文旨在解决雾霾降解(haze degradation)与JPEG压缩(JPEG compression)共同作用时引入的复杂联合损失效应,这一问题显著增加了图像复原的难度。现有的去雾模型通常忽略压缩效应,导致其在实际应用中的效果受限。为解决这一问题,论文提出了三个关键贡献:首先,设计了FDG-Diff(Frequency-Domain-Guided Dehazing Framework),一种新颖的频域引导去雾框架,通过利用频域信息提升JPEG图像的复原质量;其次,引入了高频补偿模块(High-Frequency Compensation Module, HFCM),通过将频域增强技术融入基于扩散的复原框架,增强了空间域细节的复原;最后,提出了退化感知去噪时间步预测器(Degradation-Aware Denoising Timestep Predictor, DADTP),通过实现自适应的区域特异性复原,有效解决了压缩雾霾图像中区域退化不一致的问题。实验结果表明,该方法在多个压缩去雾数据集上均优于最新的先进方法。

链接: https://arxiv.org/abs/2501.12832
作者: Ruicheng Zhang,Kanghui Tian,Zeyu Zhang,Qixiang Liu,Zhi Jin
机构: Sun Yat-sen University (中山大学); The Australian National University (澳大利亚国立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this study, we reveal that the interaction between haze degradation and JPEG compression introduces complex joint loss effects, which significantly complicate image restoration. Existing dehazing models often neglect compression effects, which limits their effectiveness in practical applications. To address these challenges, we introduce three key contributions. First, we design FDG-Diff, a novel frequency-domain-guided dehazing framework that improves JPEG image restoration by leveraging frequency-domain information. Second, we introduce the High-Frequency Compensation Module (HFCM), which enhances spatial-domain detail restoration by incorporating frequency-domain augmentation techniques into a diffusion-based restoration framework. Lastly, the introduction of the Degradation-Aware Denoising Timestep Predictor (DADTP) module further enhances restoration quality by enabling adaptive region-specific restoration, effectively addressing regional degradation inconsistencies in compressed hazy images. Experimental results across multiple compressed dehazing datasets demonstrate that our method consistently outperforms the latest state-of-the-art approaches. Code will be available at this https URL.
zh
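HFCM 的出发点是利用频域信息补偿高频细节;下面用 torch.fft 给出提取图像高频分量的通用示意(截止半径为假设超参),它并非论文的 HFCM 模块本身,只演示“频域掩模分离高/低频”这一基本操作。

```python
# 频域高频分量提取示意:FFT -> 置零中心低频 -> 逆变换
import torch

def high_frequency(x, radius=16):
    """x: (B, C, H, W) 实值图像;radius: 低频截止半径(假设值)"""
    B, C, H, W = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H, device=x.device),
                            torch.arange(W, device=x.device), indexing="ij")
    dist = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float()).sqrt()
    mask = (dist > radius).to(spec.dtype)  # 抑制中心低频,保留高频
    spec = spec * mask
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
```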

[CV-45] Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models

【速读】:该论文旨在解决单幅图像去模糊(single-image deblurring)问题,特别是由相机抖动和物体运动引起的复杂非线性模糊。传统方法通常依赖于空间域卷积模型,难以有效处理这些复杂的运动模糊。论文提出了一种新颖的去模糊方法,将运动模糊视为时间平均现象,并利用预训练的视频扩散变换器模型(video diffusion transformer model)在潜在空间中捕捉多样化的运动动态。该方法的关键创新在于避免了显式的核估计(kernel estimation),并通过扩散逆问题框架(diffusion-based inverse problem framework)有效处理了多种运动模式。实验结果表明,该方法在合成和真实数据集上均优于现有技术,为利用视频扩散模型解决单幅图像去模糊问题提供了新的思路。

链接: https://arxiv.org/abs/2501.12604
作者: Wang Pang,Zhihao Zhan,Xiang Zhu,Yechao Bai
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Most motion deblurring algorithms rely on spatial-domain convolution models, which struggle with the complex, non-linear blur arising from camera shake and object motion. In contrast, we propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon. Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics within a latent space. It sidesteps explicit kernel estimation and effectively accommodates diverse motion patterns. We implement the algorithm within a diffusion-based inverse problem framework. Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios. This work paves the way for utilizing powerful video diffusion models to address single-image deblurring challenges.
zh
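“把运动模糊视为时间平均”的退化模型本身只需几行代码即可说明:对曝光时间内的连续清晰帧取平均,便得到对应的模糊图像(以下仅为退化过程示意,去模糊算法本身依赖论文中的视频扩散模型)。

```python
# 运动模糊 = 曝光时间内清晰帧的时间平均(退化模型示意)
import numpy as np

def synthesize_motion_blur(frames):
    """frames: (T, H, W, C) 的 uint8 清晰帧序列,返回模糊图像"""
    stack = np.asarray(frames, dtype=np.float64)
    return stack.mean(axis=0).round().astype(np.uint8)
```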

[CV-46] Efficient Lung Ultrasound Severity Scoring Using Dedicated Feature Extractor

【速读】:该论文试图解决在COVID-19检测中使用超声成像(ultrasound imaging)时面临的挑战,特别是由于公开可用的超声数据集规模有限且缺乏适当的标注,导致训练鲁棒的AI模型存在困难。为了解决这一问题,论文提出了MeDiVLAD,一种新颖的管道,用于多级肺部超声(LUS)严重程度评分。解决方案的关键在于利用自知识蒸馏(self-knowledge distillation)预训练视觉变换器(ViT),并通过双级VLAD聚合(dual-level VLAD aggregation)来聚合帧级特征。这种方法在最小微调的情况下,能够在帧级和视频级评分中优于传统的全监督方法,并提供高质量的分类推理,从而实现对关键肺部病理区域的自动识别,并为更广泛的医学视频分类任务提供鲁棒的解决方案。

链接: https://arxiv.org/abs/2501.12524
作者: Jiaqi Guo,Yunnan Wu,Evangelos Kaimakamis,Georgios Petmezas,Vasileios E. Papageorgiou,Nicos Maglaveras,Aggelos K. Katsaggelos
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE ISBI 2025

点击查看摘要

Abstract:With the advent of the COVID-19 pandemic, ultrasound imaging has emerged as a promising technique for COVID-19 detection, due to its non-invasive nature, affordability, and portability. In response, researchers have focused on developing AI-based scoring systems to provide real-time diagnostic support. However, the limited size and lack of proper annotation in publicly available ultrasound datasets pose significant challenges for training a robust AI model. This paper proposes MeDiVLAD, a novel pipeline to address the above issue for multi-level lung-ultrasound (LUS) severity scoring. In particular, we leverage self-knowledge distillation to pretrain a vision transformer (ViT) without label and aggregate frame-level features via dual-level VLAD aggregation. We show that with minimal finetuning, MeDiVLAD outperforms conventional fully-supervised methods in both frame- and video-level scoring, while offering classification reasoning with exceptional quality. This superior performance enables key applications such as the automatic identification of critical lung pathology areas and provides a robust solution for broader medical video classification tasks.
zh
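经典 VLAD 聚合(对帧级特征相对码本中心的残差求和并归一化)可示意如下;MeDiVLAD 的“双级”聚合在此基础上分层进行,具体结构以论文为准。

```python
# 经典 VLAD 聚合示意:残差求和 + intra/L2 归一化
import numpy as np

def vlad(features, centers):
    """features: (N, D) 帧级特征;centers: (K, D) 码本中心;返回 (K*D,) 向量"""
    K, D = centers.shape
    assign = np.argmin(((features[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    v = np.zeros((K, D))
    for k in range(K):
        if np.any(assign == k):
            v[k] = (features[assign == k] - centers[k]).sum(axis=0)
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12  # intra-normalization
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                 # 全局 L2 归一化
```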

[CV-47] Bidirectional Brain Image Translation using Transfer Learning from Generic Pre-trained Models

【速读】:该论文试图解决医学影像领域中的数据稀缺问题,特别是在脑部成像中,获取标记的医学影像既耗时又昂贵。为了解决这一问题,论文提出了一种基于迁移学习(transfer learning)的方法,利用预训练的CycleGAN模型进行MR-CT图像转换。关键解决方案在于将预训练的非医学影像模型(18个)进行微调,以生成高质量的医学影像。通过使用四种广泛应用的图像质量评估指标(峰值信噪比、结构相似性指数、通用质量指数和视觉信息保真度)进行定量评估,并结合放射科医生的定性感知分析,验证了迁移学习在医学影像生成中的潜力。结果表明,选择合适的代表性训练图像对优化脑部影像分析任务的性能至关重要。

链接: https://arxiv.org/abs/2501.12488
作者: Fatima Haimour,Rizik Al-Sayyed,Waleed Mahafza,Omar S. Al-Kadi
机构: Faculty of Information Technology, Zarqa University, Zarqa 13110, Jordan (扎尔卡大学信息技术学院); King Abdullah II School for Information Technology, University of Jordan, Amman 11942, Jordan (约旦大学阿卜杜拉二世信息技术学院); Department of Diagnostic Radiology, Jordan University Hospital, Amman 11942, Jordan (约旦大学医院放射诊断科)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
备注: 19 pages, 9 figures, 6 tables

点击查看摘要

Abstract:Brain imaging plays a crucial role in the diagnosis and treatment of various neurological disorders, providing valuable insights into the structure and function of the brain. Techniques such as magnetic resonance imaging (MRI) and computed tomography (CT) enable non-invasive visualization of the brain, aiding in the understanding of brain anatomy, abnormalities, and functional connectivity. However, cost and radiation dose may limit the acquisition of specific image modalities, so medical image synthesis can be used to generate required medical images without actual addition. In the medical domain, where obtaining labeled medical images is labor-intensive and expensive, addressing data scarcity is a major challenge. Recent studies propose using transfer learning to overcome this issue. This involves adapting pre-trained CycleGAN models, initially trained on non-medical data, to generate realistic medical images. In this work, transfer learning was applied to the task of MR-CT image translation and vice versa using 18 pre-trained non-medical models, and the models were fine-tuned to have the best result. The models’ performance was evaluated using four widely used image quality metrics: Peak-signal-to-noise-ratio, Structural Similarity Index, Universal Quality Index, and Visual Information Fidelity. Quantitative evaluation and qualitative perceptual analysis by radiologists demonstrate the potential of transfer learning in medical imaging and the effectiveness of the generic pre-trained model. The results provide compelling evidence of the model’s exceptional performance, which can be attributed to the high quality and similarity of the training images to actual human brain images. These results underscore the significance of carefully selecting appropriate and representative training images to optimize performance in brain image analysis tasks.
zh
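论文用 PSNR、SSIM、UQI、VIF 四个指标评估生成图像质量;其中前两个可直接用 scikit-image 计算,如下示意(图片路径为假设示例;UQI 与 VIF 在 sewar 等第三方包中有实现,此处从略)。

```python
# 用 scikit-image 计算 PSNR 与 SSIM(生成图 vs. 参考图)
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

ref = io.imread("reference_ct.png")   # 假设的示例路径
gen = io.imread("generated_ct.png")

print("PSNR:", peak_signal_noise_ratio(ref, gen))
print("SSIM:", structural_similarity(
    ref, gen, channel_axis=-1 if ref.ndim == 3 else None))
```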

[CV-48] Slot-BERT: Self-supervised Object Discovery in Surgical Video

【速读】:该论文试图解决在手术视频中,传统基于对象(object-centric)的视频处理方法在保持长时间视频的远距离时间一致性(long-range temporal coherence)方面存在的挑战。传统方法通常依赖于循环处理(recurrent processing)以提高效率,但在处理长时间视频时难以维持所需的时间一致性。另一方面,完全并行处理(fully parallel processing)虽然增强了时间一致性,但带来了显著的计算开销,难以在医疗设施中的硬件上实现。

论文提出的解决方案是Slot-BERT,这是一种双向长距离模型(bidirectional long-range model),能够在潜在空间(latent space)中学习基于对象的表示,同时确保鲁棒的时间一致性。Slot-BERT通过无缝扩展对象发现(object discovery)到任意长度的长时间视频,解决了传统方法的局限性。此外,论文引入了一种新的槽对比损失(slot contrastive loss),通过增强槽的正交性(slot orthogonality)来减少冗余并改善表示的分离性(representation disentanglement)。该模型在多个真实世界的手术视频数据集上进行了评估,并在无监督训练下超越了现有的基于对象的方法,展示了其在跨领域的高效零样本域适应(zero-shot domain adaptation)能力。

链接: https://arxiv.org/abs/2501.12477
作者: Guiqiu Liao,Matjaz Jogan,Marcel Hussing,Kenta Nakahashi,Kazuhiro Yasufuku,Amin Madani,Eric Eaton,Daniel A. Hashimoto
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical videos. While conventional object-centric methods for videos leverage recurrent processing to achieve efficiency, they often struggle with maintaining long-range temporal coherence required for long videos in surgical applications. On the other hand, fully parallel processing of entire videos enhances temporal consistency but introduces significant computational overhead, making it impractical for implementation on hardware in medical facilities. We present Slot-BERT, a bidirectional long-range model that learns object-centric representations in a latent space while ensuring robust temporal coherence. Slot-BERT scales object discovery seamlessly to long videos of unconstrained lengths. A novel slot contrastive loss further reduces redundancy and improves the representation disentanglement by enhancing slot orthogonality. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. Our method surpasses state-of-the-art object-centric approaches under unsupervised training achieving superior performance across diverse domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
zh
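文中的槽对比损失通过增强槽正交性来减少冗余;一个直观的做法是惩罚归一化槽向量 Gram 矩阵的非对角元,如下示意(这只是对“正交性约束”的通用写法,并非论文原损失)。

```python
# 槽正交性惩罚示意:归一化槽向量的 Gram 矩阵应接近单位阵
import torch
import torch.nn.functional as F

def slot_orthogonality_loss(slots):
    """slots: (B, K, D) 槽表示;返回标量损失"""
    s = F.normalize(slots, dim=-1)
    gram = torch.bmm(s, s.transpose(1, 2))                 # (B, K, K) 余弦相似度
    eye = torch.eye(s.size(1), device=s.device).expand_as(gram)
    return ((gram - eye) ** 2).mean()                      # 非对角元 -> 0
```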

[CV-49] Multi-stage intermediate fusion for multimodal learning to classify non-small cell lung cancer subtypes from CT and PET

【速读】:该论文旨在解决非小细胞肺癌(NSCLC)组织学亚型准确分类的问题,特别是在精准医学时代,现有的侵入性技术不仅不总是可行,还可能导致临床并发症。论文提出了一种多阶段中间融合(multi-stage intermediate fusion)方法,通过结合CT和PET图像来分类NSCLC亚型。该方法的关键在于在不同特征提取阶段整合两种模态,利用体素级融合(voxel-wise fusion)在不同抽象层次上挖掘互补信息,同时保留空间相关性。通过对比仅使用CT或PET图像的单模态方法,以及早期和晚期融合技术,论文展示了中间融合在特征提取阶段的优势。实验结果表明,该方法在关键指标上优于所有替代方案,准确率和AUC分别达到0.724和0.681,具有显著提高诊断准确性、促进更明智的治疗决策和推动肺癌个性化管理的潜力。

链接: https://arxiv.org/abs/2501.12425
作者: Fatih Aksu,Fabrizia Gelardi,Arturo Chiti,Paolo Soda
机构: Humanitas University (Humanitas大学); IRCCS Ospedale San Raffaele (圣拉斐尔医院); Umeå University (于默奥大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Accurate classification of histological subtypes of non-small cell lung cancer (NSCLC) is essential in the era of precision medicine, yet current invasive techniques are not always feasible and may lead to clinical complications. This study presents a multi-stage intermediate fusion approach to classify NSCLC subtypes from CT and PET images. Our method integrates the two modalities at different stages of feature extraction, using voxel-wise fusion to exploit complementary information across varying abstraction levels while preserving spatial correlations. We compare our method against unimodal approaches using only CT or PET images to demonstrate the benefits of modality fusion, and further benchmark it against early and late fusion techniques to highlight the advantages of intermediate fusion during feature extraction. Additionally, we compare our model with the only existing intermediate fusion method for histological subtype classification using PET/CT images. Our results demonstrate that the proposed method outperforms all alternatives across key metrics, with an accuracy and AUC equal to 0.724 and 0.681, respectively. This non-invasive approach has the potential to significantly improve diagnostic accuracy, facilitate more informed treatment decisions, and advance personalized care in lung cancer management.
zh
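“多阶段体素级中间融合”的网络骨架可示意如下:CT 与 PET 两个 3D 卷积分支在每个特征提取阶段做逐元素融合;通道数、层数与融合方式(逐元素相加并注入 CT 分支)均为假设性选择,仅用于说明结构。

```python
# 多阶段体素级中间融合骨架示意(3D CNN, PyTorch)
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1),
                         nn.BatchNorm3d(cout), nn.ReLU(), nn.MaxPool3d(2))

class MultiStageFusionNet(nn.Module):
    def __init__(self, num_classes=2, chs=(16, 32, 64)):
        super().__init__()
        self.ct = nn.ModuleList(
            [block(1 if i == 0 else chs[i - 1], c) for i, c in enumerate(chs)])
        self.pet = nn.ModuleList(
            [block(1 if i == 0 else chs[i - 1], c) for i, c in enumerate(chs)])
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(chs[-1], num_classes))

    def forward(self, ct, pet):
        for f_ct, f_pet in zip(self.ct, self.pet):
            ct, pet = f_ct(ct), f_pet(pet)
            fused = ct + pet      # 体素级(逐元素)融合
            ct = fused            # 将融合特征注入 CT 分支(示意性选择)
        return self.head(fused)
```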

[CV-50] Comparative Analysis of Hand-Crafted and Machine-Driven Histopathological Features for Prostate Cancer Classification and Segmentation

【速读】:该论文旨在解决前列腺癌组织病理学图像中腺体结构的分割问题,以实现自动化Gleason分级。论文比较了两种方法:一种是基于手工特征提取的技术,结合灰度共生矩阵(Gray Level Co-Occurrence Matrix, GLCM)和局部二值模式(Local Binary Pattern, LBP)纹理描述符,以突出空间依赖性并最小化像素级信息丢失;另一种是基于U-Net卷积神经网络的机器驱动特征提取方法,用于前列腺腺体基质组织的语义分割。实验结果表明,基于手工特征的SVM分类器在GLCM和LBP上分别达到了99.0%和95.1%的分类准确率,而基于U-Net的机器驱动特征提取方法达到了94%的准确率。通过Jaccard和Dice指标的评估,U-Net方法在前列腺组织病理学分级1、2、3和4的分割质量上表现更优。该研究强调了机器驱动特征在前列腺组织图像自动化像素级分割的临床应用中的优势。

链接: https://arxiv.org/abs/2501.12415
作者: Feda Bolus Al Baqain,Omar Sultan Al-Kadi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: 13 pages, 14 figures, 2 tables

点击查看摘要

Abstract:Histopathological image analysis is a reliable method for prostate cancer identification. In this paper, we present a comparative analysis of two approaches for segmenting glandular structures in prostate images to automate Gleason grading. The first approach utilizes a hand-crafted learning technique, combining Gray Level Co-Occurrence Matrix (GLCM) and Local Binary Pattern (LBP) texture descriptors to highlight spatial dependencies and minimize information loss at the pixel level. For machine driven feature extraction, we employ a U-Net convolutional neural network to perform semantic segmentation of prostate gland stroma tissue. Support vector machine-based learning of hand-crafted features achieves impressive classification accuracies of 99.0% and 95.1% for GLCM and LBP, respectively, while the U-Net-based machine-driven features attain 94% accuracy. Furthermore, a comparative analysis demonstrates superior segmentation quality for histopathological grades 1, 2, 3, and 4 using the U-Net approach, as assessed by Jaccard and Dice metrics. This work underscores the utility of machine-driven features in clinical applications that rely on automated pixel-level segmentation in prostate tissue images.
zh
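GLCM 与 LBP 两类手工纹理特征都可以用 scikit-image 直接提取,再接 SVM 分类,如下示意(距离、角度、半径等参数为常用假设值,与论文的具体设置未必一致)。

```python
# GLCM / LBP 纹理特征提取 + SVM 分类示意
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern
from sklearn.svm import SVC

def glcm_features(gray_u8):
    glcm = graycomatrix(gray_u8, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

def lbp_features(gray_u8, P=8, R=1):
    lbp = local_binary_pattern(gray_u8, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

# X_imgs 为灰度图列表,y 为组织学分级标签(假设变量)
# X = np.vstack([glcm_features(im) for im in X_imgs])
# clf = SVC(kernel="rbf").fit(X, y)
```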

人工智能

[AI-0] Guaranteed Recovery of Unambiguous Clusters

链接: https://arxiv.org/abs/2501.13093
作者: Kayvon Mazooji,Ilan Shomorony
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: 11 pages

点击查看摘要

Abstract:Clustering is often a challenging problem because of the inherent ambiguity in what the “correct” clustering should be. Even when the number of clusters $K$ is known, this ambiguity often still exists, particularly when there is variation in density among different clusters, and clusters have multiple relatively separated regions of high density. In this paper we propose an information-theoretic characterization of when a $K$-clustering is ambiguous, and design an algorithm that recovers the clustering whenever it is unambiguous. This characterization formalizes the situation when two high density regions within a cluster are separable enough that they look more like two distinct clusters than two truly distinct clusters in the clustering. The algorithm first identifies $K$ partial clusters (or “seeds”) using a density-based approach, and then adds unclustered points to the initial $K$ partial clusters in a greedy manner to form a complete clustering. We implement and test a version of the algorithm that is modified to effectively handle overlapping clusters, and observe that it requires little parameter selection and displays improved performance on many datasets compared to widely used algorithms for non-convex cluster recovery.

[AI-1] Attention-Driven Hierarchical Reinforcement Learning with Particle Filtering for Source Localization in Dynamic Fields

链接: https://arxiv.org/abs/2501.13084
作者: Yiwei Shi,Mengyue Yang,Qi Zhang,Weinan Zhang,Cunjia Liu,Weiru Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In many real-world scenarios, such as gas leak detection or environmental pollutant tracking, solving the Inverse Source Localization and Characterization problem involves navigating complex, dynamic fields with sparse and noisy observations. Traditional methods face significant challenges, including partial observability, temporal and spatial dynamics, out-of-distribution generalization, and reward sparsity. To address these issues, we propose a hierarchical framework that integrates Bayesian inference and reinforcement learning. The framework leverages an attention-enhanced particle filtering mechanism for efficient and accurate belief updates, and incorporates two complementary execution strategies: Attention Particle Filtering Planning and Attention Particle Filtering Reinforcement Learning. These approaches optimize exploration and adaptation under uncertainty. Theoretical analysis proves the convergence of the attention-enhanced particle filter, while extensive experiments across diverse scenarios validate the framework’s superior accuracy, adaptability, and computational efficiency. Our results highlight the framework’s potential for broad applications in dynamic field estimation tasks.

[AI-2] Boosting MCTS with Free Energy Minimization

链接: https://arxiv.org/abs/2501.13083
作者: Mawaba Pascal Dao,Adrian Peter
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Active Inference, grounded in the Free Energy Principle, provides a powerful lens for understanding how agents balance exploration and goal-directed behavior in uncertain environments. Here, we propose a new planning framework that integrates Monte Carlo Tree Search (MCTS) with active inference objectives to systematically reduce epistemic uncertainty while pursuing extrinsic rewards. Our key insight is that MCTS, already renowned for its search efficiency, can be naturally extended to incorporate free energy minimization by blending expected rewards with information gain. Concretely, the Cross-Entropy Method (CEM) is used to optimize action proposals at the root node, while tree expansions leverage reward modeling alongside intrinsic exploration bonuses. This synergy allows our planner to maintain coherent estimates of value and uncertainty throughout planning, without sacrificing computational tractability. Empirically, we benchmark our planner on a diverse set of continuous control tasks, where it demonstrates performance gains over both standalone CEM and MCTS with random rollouts.
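摘要里“将期望奖励与信息增益相混合”的节点评分,一种可能的形式化如下(β、c 为权衡系数,记法为本文示意,非论文原式):

```latex
\tilde{Q}(s,a) \;=\;
\underbrace{\mathbb{E}\bigl[R \mid s,a\bigr]}_{\text{外在期望奖励}}
\;+\; \beta\,\underbrace{\mathrm{IG}(s,a)}_{\text{信息增益(认知不确定性的减少)}},
\qquad
a^{*} \;=\; \arg\max_{a}\; \tilde{Q}(s,a) + c\,\sqrt{\tfrac{\ln N(s)}{N(s,a)}}
```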

[AI-3] Evolution and The Knightian Blindspot of Machine Learning

链接: https://arxiv.org/abs/2501.13075
作者: Joel Lehman,Elliot Meyerson,Tarek El-Gaaly,Kenneth O. Stanley,Tarin Ziyaee
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper claims that machine learning (ML) largely overlooks an important facet of general intelligence: robustness to a qualitatively unknown future in an open world. Such robustness relates to Knightian uncertainty (KU) in economics, i.e. uncertainty that cannot be quantified, which is excluded from consideration in ML’s key formalisms. This paper aims to identify this blind spot, argue its importance, and catalyze research into addressing it, which we believe is necessary to create truly robust open-world AI. To help illuminate the blind spot, we contrast one area of ML, reinforcement learning (RL), with the process of biological evolution. Despite staggering ongoing progress, RL still struggles in open-world situations, often failing under unforeseen situations. For example, the idea of zero-shot transferring a self-driving car policy trained only in the US to the UK currently seems exceedingly ambitious. In dramatic contrast, biological evolution routinely produces agents that thrive within an open world, sometimes even to situations that are remarkably out-of-distribution (e.g. invasive species; or humans, who do undertake such zero-shot international driving). Interestingly, evolution achieves such robustness without explicit theory, formalisms, or mathematical gradients. We explore the assumptions underlying RL’s typical formalisms, showing how they limit RL’s engagement with the unknown unknowns characteristic of an ever-changing complex world. Further, we identify mechanisms through which evolutionary processes foster robustness to novel and unpredictable challenges, and discuss potential pathways to algorithmically embody them. The conclusion is that the intriguing remaining fragility of ML may result from blind spots in its formalisms, and that significant gains may result from direct confrontation with the challenge of KU.

[AI-4] AdaWM: Adaptive World Model based Planning for Autonomous Driving ICLR2025

链接: https://arxiv.org/abs/2501.13072
作者: Hang Wang,Xin Ye,Feng Tao,Abhirup Mallik,Burhaneddin Yaman,Liu Ren,Junshan Zhang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: ICLR 2025

点击查看摘要

Abstract:World model based reinforcement learning (RL) has emerged as a promising approach for autonomous driving, which learns a latent dynamics model and uses it to train a planning policy. To speed up the learning process, the pretrain-finetune paradigm is often used, where online RL is initialized by a pretrained model and a policy learned offline. However, naively performing such initialization in RL may result in dramatic performance degradation during the online interactions in the new task. To tackle this challenge, we first analyze the performance degradation and identify two primary root causes therein: the mismatch of the planning policy and the mismatch of the dynamics model, due to distribution shift. We further analyze the effects of these factors on performance degradation during finetuning, and our findings reveal that the choice of finetuning strategies plays a pivotal role in mitigating these effects. We then introduce AdaWM, an Adaptive World Model based planning method, featuring two key steps: (a) mismatch identification, which quantifies the mismatches and informs the finetuning strategy, and (b) alignment-driven finetuning, which selectively updates either the policy or the model as needed using efficient low-rank updates. Extensive experiments on the challenging CARLA driving tasks demonstrate that AdaWM significantly improves the finetuning process, resulting in more robust and efficient performance in autonomous driving systems.

[AI-5] Optimizing Return Distributions with Distributional Dynamic Programming

链接: https://arxiv.org/abs/2501.13028
作者: Bernardo Ávila Pires,Mark Rowland,Diana Borsa,Zhaohan Daniel Guo,Khimya Khetarpal,André Barreto,David Abel,Rémi Munos,Will Dabney
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:We introduce distributional dynamic programming (DP) methods for optimizing statistical functionals of the return distribution, with standard reinforcement learning as a special case. Previous distributional DP methods could optimize the same class of expected utilities as classic DP. To go beyond expected utilities, we combine distributional DP with stock augmentation, a technique previously introduced for classic DP in the context of risk-sensitive RL, where the MDP state is augmented with a statistic of the rewards obtained so far (since the first time step). We find that a number of recently studied problems can be formulated as stock-augmented return distribution optimization, and we show that we can use distributional DP to solve them. We analyze distributional value and policy iteration, with bounds and a study of what objectives these distributional DP methods can or cannot optimize. We describe a number of applications outlining how to use distributional DP to solve different stock-augmented return distribution optimization problems, for example maximizing conditional value-at-risk, and homeostatic regulation. To highlight the practical potential of stock-augmented return distribution optimization and distributional DP, we combine the core ideas of distributional value iteration with the deep RL agent DQN, and empirically evaluate it for solving instances of the applications discussed.

[AI-6] Provably-Safe Neural Network Training Using Hybrid Zonotope Reachability Analysis

链接: https://arxiv.org/abs/2501.13023
作者: Long Kiu Chung,Shreyas Kousik
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Even though neural networks are being increasingly deployed in safety-critical applications, it remains difficult to enforce constraints on their output, meaning that it is hard to guarantee safety in such settings. Towards addressing this, many existing methods seek to verify a neural network’s satisfaction of safety constraints, but do not address how to correct an “unsafe” network. On the other hand, the few works that extract a training signal from verification cannot handle non-convex sets, and are either conservative or slow. To address these challenges, this work proposes a neural network training method that can encourage the exact reachable set of a non-convex input set through a neural network with rectified linear unit (ReLU) nonlinearities to avoid a non-convex unsafe region, using recent results in non-convex set representation with hybrid zonotopes and extracting gradient information from mixed-integer linear programs (MILPs). The proposed method is fast, with the computational complexity of each training iteration comparable to that of solving a linear program (LP) with number of dimensions and constraints linear to the number of neurons and complexity of input and unsafe sets. For a neural network with three hidden layers of width 30, the method was able to drive the reachable set of a non-convex input set with 55 generators and 26 constraints out of a non-convex unsafe region with 21 generators and 11 constraints in 490 seconds.

[AI-7] Paper Quality Assessment based on Individual Wisdom Metrics from Open Peer Review

链接: https://arxiv.org/abs/2501.13014
作者: Andrii Zahorodnii,Jasper J.F. van den Bosch,Ian Charest,Christopher Summerfield,Ila R. Fiete
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 15 pages, 5 main text figures, 3 supplementary figures

点击查看摘要

Abstract:This study proposes a data-driven framework for enhancing the accuracy and efficiency of scientific peer review through an open, bottom-up process that estimates reviewer quality. Traditional closed peer review systems, while essential for quality control, are often slow, costly, and subject to biases that can impede scientific progress. Here, we introduce a method that evaluates individual reviewer reliability by quantifying agreement with community consensus scores and applying Bayesian weighting to refine paper quality assessments. We analyze open peer review data from two major scientific conferences, and demonstrate that reviewer-specific quality scores significantly improve the reliability of paper quality estimation. Perhaps surprisingly, we find that reviewer quality scores are unrelated to authorship quality. Our model incorporates incentive structures to recognize high-quality reviewers and encourage broader coverage of submitted papers, thereby mitigating the common “rich-get-richer” pitfall of social media. These findings suggest that open peer review, with mechanisms for estimating and incentivizing reviewer quality, offers a scalable and equitable alternative for scientific publishing, with potential to enhance the speed, fairness, and transparency of the peer review process.
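摘要的核心机制——以审稿人与共识的一致性估计其可靠性,再对分数做加权汇总——可以用几行 NumPy 示意;这里的可靠性权重形式(均方偏差的倒数)是本文的假设,并非论文的贝叶斯模型本身。

```python
# 审稿人可靠性加权的论文质量估计示意
import numpy as np

# scores[i, j]: 审稿人 j 给论文 i 的分数(NaN 表示未审;示例数据)
scores = np.array([[7, 6, np.nan],
                   [4, np.nan, 5],
                   [8, 7, 6]], dtype=float)

consensus = np.nanmean(scores, axis=1)                      # 初始共识:逐篇平均分
err = np.nanmean((scores - consensus[:, None]) ** 2, axis=0)
reliability = 1.0 / (err + 1e-6)                            # 偏离共识越小越可靠(假设形式)

w = np.where(np.isnan(scores), 0.0, reliability[None, :])
quality = np.nansum(np.nan_to_num(scores) * w, axis=1) / w.sum(axis=1)
print(quality)  # 可靠性加权后的论文质量估计
```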

[AI-8] MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

链接: https://arxiv.org/abs/2501.13011
作者: Sebastian Farquhar,Vikrant Varma,David Lindner,David Elson,Caleb Biddulph,Ian Goodfellow,Rohin Shah
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step “reward hacks”) even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes including 2-step environments with LLMs representing delegated oversight and encoded reasoning and longer-horizon gridworld environments representing sensor tampering.

[AI-9] Ehrenfeucht-Haussler Rank and Chain of Thought

链接: https://arxiv.org/abs/2501.12997
作者: Pablo Barceló,Alexander Kozachinskiy,Tomasz Steifer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The notion of rank of a Boolean function has been a cornerstone in the theory of PAC learning, enabling quasipolynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function $f$ corresponds to the minimum number of Chain of Thought (CoT) steps required by a single-layer transformer decoder with hard attention to compute $f$. Based on this characterization we establish tight bounds on the number of CoT steps required for specific problems, showing that $\ell$-fold function composition necessitates exactly $\ell$ CoT steps. Furthermore, we analyze the problem of identifying the position of the $k$-th occurrence of 1 in a Boolean sequence, proving that it requires $k$ CoT steps.

[AI-10] Galois groups of polynomials and neurosymbolic networks

链接: https://arxiv.org/abs/2501.12978
作者: Elira Shaska,Tony Shaska
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); History and Overview (math.HO)
备注:

点击查看摘要

Abstract:This paper introduces a novel approach to understanding Galois theory, one of the foundational areas of algebra, through the lens of machine learning. By analyzing polynomial equations with machine learning techniques, we aim to streamline the process of determining solvability by radicals and explore broader applications within Galois theory. This summary encapsulates the background, methodology, potential applications, and challenges of using data science in Galois theory. More specifically, we design a neurosymbolic network to classify Galois groups and show how this is more efficient than usual neural networks. We discover some very interesting distribution of polynomials for groups not isomorphic to the symmetric groups and alternating groups.

[AI-11] Accessible Smart Contracts Verification: Synthesizing Formal Models with Tamed LLM s

链接: https://arxiv.org/abs/2501.12972
作者: Jan Corazza,Ivan Gavran,Gabriela Moreira,Daniel Neider
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When blockchain systems are said to be trustless, what this really means is that all the trust is put into software. Thus, there are strong incentives to ensure blockchain software is correct – vulnerabilities here cost millions and break businesses. One of the most powerful ways of establishing software correctness is by using formal methods. Approaches based on formal methods, however, induce a significant overhead in terms of time and expertise required to successfully employ them. Our work addresses this critical disadvantage by automating the creation of a formal model – a mathematical abstraction of the software system – which is often a core task when employing formal methods. We perform model synthesis in three phases: we first transpile the code into model stubs; then we “fill in the blanks” using a large language model (LLM); finally, we iteratively repair the generated model, on both syntactical and semantical level. In this way, we significantly reduce the amount of time necessary to create formal models and increase accessibility of valuable software verification methods that rely on them. The practical context of our work was reducing the time-to-value of using formal models for correctness audits of smart contracts.

[AI-12] Its complicated. The relationship of algorithmic fairness and non-discrimination regulations in the EU AI Act

链接: https://arxiv.org/abs/2501.12962
作者: Kristof Meding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:What constitutes a fair decision? This question is not only difficult for humans but becomes more challenging when Artificial Intelligence (AI) models are used. In light of discriminatory algorithmic behaviors, the EU has recently passed the AI Act, which mandates specific rules for AI models, incorporating both traditional legal non-discrimination regulations and machine learning based algorithmic fairness concepts. This paper aims to bridge these two different concepts in the AI Act through: first, a high-level introduction of both concepts targeting legal and computer science-oriented scholars, and second, an in-depth analysis of the AI Act’s relationship between legal non-discrimination regulations and algorithmic fairness. Our analysis reveals three key findings: (1) most non-discrimination regulations target only high-risk AI systems; (2) the regulation of high-risk systems encompasses both data input requirements and output monitoring, though these regulations are often inconsistent and raise questions of computational feasibility; (3) regulations for General Purpose AI Models, such as Large Language Models that are not simultaneously classified as high-risk systems, currently lack specificity compared to other regulations. Based on these findings, we recommend developing more specific auditing and testing methodologies for AI systems. This paper aims to serve as a foundation for future interdisciplinary collaboration between legal scholars and computer science-oriented machine learning researchers studying discrimination in AI systems.

[AI-13] GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models

链接: https://arxiv.org/abs/2501.12956
作者: Pengxiang Zhao,Xiaoming Yuan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57× speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
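
To see why a non-uniform lookup table (LUT) beats a uniform grid on bell-shaped weight distributions, here is a minimal sketch that uses k-means as a stand-in for GANQ's GPU-adaptive, training-free optimization (the paper's actual algorithm and its LUT-based mpGEMM kernels are not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

def lut_quantize(w, bits=3):
    """Non-uniform quantization: learn a 2^bits-entry lookup table (LUT)
    of centroids and store one small index per weight."""
    k = 2 ** bits
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(w.reshape(-1, 1))
    lut = km.cluster_centers_.ravel().astype(np.float32)
    idx = km.labels_.astype(np.uint8)
    return lut, idx

def uniform_quantize(w, bits=3):
    """Baseline: a uniform grid over [min, max]."""
    k = 2 ** bits
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (k - 1)
    return lo + np.round((w - lo) / step) * step

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # weights are bell-shaped, not uniform
lut, idx = lut_quantize(w)
mse_lut = np.mean((w - lut[idx]) ** 2)
mse_uni = np.mean((w - uniform_quantize(w)) ** 2)
print(f"uniform MSE={mse_uni:.5f}  LUT MSE={mse_lut:.5f}")  # LUT fits the mass near zero
```

On Gaussian-like weights the learned centroids crowd around zero where most of the mass sits, which is exactly what a uniform grid cannot do.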

[AI-14] Offline Critic-Guided Diffusion Policy for Multi-User Delay-Constrained Scheduling

链接: https://arxiv.org/abs/2501.12942
作者: Zhuoran Li,Ruishuo Chen,Hai Zhong,Longbo Huang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Effective multi-user delay-constrained scheduling is crucial in various real-world applications, such as instant messaging, live streaming, and data center management. In these scenarios, schedulers must make real-time decisions to satisfy both delay and resource constraints without prior knowledge of system dynamics, which are often time-varying and challenging to estimate. Current learning-based methods typically require interactions with actual systems during the training stage, which can be difficult or impractical, as it can significantly degrade system performance and incur substantial service costs. To address these challenges, we propose a novel offline reinforcement learning-based algorithm, named Scheduling By Offline Learning with Critic Guidance and Diffusion Generation (SOCD), to learn efficient scheduling policies purely from pre-collected offline data. SOCD innovatively employs a diffusion-based policy network, complemented by a sampling-free critic network for policy guidance. By integrating the Lagrangian multiplier optimization into the offline reinforcement learning, SOCD effectively trains high-quality constraint-aware policies exclusively from available datasets, eliminating the need for online interactions with the system. Experimental results demonstrate that SOCD is resilient to various system dynamics, including partially observable and large-scale environments, and delivers superior performance compared to existing methods.

[AI-15] Learning Graph Node Embeddings by Smooth Pair Sampling AISTATS2025

链接: https://arxiv.org/abs/2501.12884
作者: Konstantin Kutzkov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted for oral presentation at AISTATS 2025

点击查看摘要

Abstract:Random walk-based node embedding algorithms have attracted a lot of attention due to their scalability and ease of implementation. Previous research has focused on different walk strategies, optimization objectives, and embedding learning models. Inspired by observations on real data, we take a different approach and propose a new regularization technique. More precisely, the frequencies of node pairs generated by the skip-gram model on random walk node sequences follow a highly skewed distribution which causes learning to be dominated by a fraction of the pairs. We address the issue by designing an efficient sampling procedure that generates node pairs according to their \em smoothed frequency. Theoretical and experimental results demonstrate the advantages of our approach.
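
The core trick can be illustrated with a tiny sampler. A minimal sketch, assuming word2vec-style count**alpha smoothing with alpha=0.75 (the paper's exact smoothing function may differ):

```python
import numpy as np
from collections import Counter

def smoothed_pair_sampler(pairs, alpha=0.75, rng=None):
    """Sample skip-gram node pairs proportionally to a *smoothed* frequency
    count**alpha (alpha < 1 flattens the skewed distribution), instead of
    the raw, highly skewed counts."""
    rng = rng if rng is not None else np.random.default_rng(0)
    counts = Counter(pairs)
    uniq = list(counts)
    p = np.array([counts[u] for u in uniq], dtype=float) ** alpha
    p /= p.sum()
    def sample(n):
        return [uniq[i] for i in rng.choice(len(uniq), size=n, p=p)]
    return sample

# toy random-walk pairs: pair (0, 1) dominates the raw distribution
pairs = [(0, 1)] * 900 + [(0, 2)] * 90 + [(1, 2)] * 10
sample = smoothed_pair_sampler(pairs)
print(Counter(sample(1000)))  # rare pairs now appear far more often
```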

[AI-16] Reinforcement learning Based Automated Design of Differential Evolution Algorithm for Black-box Optimization

链接: https://arxiv.org/abs/2501.12881
作者: Xu Yang,Rui Wang,Kaiwen Li,Ling Wang
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Differential evolution (DE) algorithm is recognized as one of the most effective evolutionary algorithms, demonstrating remarkable efficacy in black-box optimization due to its derivative-free nature. Numerous enhancements to the fundamental DE have been proposed, incorporating innovative mutation strategies and sophisticated parameter tuning techniques to improve performance. However, no single variant has proven universally superior across all problems. To address this challenge, we introduce a novel framework that employs reinforcement learning (RL) to automatically design DE for black-box optimization through meta-learning. RL acts as an advanced meta-optimizer, generating a customized DE configuration that includes an optimal initialization strategy, update rule, and hyperparameters tailored to a specific black-box optimization problem. This process is informed by a detailed analysis of the problem characteristics. In this proof-of-concept study, we utilize a double deep Q-network for implementation, considering a subset of 40 possible strategy combinations and parameter optimizations simultaneously. The framework’s performance is evaluated against black-box optimization benchmarks and compared with state-of-the-art algorithms. The experimental results highlight the promising potential of our proposed framework.

[AI-17] Drone Carrier: An Integrated Unmanned Surface Vehicle for Autonomous Inspection and Intervention in GNSS-Denied Maritime Environment

链接: https://arxiv.org/abs/2501.12869
作者: Yihao Dong,Muhayyu Ud Din,Francesco Lagala,Hailiang Kuang,Jianjun Sun,Siyuan Yang,Irfan Hussain,Shaoming He
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 15 pages, 12 figures

点击查看摘要

Abstract:This paper introduces an innovative drone carrier concept for maritime port security and offshore rescue. It operates as a heterogeneous system consisting of multiple Unmanned Aerial Vehicles (UAVs) and Unmanned Surface Vehicles (USVs) that perform inspection and intervention tasks in GNSS-denied or interrupted environments. The carrier, an electric catamaran measuring 4m by 7m, features a 4m by 6m deck supporting automated takeoff and landing for four DJI M300 drones, along with a 10kg-payload manipulator operable in up to level 3 sea conditions. Utilizing an offshore gimbal camera for navigation, the carrier can autonomously navigate, approach and dock with non-cooperative vessels, guided by an onboard camera, LiDAR, and Doppler Velocity Log (DVL) over a 3 km² area. UAVs equipped with onboard Ultra-Wideband (UWB) technology execute mapping, detection, and manipulation tasks using a versatile gripper designed for wet, saline conditions. Additionally, two UAVs can coordinate to transport large objects to the manipulator or interact directly with them. These procedures are fully automated and were successfully demonstrated at the Mohammed Bin Zayed International Robotic Competition (MBZIRC2024), where the drone carrier, equipped with four UAVs and one manipulator, automatically accomplished the intervention tasks in sea-level-3 conditions (wave height 1.25m) based on rough target information.

[AI-18] As Confidence Aligns: Exploring the Effect of AI Confidence on Human Self-confidence in Human-AI Decision Making

链接: https://arxiv.org/abs/2501.12868
作者: Jingshu Li,Yitian Yang,Q. Vera Liao,Junti Zhang,Yi-Chieh Lee
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Complementary collaboration between humans and AI is essential for human-AI decision making. One feasible approach to achieving it involves accounting for the calibrated confidence levels of both AI and users. However, this process would likely be made more difficult by the fact that AI confidence may influence users' self-confidence and its calibration. To explore these dynamics, we conducted a randomized behavioral experiment. Our results indicate that in human-AI decision-making, users' self-confidence aligns with AI confidence, and such alignment can persist even after AI ceases to be involved. This alignment then affects users' self-confidence calibration. We also found that the presence of real-time correctness feedback on decisions reduced the degree of alignment. These findings suggest that users' self-confidence is not independent of AI confidence, which practitioners aiming to achieve better human-AI collaboration need to be aware of. We call for research focusing on the alignment of human cognition and behavior with AI.

[AI-19] Mutation-Guided LLM-based Test Generation at Meta

链接: https://arxiv.org/abs/2501.12862
作者: Christopher Foster,Abhishek Gulati,Mark Harman,Inna Harper,Ke Mao,Jillian Ritchey,Hervé Robert,Shubho Sengupta
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to FSE 2025 Industry Track

点击查看摘要

Abstract:This paper describes Meta's ACH system for mutation-guided LLM-based test generation. ACH generates relatively few mutants (aka simulated faults) compared to traditional mutation testing. Instead, it focuses on generating currently undetected faults that are specific to an issue of concern. From these currently uncaught faults, ACH generates tests that can catch them, thereby 'killing' the mutants and consequently hardening the platform against regressions. We use privacy concerns to illustrate our approach, but ACH can harden code against any type of regression. In total, ACH was applied to 10,795 Android Kotlin classes in 7 software platforms deployed by Meta, from which it generated 9,095 mutants and 571 privacy-hardening test cases. ACH also deploys an LLM-based equivalent mutant detection agent that achieves a precision of 0.79 and a recall of 0.47 (rising to 0.95 and 0.96 with simple pre-processing). ACH was used by Messenger and WhatsApp test-a-thons where engineers accepted 73% of its tests, judging 36% to be privacy relevant. We conclude that ACH hardens code against specific concerns and that, even when its tests do not directly tackle the specific concern, engineers find them useful for their other benefits.

[AI-20] To Measure or Not: A Cost-Sensitive Selective Measuring Environment for Agricultural Management Decisions with Reinforcement Learning AAAI

链接: https://arxiv.org/abs/2501.12823
作者: Hilmy Baja,Michiel Kallenberg,Ioannis N. Athanasiadis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 3 figures, accepted after peer-review at the 39th Annual AAAI Conference on Artificial Intelligence, AI for Social Impact Track, February 2025, Philadelphia, Pennsylvania, USA

点击查看摘要

Abstract:Farmers rely on in-field observations to make well-informed crop management decisions to maximize profit and minimize adverse environmental impact. However, obtaining real-world crop state measurements is labor-intensive, time-consuming and expensive. In most cases, it is not feasible to gather crop state measurements before every decision moment. Moreover, in previous research pertaining to farm management optimization, these observations are often assumed to be readily available without any cost, which is unrealistic. Hence, enabling optimization without the need for temporally complete crop state observations is important. One approach to that problem is to include measuring as part of decision making. As a solution, we apply reinforcement learning (RL) to recommend opportune moments to simultaneously measure crop features and apply nitrogen fertilizer. With realistic considerations, we design an RL environment with explicit crop feature measuring costs. While balancing costs, we find that an RL agent, trained with recurrent PPO, discovers adaptive measuring policies that follow critical crop development stages, with results aligned with what domain experts would consider a sensible approach. Our results highlight the importance of measuring when crop feature measurements are not readily available.
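
To make the "measuring as part of the action" idea concrete, here is a toy gym-style environment sketch in which the action carries a measure flag whose cost is charged to the reward. Every name and dynamic here is illustrative; the paper builds this on a real crop-growth simulator:

```python
import numpy as np

class CostlyMeasureEnv:
    """Toy sketch: action = (fertilizer_amount, measure_flag). Measuring
    returns the true crop state but subtracts a fixed cost from the reward;
    otherwise the agent sees a placeholder. Purely illustrative."""
    MEASURE_COST = 0.5

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.crop_state = self.rng.uniform(0.0, 1.0)   # hidden crop feature
        return np.array([np.nan])                      # unobserved at start

    def step(self, action):
        fertilizer, measure = action
        self.crop_state = min(1.0, self.crop_state + 0.1 * fertilizer)
        reward = self.crop_state - 0.2 * fertilizer    # growth minus input cost
        if measure:
            obs = np.array([self.crop_state])          # pay to observe
            reward -= self.MEASURE_COST
        else:
            obs = np.array([np.nan])                   # act blind, keep the cost
        return obs, reward, False, {}

env = CostlyMeasureEnv()
env.reset()
print(env.step((1.0, True)))
print(env.step((1.0, False)))
```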

[AI-21] Unveiling Zero-Space Detection: A Novel Framework for Autonomous Ransomware Identification in High-Velocity Environments

链接: https://arxiv.org/abs/2501.12811
作者: Lafedi Svet,Arthur Brightwell,Augustus Wildflower,Cecily Marshwood
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Modern cybersecurity landscapes increasingly demand sophisticated detection frameworks capable of identifying evolving threats with precision and adaptability. The proposed Zero-Space Detection framework introduces a novel approach that dynamically identifies latent behavioral patterns through unsupervised clustering and advanced deep learning techniques. Designed to address the limitations of signature-based and heuristic methods, it operates effectively in high-velocity environments by integrating multi-phase filtering and ensemble learning for refined decision-making. Experimental evaluation reveals high detection rates across diverse ransomware families, including LockBit, Conti, REvil, and BlackMatter, while maintaining low false positive rates and scalable performance. Computational overhead remains minimal, with average processing times ensuring compatibility with real-time systems even under peak operational loads. The framework demonstrates resilience against adversarial strategies such as obfuscation and encryption speed variability, which frequently challenge conventional detection systems. Analysis across multiple data sources highlights its versatility in handling diverse file types and operational contexts. Comprehensive metrics, including detection probability, latency, and resource efficiency, validate its efficacy under real-world conditions. Through its modular architecture, the framework achieves seamless integration with existing cybersecurity infrastructures without significant reconfiguration. The results demonstrate its robustness and scalability, offering a transformative paradigm for ransomware identification in dynamic and resource-constrained environments.

[AI-22] Revisit Self-Debugging with Self-Generated Tests for Code Generation

链接: https://arxiv.org/abs/2501.12793
作者: Xiancai Chen,Zhengwei Tao,Kechi Zhang,Changzhi Zhou,Wanli Gu,Yuanpeng He,Mengdi Zhang,Xunliang Cai,Haiyan Zhao,Zhi Jin
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Work in Progress

点击查看摘要

Abstract:Large language models (LLMs) have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of code generation by leveraging execution feedback from tests. Despite its promise, the availability of high-quality tests in real-world scenarios is limited. In this context, self-debugging with self-generated tests is a promising solution but lacks a full exploration of its limitations and practical potential. Therefore, we investigate its efficacy on diverse programming problems. To deepen our understanding, we propose two distinct paradigms for the process: post-execution and in-execution self-debugging. Within the scope of self-contained Python programming tasks, we find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests. On the other hand, in-execution self-debugging enables LLMs to mitigate the bias by solely leveraging intermediate states during execution, thereby enhancing code generation.

[AI-23] On Tradeoffs in Learning-Augmented Algorithms AISTATS2024

链接: https://arxiv.org/abs/2501.12770
作者: Ziyad Benomar,Vianney Perchet
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted as a conference paper at AISTATS 2024

点击查看摘要

Abstract:The field of learning-augmented algorithms has gained significant attention in recent years. These algorithms, using potentially inaccurate predictions, must exhibit three key properties: consistency, robustness, and smoothness. In scenarios where distributional information about predictions is available, a strong expected performance is required. Typically, the design of these algorithms involves a natural tradeoff between consistency and robustness, and previous works aimed to achieve Pareto-optimal tradeoffs for specific problems. However, in some settings, this comes at the expense of smoothness. This paper demonstrates that certain problems involve multiple tradeoffs between consistency, robustness, smoothness, and average performance.
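
A classic worked example of the consistency-robustness tradeoff is ski rental with a prediction: one trust parameter moves the algorithm between following the prediction (good consistency) and hedging against it (good robustness). This is a standard textbook instance for building intuition, not one of the specific problems analyzed in the paper:

```python
import math

def ski_rental_with_prediction(true_days, pred_days, buy_cost, lam=0.5):
    """Learning-augmented ski rental: rent daily (cost 1/day) or buy once
    (cost buy_cost). Trust parameter lam in (0, 1] trades consistency
    against robustness: roughly (1 + lam)-consistent, (1 + 1/lam)-robust."""
    if pred_days >= buy_cost:                      # prediction: long season, buy early
        buy_day = math.ceil(lam * buy_cost)
    else:                                          # prediction: short season, delay buying
        buy_day = math.ceil(buy_cost / lam)
    if true_days < buy_day:
        return true_days                           # rented every day, never bought
    return (buy_day - 1) + buy_cost                # rented, then bought

opt = lambda d, b: min(d, b)                       # offline optimum
for pred in (5, 100):                              # wildly wrong vs accurate prediction
    d, b = 100, 10
    alg = ski_rental_with_prediction(d, pred, b)
    print(f"pred={pred}: ALG={alg}, OPT={opt(d, b)}, ratio={alg / opt(d, b):.2f}")
```

With lam=0.5 the competitive ratio stays below 1.5 when the prediction is right and below 3 no matter how wrong it is, which is exactly the kind of two-sided guarantee the abstract refers to.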

[AI-24] Estimating the Conformal Prediction Threshold from Noisy Labels

链接: https://arxiv.org/abs/2501.12749
作者: Coby Penso,Jacob Goldberger,Ethan Fetaya
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conformal Prediction (CP) is a method to control prediction uncertainty by producing a small prediction set, ensuring a predetermined probability that the true class lies within this set. This is commonly done by defining a score, based on the model predictions, and setting a threshold on this score using a validation set. In this study, we address the problem of CP calibration when we only have access to a validation set with noisy labels. We show how we can estimate the noise-free conformal threshold based on the noisy labeled data. Our solution is flexible and can accommodate various modeling assumptions regarding the label contamination process, without needing any information about the underlying data distribution or the internal mechanisms of the machine learning classifier. We develop a coverage guarantee for uniform noise that is effective even in tasks with a large number of classes. We dub our approach Noise-Aware Conformal Prediction (NACP) and show on several natural and medical image classification datasets, including ImageNet, that it significantly outperforms current noisy label methods and achieves results comparable to those obtained with a clean validation set.
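
For context, the calibration step that label noise corrupts looks like this: a sketch of standard split conformal calibration on synthetic data, showing how uniform label noise inflates the threshold. NACP's noise-aware correction itself follows the paper and is not reproduced here:

```python
import numpy as np

def conformal_threshold(probs, labels, alpha=0.1):
    """Standard split conformal calibration: score = 1 - p(true class);
    threshold = ceil((n+1)(1-alpha))/n empirical quantile of the scores."""
    n = len(labels)
    scores = 1.0 - probs[np.arange(n), labels]
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q, method="higher")

rng = np.random.default_rng(0)
n, C = 5000, 10
labels = rng.integers(0, C, n)
probs = rng.dirichlet(np.ones(C) * 0.3, n) * 0.3   # toy model: ~0.7 on the true class
probs[np.arange(n), labels] += 0.7
probs /= probs.sum(1, keepdims=True)

noisy = labels.copy()                               # uniform label noise: flip 20%
flip = rng.random(n) < 0.2
noisy[flip] = rng.integers(0, C, flip.sum())

print("clean threshold:", conformal_threshold(probs, labels))
print("noisy threshold:", conformal_threshold(probs, noisy))  # inflated -> over-large sets
```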

[AI-25] A Call for Critically Rethinking and Reforming Data Analysis in Empirical Software Engineering

链接: https://arxiv.org/abs/2501.12728
作者: Matteo Esposito,Mikel Robredo,Murali Sridharan,Guilherme Horta Travassos,Rafael Peñaloza,Valentina Lenarduzzi
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
*备注:

点击查看摘要

Abstract:Context: Empirical Software Engineering (ESE) drives innovation in SE through qualitative and quantitative studies. However, concerns about the correct application of empirical methodologies have existed since the 2006 Dagstuhl seminar on SE. Objective: To analyze three decades of SE research, identify mistakes in statistical methods, and evaluate experts' ability to detect and address these issues. Methods: We conducted a literature survey of ~27,000 empirical studies, using LLMs to classify statistical methodologies as adequate or inadequate. Additionally, we selected 30 primary studies and held a workshop with 33 ESE experts to assess their ability to identify and resolve statistical issues. Results: Significant statistical issues were found in the primary studies, and experts showed limited ability to detect and correct these methodological problems, raising concerns about the broader ESE community's proficiency in this area. Conclusions: Despite our study's limitations, its results shed light on recurring issues, from the copy-and-paste propagation of statistical practices from past authors' works to the continued publication of inadequate approaches that promote dubious results and jeopardize the spread of correct statistical strategies among researchers. Moreover, they justify further investigation into empirical rigor in software engineering to expose these recurring issues and establish a framework for reassessing our field's foundation of statistical methodology application. This work therefore calls for critically rethinking and reforming data analysis in empirical software engineering, paving the way for our future work.

[AI-26] HEPPO: Hardware-Efficient Proximal Policy Optimization – A Universal Pipelined Architecture for Generalized Advantage Estimation

链接: https://arxiv.org/abs/2501.12703
作者: Hazem Taha,Ameer M. S. Abdelhadi
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at the 2024 International Conference on Field Programmable Technology (ICFPT 2024)

点击查看摘要

Abstract:This paper introduces HEPPO, an FPGA-based accelerator designed to optimize the Generalized Advantage Estimation (GAE) stage in Proximal Policy Optimization (PPO). Unlike previous approaches that focused on trajectory collection and actor-critic updates, HEPPO addresses GAE’s computational demands with a parallel, pipelined architecture implemented on a single System-on-Chip (SoC). This design allows for the adaptation of various hardware accelerators tailored for different PPO phases. A key innovation is our strategic standardization technique, which combines dynamic reward standardization and block standardization for values, followed by 8-bit uniform quantization. This method stabilizes learning, enhances performance, and manages memory bottlenecks, achieving a 4x reduction in memory usage and a 1.5x increase in cumulative rewards. We propose a solution on a single SoC device with programmable logic and embedded processors, delivering throughput orders of magnitude higher than traditional CPU-GPU systems. Our single-chip solution minimizes communication latency and throughput bottlenecks, significantly boosting PPO training efficiency. Experimental results show a 30% increase in PPO speed and a substantial reduction in memory access time, underscoring HEPPO’s potential for broad applicability in hardware-efficient reinforcement learning algorithms.
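
The GAE stage that HEPPO accelerates is a short backward recursion (Schulman et al.); its strictly sequential dependence on A_{t+1} is what makes a pipelined hardware design attractive. A reference implementation in plain Python, without the paper's quantization and standardization tricks:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1}, with TD residual
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        last = delta + gamma * lam * last                        # sequential dependency
        adv[t] = last
    return adv

rewards = np.ones(5, dtype=np.float32)
values = np.zeros(6, dtype=np.float32)   # V(s_0..s_T), last entry is the bootstrap value
print(gae(rewards, values))
```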

[AI-27] Growth strategies for arbitrary DAG neural architectures

链接: https://arxiv.org/abs/2501.12690
作者: Stella Douka(LISN, TAU),Manon Verbockhaven(LISN, TAU),Théo Rudkiewicz(ENS Paris Saclay, LISN, TAU),Stéphane Rivaud(LISN, TAU),François P Landes(LISN, TAU),Sylvain Chevallier(LISN, TAU),Guillaume Charpiat(LISN, TAU)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep learning has shown impressive results obtained at the cost of training huge neural networks. However, the larger the architecture, the higher the computational, financial, and environmental costs during training and inference. We aim at reducing both training and inference durations. We focus on Neural Architecture Growth, which can increase the size of a small model when needed, directly during training using information from the backpropagation. We expand existing work and freely grow neural networks in the form of any Directed Acyclic Graph by reducing expressivity bottlenecks in the architecture. We explore strategies to reduce excessive computations and steer network growth toward more parameter-efficient architectures.

[AI-28] NBDI: A Simple and Efficient Termination Condition for Skill Extraction from Task-Agnostic Demonstrations

链接: https://arxiv.org/abs/2501.12668
作者: Myunsoo Kim,Hayeong Lee,Seong-Woong Shim,JunHo Seo,Byung-Jun Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Intelligent agents are able to make decisions based on different levels of granularity and duration. Recent advances in skill learning enabled the agent to solve complex, long-horizon tasks by effectively guiding the agent in choosing appropriate skills. However, the practice of using fixed-length skills can easily result in skipping valuable decision points, which ultimately limits the potential for further exploration and faster policy learning. In this work, we propose to learn a simple and efficient termination condition that identifies decision points through a state-action novelty module that leverages agent experience data. Our approach, Novelty-based Decision Point Identification (NBDI), outperforms previous baselines in complex, long-horizon tasks, and remains effective even in the presence of significant variations in the environment configurations of downstream tasks, highlighting the importance of decision point identification in skill learning.

[AI-29] Inverse Reinforcement Learning with Switching Rewards and History Dependency for Characterizing Animal Behaviors

链接: https://arxiv.org/abs/2501.12633
作者: Jingyang Ke,Feiyang Wu,Jiyi Wang,Jeffrey Markowitz,Anqi Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional approaches to studying decision-making in neuroscience focus on simplified behavioral tasks where animals perform repetitive, stereotyped actions to receive explicit rewards. While informative, these methods constrain our understanding of decision-making to short timescale behaviors driven by explicit goals. In natural environments, animals exhibit more complex, long-term behaviors driven by intrinsic motivations that are often unobservable. Recent works in time-varying inverse reinforcement learning (IRL) aim to capture shifting motivations in long-term, freely moving behaviors. However, a crucial challenge remains: animals make decisions based on their history, not just their current state. To address this, we introduce SWIRL (SWitching IRL), a novel framework that extends traditional IRL by incorporating time-varying, history-dependent reward functions. SWIRL models long behavioral sequences as transitions between short-term decision-making processes, each governed by a unique reward function. SWIRL incorporates biologically plausible history dependency to capture how past decisions and environmental contexts shape behavior, offering a more accurate description of animal decision-making. We apply SWIRL to simulated and real-world animal behavior datasets and show that it outperforms models lacking history dependency, both quantitatively and qualitatively. This work presents the first IRL model to incorporate history-dependent policies and rewards to advance our understanding of complex, naturalistic decision-making in animals.

[AI-30] Towards Robust Multi-tab Website Fingerprinting

链接: https://arxiv.org/abs/2501.12622
作者: Xinhao Deng,Xiyuan Zhao,Qilei Yin,Zhuotao Liu,Qi Li,Mingwei Xu,Ke Xu,Jianping Wu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Website fingerprinting enables an eavesdropper to determine which websites a user is visiting over an encrypted connection. State-of-the-art website fingerprinting (WF) attacks have demonstrated effectiveness even against Tor-protected network traffic. However, existing WF attacks have critical limitations on accurately identifying websites in multi-tab browsing sessions, where the holistic pattern of individual websites is no longer preserved, and the number of tabs opened by a client is unknown a priori. In this paper, we propose ARES, a novel WF framework natively designed for multi-tab WF attacks. ARES formulates the multi-tab attack as a multi-label classification problem and solves it using the novel Transformer-based models. Specifically, ARES extracts local patterns based on multi-level traffic aggregation features and utilizes the improved self-attention mechanism to analyze the correlations between these local patterns, effectively identifying websites. We implement a prototype of ARES and extensively evaluate its effectiveness using our large-scale datasets collected over multiple months. The experimental results illustrate that ARES achieves optimal performance in several realistic scenarios. Further, ARES remains robust even against various WF defenses.

[AI-31] Adaptive Data Exploitation in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2501.12620
作者: Mingqi Yuan,Bo Li,Xin Jin,Wenjun Zeng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 40 pages, 37 figures

点击查看摘要

Abstract:We introduce ADEPT: Adaptive Data ExPloiTation, a simple yet powerful framework to enhance the data efficiency and generalization in deep reinforcement learning (RL). Specifically, ADEPT adaptively manages the use of sampled data across different learning stages via multi-armed bandit (MAB) algorithms, optimizing data utilization while mitigating overfitting. Moreover, ADEPT can significantly reduce the computational overhead and accelerate a wide range of RL algorithms. We test ADEPT on benchmarks including Procgen, MiniGrid, and PyBullet. Extensive simulation demonstrates that ADEPT can achieve superior performance with remarkable computational efficiency, offering a practical solution to data-efficient RL. Our code is available at this https URL.
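
The MAB machinery underneath such a scheme is simple. A generic UCB1 sketch with made-up arms (how many gradient epochs to run per batch of experience); ADEPT's actual arms, reward signal, and bandit algorithm are specified in the paper:

```python
import math, random

class UCB1:
    """Generic UCB1 bandit: the scaffolding behind adaptive data
    exploitation. Arms and reward signal below are illustrative."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for a, c in enumerate(self.counts):
            if c == 0:
                return a                      # play each arm once first
        ucb = [m + math.sqrt(2 * math.log(self.t) / c)
               for m, c in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

epochs_per_batch = [1, 3, 10]                 # hypothetical data-reuse settings
bandit = UCB1(len(epochs_per_batch))
random.seed(0)
for step in range(200):
    arm = bandit.select()
    # stand-in reward: moderate reuse helps, heavy reuse overfits
    reward = {0: 0.3, 1: 0.6, 2: 0.2}[arm] + random.gauss(0, 0.1)
    bandit.update(arm, reward)
print("per-arm pulls:", bandit.counts)        # concentrates on the middle arm
```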

[AI-32] Deep Learning-Based Identification of Inconsistent Method Names: How Far Are We?

链接: https://arxiv.org/abs/2501.12617
作者: Taiming Wang,Yuxia Zhang,Lin Jiang,Yi Tang,Guangjie Li,Hui Liu
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Concise and meaningful method names are crucial for program comprehension and maintenance. However, method names may become inconsistent with their corresponding implementations, causing confusion and errors. Several deep learning (DL)-based approaches have been proposed to identify such inconsistencies, with initial evaluations showing promising results. However, these evaluations typically use a balanced dataset, where the number of inconsistent and consistent names are equal. This setup, along with flawed dataset construction, leads to false positives, making reported performance less reliable in real-world scenarios, where most method names are consistent. In this paper, we present an empirical study that evaluates state-of-the-art DL-based methods for identifying inconsistent method names. We create a new benchmark by combining automatic identification from commit histories and manual developer inspections, reducing false positives. We evaluate five representative DL approaches (one retrieval-based and four generation-based) on this benchmark. Our results show that performance drops substantially when moving from the balanced dataset to the new benchmark. We further conduct quantitative and qualitative analyses to understand the strengths and weaknesses of the approaches. Retrieval-based methods perform well on simple methods and those with popular name sub-tokens but fail due to inefficient representation techniques. Generation-based methods struggle with inaccurate similarity calculations and immature name generation. Based on these findings, we propose improvements using contrastive learning and large language models (LLMs). Our study suggests that significant improvements are needed before these DL approaches can be effectively applied to real-world software systems.

[AI-33] Kimi k1.5: Scaling Reinforcement Learning with LLMs

链接: https://arxiv.org/abs/2501.12599
作者: Kimi Team,Angang Du,Bofei Gao,Bowei Xing,Changjiu Jiang,Cheng Chen,Cheng Li,Chenjun Xiao,Chenzhuang Du,Chonghua Liao,Chuning Tang,Congcong Wang,Dehao Zhang,Enming Yuan,Enzhe Lu,Fengxiang Tang,Flood Sung,Guangda Wei,Guokun Lai,Haiqing Guo,Han Zhu,Hao Ding,Hao Hu,Hao Yang,Hao Zhang,Haotian Yao,Haotian Zhao,Haoyu Lu,Haoze Li,Haozhen Yu,Hongcheng Gao,Huabin Zheng,Huan Yuan,Jia Chen,Jianhang Guo,Jianlin Su,Jianzhou Wang,Jie Zhao,Jin Zhang,Jingyuan Liu,Junjie Yan,Junyan Wu,Lidong Shi,Ling Ye,Longhui Yu,Mengnan Dong,Neo Zhang,Ningchen Ma,Qiwei Pan,Qucheng Gong,Shaowei Liu,Shengling Ma,Shupeng Wei,Sihan Cao,Siying Huang,Tao Jiang,Weihao Gao,Weimin Xiong,Weiran He,Weixiao Huang,Wenhao Wu,Wenyang He,Xianghui Wei,Xianqing Jia,Xingzhe Wu,Xinran Xu,Xinxing Zu,Xinyu Zhou,Xuehai Pan,Y. Charles,Yang Li,Yangyang Hu,Yangyang Liu,Yanru Chen,Yejie Wang,Yibo Liu,Yidao Qin,Yifeng Liu,Ying Yang,Yiping Bao,Yulun Du,Yuxin Wu,Yuzhi Wang,Zaida Zhou,Zhaoji Wang,Zhaowei Li,Zhen Zhu,Zheng Zhang,Zhexu Wang,Zhilin Yang,Zhiqi Huang,Zihao Huang,Ziyao Xu,Zonghan Yang
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities – e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista – matching OpenAI’s o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results – e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench – outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).

[AI-34] A Unified Invariant Learning Framework for Graph Classification KDD2025

链接: https://arxiv.org/abs/2501.12595
作者: Yongduo Sui,Jie Sun,Shuyao Wang,Zemin Liu,Qing Cui,Longfei Li,Xiang Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to KDD 2025

点击查看摘要

Abstract:Invariant learning demonstrates substantial potential for enhancing the generalization of graph neural networks (GNNs) with out-of-distribution (OOD) data. It aims to recognize stable features in graph data for classification, based on the premise that these features causally determine the target label, and their influence is invariant to changes in distribution. Along this line, most studies have attempted to pinpoint these stable features by emphasizing explicit substructures in the graph, such as masked or attentive subgraphs, and primarily enforcing the invariance principle in the semantic space, i.e., graph representations. However, we argue that focusing only on the semantic space may not accurately identify these stable features. To address this, we introduce the Unified Invariant Learning (UIL) framework for graph classification. It provides a unified perspective on invariant graph learning, emphasizing both structural and semantic invariance principles to identify more robust stable features. In the graph space, UIL adheres to the structural invariance principle by reducing the distance between graphons over a set of stable features across different environments. Simultaneously, to confirm semantic invariance, UIL underscores that the acquired graph representations should demonstrate exemplary performance across diverse environments. We present both theoretical and empirical evidence to confirm our method’s ability to recognize superior stable features. Moreover, through a series of comprehensive experiments complemented by in-depth analyses, we demonstrate that UIL considerably enhances OOD generalization, surpassing the performance of leading baseline methods. Our codes are available at this https URL.

[AI-35] FedGrAINS: Personalized SubGraph Federated Learning with Adaptive Neighbor Sampling SDM2025

链接: https://arxiv.org/abs/2501.12592
作者: Emir Ceyani,Han Xie,Baturalp Buyukates,Carl Yang,Salman Avestimehr
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
*备注: Accepted to SDM2025 (SIAM Data Mining 2025)

点击查看摘要

Abstract:Graphs are crucial for modeling relational and biological data. As datasets grow larger in real-world scenarios, the risk of exposing sensitive information increases, making privacy-preserving training methods like federated learning (FL) essential to ensure data security and compliance with privacy regulations. Recently proposed personalized subgraph FL methods have become the de-facto standard for training personalized Graph Neural Networks (GNNs) in a federated manner while dealing with the missing links across clients' subgraphs due to privacy restrictions. However, personalized subgraph FL faces significant challenges due to the heterogeneity in client subgraphs, such as degree distributions among the nodes, which complicate federated training of graph models. To address these challenges, we propose FedGrAINS, a novel data-adaptive and sampling-based regularization method for subgraph FL. FedGrAINS leverages generative flow networks (GFlowNets) to evaluate node importance concerning clients' tasks, dynamically adjusting the message-passing step in clients' GNNs. This adaptation reflects task-optimized sampling aligned with a trajectory balance objective. Experimental results demonstrate that the inclusion of FedGrAINS as a regularizer consistently improves the FL performance compared to baselines that do not leverage such regularization.

[AI-36] Leveraging LLMs to Create a Haptic Devices Recommendation System

链接: https://arxiv.org/abs/2501.12573
作者: Yang Liu,Haiwei Dong,Abdulmotaleb El Saddik
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Haptic technology has seen significant growth, yet a lack of awareness of existing haptic device design knowledge hinders development. This paper addresses these limitations by leveraging advancements in Large Language Models (LLMs) to develop a haptic agent, focusing specifically on Grounded Force Feedback (GFF) devices recommendation. Our approach involves automating the creation of a structured haptic device database using information from research papers and product specifications. This database enables the recommendation of relevant GFF devices based on user queries. To ensure precise and contextually relevant recommendations, the system employs a dynamic retrieval method that combines both conditional and semantic searches. Benchmarking against the established UEQ and existing haptic device searching tools, the proposed haptic recommendation agent ranks in the top 10% across all UEQ categories with mean differences favoring the agent in nearly all subscales, and maintains no significant performance bias across different user groups, showcasing superior usability and user satisfaction.

[AI-37] Reinforcement Learning Constrained Beam Search for Parameter Optimization of Paper Drying Under Flexible Constraints

链接: https://arxiv.org/abs/2501.12542
作者: Siyuan Chen,Hanshen Yu,Jamal Yagoobi,Chenhui Shao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Existing approaches to enforcing design constraints in Reinforcement Learning (RL) applications often rely on training-time penalties in the reward function or training/inference-time invalid action masking, but these methods either cannot be modified after training, or are limited in the types of constraints that can be implemented. To address this limitation, we propose Reinforcement Learning Constrained Beam Search (RLCBS) for inference-time refinement in combinatorial optimization problems. This method respects flexible, inference-time constraints that support exclusion of invalid actions and forced inclusion of desired actions, and employs beam search to maximize sequence probability for more sensible constraint incorporation. RLCBS is extensible to RL-based planning and optimization problems that do not require real-time solution, and we apply the method to optimize process parameters for a novel modular testbed for paper drying. An RL agent is trained to minimize energy consumption across varying machine speed levels by generating optimal dryer module and air supply temperature configurations. Our results demonstrate that RLCBS outperforms NSGA-II under complex design constraints on drying module configurations at inference-time, while providing a 2.58-fold or higher speed improvement.
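
A stripped-down version of constrained beam search over a policy's action distribution, with the two constraint types the abstract names: exclusion of invalid actions and forced inclusion of desired ones. The toy policy and the constraint encoding here are assumptions for illustration, not the paper's interface:

```python
import numpy as np

def constrained_beam_search(step_logprobs, horizon, beam=3,
                            excluded=None, forced=None):
    """Beam search over action sequences maximizing total log-probability.
    excluded[t]: actions masked out at step t; forced[t]: the action that
    must be chosen at step t. step_logprobs(seq) returns next-action
    log-probs (here a toy stand-in for the trained RL agent)."""
    excluded, forced = excluded or {}, forced or {}
    beams = [((), 0.0)]
    for t in range(horizon):
        candidates = []
        for seq, score in beams:
            lp = step_logprobs(seq).copy()
            for a in excluded.get(t, ()):          # exclusion constraint
                lp[a] = -np.inf
            if t in forced:                        # forced-inclusion constraint
                mask = np.full_like(lp, -np.inf)
                mask[forced[t]] = lp[forced[t]]
                lp = mask
            for a, l in enumerate(lp):
                if np.isfinite(l):
                    candidates.append((seq + (a,), score + l))
        beams = sorted(candidates, key=lambda x: -x[1])[:beam]
    return beams

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))                   # toy policy: 4 steps, 5 actions
policy = lambda seq: logits[len(seq)] - np.log(np.exp(logits[len(seq)]).sum())
best = constrained_beam_search(policy, horizon=4,
                               excluded={1: [0, 2]}, forced={3: 4})
print(best[0])  # highest-probability sequence satisfying the constraints
```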

[AI-38] Interaction Dataset of Autonomous Vehicles with Traffic Lights and Signs

链接: https://arxiv.org/abs/2501.12536
作者: Zheng Li,Zhipeng Bao,Haoming Meng,Haotian Shi,Qianwen Li,Handong Yao,Xiaopeng Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents the development of a comprehensive dataset capturing interactions between Autonomous Vehicles (AVs) and traffic control devices, specifically traffic lights and stop signs. Derived from the Waymo Motion dataset, our work addresses a critical gap in the existing literature by providing real-world trajectory data on how AVs navigate these traffic control devices. We propose a methodology for identifying and extracting relevant interaction trajectory data from the Waymo Motion dataset, incorporating over 37,000 instances with traffic lights and 44,000 with stop signs. Our methodology includes defining rules to identify various interaction types, extracting trajectory data, and applying a wavelet-based denoising method to smooth the acceleration and speed profiles and eliminate anomalous values, thereby enhancing the trajectory quality. Quality assessment metrics indicate that trajectories obtained in this study have anomaly proportions in acceleration and jerk profiles reduced to near-zero levels across all interaction categories. By making this dataset publicly available, we aim to address the current gap in datasets containing AV interaction behaviors with traffic lights and signs. Based on the organized and published dataset, we can gain a more in-depth understanding of AVs’ behavior when interacting with traffic lights and signs. This will facilitate research on AV integration into existing transportation infrastructures and networks, supporting the development of more accurate behavioral models and simulation tools.
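
Wavelet denoising of a 1-D speed profile takes only a few lines with PyWavelets. The wavelet family, decomposition level, and universal soft threshold below are common defaults, not the paper's reported settings:

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_denoise(signal, wavelet="db4", level=3):
    """Soft-threshold wavelet denoising of a 1-D speed/acceleration profile.
    Universal threshold sigma*sqrt(2*log n); noise sigma estimated from the
    finest detail band (standard practice, assumed here)."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

t = np.linspace(0, 10, 512)
speed = 10 + 2 * np.sin(t) + np.random.default_rng(0).normal(0, 0.8, t.size)
smooth = wavelet_denoise(speed)
print(float(np.std(speed - smooth)))  # most of the injected noise is removed
```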

[AI-39] An Empirically-grounded tool for Automatic Prompt Linting and Repair: A Case Study on Bias Vulnerability and Optimization in Developer Prompts

链接: https://arxiv.org/abs/2501.12521
作者: Dhia Elhaq Rzig,Dhruba Jyoti Paul,Kaiser Pister,Jordan Henkel,Foyzul Hassan
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The tidal wave of advancements in Large Language Models (LLMs) has led to their swift integration into application-level logic. Many software systems now use prompts to interact with these black-box models, combining natural language with dynamic values interpolated at runtime, to perform tasks ranging from sentiment analysis to question answering. Due to the programmatic and structured natural language aspects of these prompts, we refer to them as Developer Prompts. Unlike traditional software artifacts, Dev Prompts blend natural language instructions with artificial languages such as programming and markup languages, thus requiring specialized tools for analysis, distinct from classical software evaluation methods. In response to this need, we introduce PromptDoctor, a tool explicitly designed to detect and correct issues in Dev Prompts. PromptDoctor identifies and addresses problems related to bias, vulnerability, and sub-optimal performance in Dev Prompts, helping mitigate their possible harms. In our analysis of 2,173 Dev Prompts, selected as a representative sample of 40,573 Dev Prompts, we found that 3.46% contained one or more forms of bias and 10.75% were vulnerable to prompt injection attacks. Additionally, 3,310 were amenable to automated prompt optimization. To address these issues, we applied PromptDoctor to the flawed Dev Prompts we discovered. PromptDoctor de-biased 68.29% of the biased Dev Prompts, hardened 41.81% of the vulnerable Dev Prompts, and improved the performance of 37.1% of sub-optimal Dev Prompts. Finally, we developed a PromptDoctor VSCode extension, enabling developers to easily enhance Dev Prompts in their existing development workflows. The data and source code for this work are available at this https URL.

[AI-40] The Finite Element Neural Network Method: One Dimensional Study

链接: https://arxiv.org/abs/2501.12508
作者: Mohammed Abda,Elsa Piollet,Christopher Blake,Frédérick P. Gosselin
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*备注: 27 pages, 13 figures

点击查看摘要

Abstract:The potential of neural networks (NN) in engineering is rooted in their capacity to understand intricate patterns and complex systems, leveraging their universal nonlinear approximation capabilities and high expressivity. Meanwhile, conventional numerical methods, backed by years of meticulous refinement, continue to be the standard for accuracy and dependability. Bridging these paradigms, this research introduces the finite element neural network method (FENNM) within the framework of the Petrov-Galerkin method using convolution operations to approximate the weighted residual of the differential equations. The NN generates the global trial solution, while the test functions belong to the Lagrange test function space. FENNM introduces several key advantages. Notably, the weak-form of the differential equations introduces flux terms that contribute information to the loss function compared to VPINN, hp-VPINN, and cv-PINN. This enables the integration of forcing terms and natural boundary conditions into the loss function similar to conventional finite element method (FEM) solvers, facilitating its optimization, and extending its applicability to more complex problems, which will ease industrial adoption. This study will elaborate on the derivation of FENNM, highlighting its similarities with FEM. Additionally, it will provide insights into optimal utilization strategies and user guidelines to ensure cost-efficiency. Finally, the study illustrates the robustness and accuracy of FENNM by presenting multiple numerical case studies and applying adaptive mesh refinement techniques.

[AI-41] R2D2: Remembering Reflecting and Dynamic Decision Making for Web Agents

链接: https://arxiv.org/abs/2501.12485
作者: Tenghao Huang,Kinjal Basu,Ibrahim Abdelaziz,Pavan Kapanipathi,Jonathan May,Muhao Chen
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The proliferation of web agents necessitates advanced navigation and interaction strategies within complex web environments. Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. The Remember paradigm utilizes a replay buffer that aids agents in reconstructing the web environment dynamically, thus enabling the formulation of a detailed "map" of previously visited pages. This helps in reducing navigational errors and optimizing the decision-making process during web interactions. Conversely, the Reflect paradigm allows agents to learn from past mistakes by providing a mechanism for error analysis and strategy refinement, enhancing overall task performance. We evaluate R2D2 using the WEBARENA benchmark, demonstrating significant improvements over existing methods, including a 50% reduction in navigation errors and a threefold increase in task completion rates. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promises to advance the capabilities of web agents, potentially benefiting various applications such as automated customer service and personal digital assistants.

[AI-42] Degree-Based Logical Adjacency Checking (DBLAC): A Novel Heuristic for Vertex Coloring

链接: https://arxiv.org/abs/2501.12479
作者: Prashant Verma
类目: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce Degree-Based Logical Adjacency Checking (DBLAC), an efficient graph coloring heuristic built on logical AND operations. The logical AND operation enables more effective color assignment and fewer induced colors when vertices share common edges. In this work, we provide a detailed theoretical analysis of DBLAC's time and space complexity, and we demonstrate its effectiveness through extensive experiments on standard benchmark graphs. We compare it with existing algorithms, namely DSATUR and Recursive Largest First (RLF), and show that DBLAC achieves competitive results with respect to both the number of colors used and runtime performance.
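
Representing neighbourhoods as bitmasks makes "is this color class compatible with vertex v" a single logical AND. A degree-ordered greedy coloring in this style, as a sketch of the flavour of check DBLAC builds on (the exact DBLAC rules are in the paper):

```python
def greedy_color_bitmask(adj_masks):
    """Degree-ordered greedy coloring where each vertex's neighbourhood is a
    Python int bitmask, so checking whether color class c touches v's
    neighbours is one AND operation."""
    n = len(adj_masks)
    order = sorted(range(n), key=lambda v: -bin(adj_masks[v]).count("1"))
    color_members = []                         # bitmask of vertices per color class
    coloring = {}
    for v in order:
        for c, members in enumerate(color_members):
            if members & adj_masks[v] == 0:    # one AND: no neighbour uses color c
                color_members[c] |= 1 << v
                coloring[v] = c
                break
        else:
            color_members.append(1 << v)       # open a new color class
            coloring[v] = len(color_members) - 1
    return coloring

# toy graph: a 5-cycle, edges stored as adjacency bitmasks
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
masks = [0] * 5
for u, v in edges:
    masks[u] |= 1 << v
    masks[v] |= 1 << u
print(greedy_color_bitmask(masks))  # an odd cycle needs 3 colors
```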

[AI-43] Adaptive PII Mitigation Framework for Large Language Models AAAI

链接: https://arxiv.org/abs/2501.12465
作者: Shubhi Asthana,Ruchi Mahindru,Bing Zhang,Jorge Sanz
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: This paper has been accepted at PPAI-25, the 6th AAAI Workshop on Privacy-Preserving Artificial Intelligence

点击查看摘要

Abstract:Artificial Intelligence (AI) faces growing challenges from evolving data protection laws and enforcement practices worldwide. Regulations like GDPR and CCPA impose strict compliance requirements on Machine Learning (ML) models, especially concerning personal data use. These laws grant individuals rights such as data correction and deletion, complicating the training and deployment of Large Language Models (LLMs) that rely on extensive datasets. Public data availability does not guarantee its lawful use for ML, amplifying these challenges. This paper introduces an adaptive system for mitigating risk of Personally Identifiable Information (PII) and Sensitive Personal Information (SPI) in LLMs. It dynamically aligns with diverse regulatory frameworks and integrates seamlessly into Governance, Risk, and Compliance (GRC) systems. The system uses advanced NLP techniques, context-aware analysis, and policy-driven masking to ensure regulatory compliance. Benchmarks highlight the system's effectiveness, with an F1 score of 0.95 for Passport Numbers, outperforming tools like Microsoft Presidio (0.33) and Amazon Comprehend (0.54). In human evaluations, the system achieved an average user trust score of 4.6/5, with participants acknowledging its accuracy and transparency. Observations demonstrate stricter anonymization under GDPR compared to CCPA, which permits pseudonymization and user opt-outs. These results validate the system as a scalable and robust solution for enterprise privacy compliance.

[AI-44] Deploying Privacy Guardrails for LLMs: A Comparative Analysis of Real-World Applications AAAI2025

链接: https://arxiv.org/abs/2501.12456
作者: Shubhi Asthana,Bing Zhang,Ruchi Mahindru,Chad DeLuca,Anna Lisa Gentile,Sandeep Gopisetty
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: This paper has been accepted at Deployable AI workshop at AAAI 2025

点击查看摘要

Abstract:The adoption of Large Language Models (LLMs) has revolutionized AI applications but poses significant challenges in safeguarding user privacy. Ensuring compliance with privacy regulations such as GDPR and CCPA while addressing nuanced privacy risks requires robust and scalable frameworks. This paper presents a detailed study of OneShield Privacy Guard, a framework designed to mitigate privacy risks in user inputs and LLM outputs across enterprise and open-source settings. We analyze two real-world deployments: (1) a multilingual privacy-preserving system integrated with Data and Model Factory, focusing on enterprise-scale data governance; and (2) PR Insights, an open-source repository emphasizing automated triaging and community-driven refinements. In Deployment 1, OneShield achieved a 0.95 F1 score in detecting sensitive entities like dates, names, and phone numbers across 26 languages, outperforming state-of-the-art tools such as StarPII and Presidio by up to 12%. Deployment 2, with an average F1 score of 0.86, reduced manual effort by over 300 hours in three months, accurately flagging 8.25% of 1,256 pull requests for privacy risks with enhanced context sensitivity. These results demonstrate OneShield's adaptability and efficacy in diverse environments, offering actionable insights for context-aware entity recognition, automated compliance, and ethical AI adoption. This work advances privacy-preserving frameworks, supporting user trust and compliance across operational contexts.

[AI-45] Enhancing Retrosynthesis with Conformer: A Template-Free Method

链接: https://arxiv.org/abs/2501.12434
作者: Jiaxi Zhuang,Qian Zhang,Ying Qian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrosynthesis plays a crucial role in the fields of organic synthesis and drug development, where the goal is to identify suitable reactants that can yield a target product molecule. Although existing methods have achieved notable success, they typically overlook the 3D conformational details and internal spatial organization of molecules. This oversight makes it challenging to predict reactants that conform to genuine chemical principles, particularly when dealing with complex molecular structures, such as polycyclic and heteroaromatic compounds. In response to this challenge, we introduce a novel transformer-based, template-free approach that incorporates 3D conformer data and spatial information. Our approach includes an Atom-align Fusion module that integrates 3D positional data at the input stage, ensuring correct alignment between atom tokens and their respective 3D coordinates. Additionally, we propose a Distance-weighted Attention mechanism that refines the self-attention process, constricting the model's focus to relevant atom pairs in 3D space. Extensive experiments on the USPTO-50K dataset demonstrate that our model outperforms previous template-free methods, setting a new benchmark for the field. A case study further highlights our method's ability to predict reasonable and accurate reactants.
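
The shape of a distance-weighted attention update is easy to sketch: subtract a scaled pairwise-distance matrix from the content logits before the softmax, so nearby atom pairs dominate. The subtractive gamma*D form below is one simple choice; the paper's exact parameterization may differ:

```python
import numpy as np

def distance_weighted_attention(q, k, v, coords, gamma=1.0):
    """Single-head self-attention whose logits are down-weighted by 3D
    inter-atom distance: logits = QK^T/sqrt(d) - gamma * D."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                        # (n, n) content term
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                 # (n, n) Euclidean distances
    logits = logits - gamma * dist                       # far pairs get small weight
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
n, d = 6, 16                                             # 6 atoms, 16-dim features
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
coords = rng.normal(size=(n, 3))                         # 3D conformer coordinates
print(distance_weighted_attention(q, k, v, coords).shape)  # (6, 16)
```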

[AI-46] SCFCRC: Simultaneously Counteract Feature Camouflage and Relation Camouflage for Fraud Detection

链接: https://arxiv.org/abs/2501.12430
作者: Xiaocheng Zhang,Zhuangzhuang Ye,GuoPing Zhao,Jianing Wang,Xiaohong Su
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In fraud detection, fraudsters often interact with many benign users, camouflaging their features or relations to hide themselves. Most existing work concentrates solely on either feature camouflage or relation camouflage, or decouples feature learning and relation learning to prevent the two types of camouflage from affecting each other. However, this inadvertently neglects the valuable information derived from features or relations, which could mutually enhance their adversarial camouflage strategies. In response to this gap, we propose SCFCRC, a Transformer-based fraud detector that Simultaneously Counteracts Feature Camouflage and Relation Camouflage. SCFCRC consists of two components: a Feature Camouflage Filter and a Relation Camouflage Refiner. The feature camouflage filter utilizes pseudo labels generated through label propagation to train the filter and uses contrastive learning that combines instance-wise and prototype-wise objectives to improve the quality of features. The relation camouflage refiner uses a Mixture-of-Experts (MoE) network to disassemble the multi-relation graph into multiple substructures and divide and conquer them to mitigate the degradation of detection performance caused by relation camouflage. Furthermore, we introduce a regularization method for MoE to enhance the robustness of the model. Extensive experiments on two fraud detection benchmark datasets demonstrate that our method outperforms state-of-the-art baselines.

[AI-47] Fuel Efficiency Analysis of the Public Transportation System Based on the Gaussian Mixture Model Clustering

链接: https://arxiv.org/abs/2501.12429
作者: Zhipeng Ma,Bo Nørregaard Jørgensen,Zheng Ma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Public transportation is a major source of greenhouse gas emissions, highlighting the need to improve bus fuel efficiency. Clustering algorithms assist in analyzing fuel efficiency by grouping data into clusters, but irrelevant features may complicate the analysis and choosing the optimal number of clusters remains a challenging task. Therefore, this paper employs the Gaussian mixture models to cluster the solo fuel-efficiency dataset. Moreover, an integration method that combines the Silhouette index, Calinski-Harabasz index, and Davies-Bouldin index is developed to select the optimal cluster numbers. A dataset with 4006 bus trips in North Jutland, Denmark is utilized as the case study. Trips are first split into three groups, then one group is divided further, resulting in four categories: extreme, normal, low, and extremely low fuel efficiency. A preliminary study using visualization analysis is conducted to investigate how driving behaviors and route conditions affect fuel efficiency. The results indicate that both individual driving habits and route characteristics have a significant influence on fuel efficiency.
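
As an illustration of the cluster-number selection step, the hedged sketch below fits Gaussian mixtures for several candidate cluster counts and combines the Silhouette, Calinski-Harabasz, and Davies-Bouldin indices; the rank-averaging rule used to integrate the three indices is an assumption, since the abstract does not spell out the exact combination formula.

```python
# A minimal sketch of selecting the number of GMM clusters by combining
# three cluster-validity indices; the rank-averaging rule is an assumption.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in (0, 3, 6)])

scores = {}
for k in range(2, 7):
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels),          # higher is better
                 calinski_harabasz_score(X, labels),   # higher is better
                 davies_bouldin_score(X, labels))      # lower is better

ks = sorted(scores)

def ranks(vals, higher_better=True):
    # Rank values so that "better" scores receive higher ranks.
    order = np.argsort(vals if higher_better else -np.asarray(vals))
    r = np.empty(len(vals))
    r[order] = np.arange(len(vals))
    return r

sil, ch, db = (np.array([scores[k][i] for k in ks]) for i in range(3))
avg_rank = (ranks(sil) + ranks(ch) + ranks(db, higher_better=False)) / 3
best_k = ks[int(np.argmax(avg_rank))]
print("selected number of clusters:", best_k)
```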

[AI-48] SplitQuant: Layer Splitting for Low-Bit Neural Network Quantization

链接: https://arxiv.org/abs/2501.12428
作者: Jaewoo Song,Fangzhen Lin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted in the EDGE AI Research Track 2025. This version is not the final version. The camera-ready version will be uploaded soon

点击查看摘要

Abstract:Quantization for deep neural networks (DNNs) is the process of mapping the parameter values of DNNs from original data types to other data types of lower precision to reduce model sizes and make inference faster. Quantization often maps different original values to a single quantized value because the range of the original values is larger than the range of the quantized values. This leads to the degradation of the accuracy of the quantized DNNs. Outliers are a main cause of the degradation of quantization resolution because they enlarge the range of original values. To solve the problem, the percentile method is often used to clip outliers. However, clipping the outliers has another problem of removing the important and strong signals in the DNNs. This paper proposes SplitQuant to keep the outliers and improve the quantization resolution at the same time. SplitQuant narrows down the range of the original values and mitigates the effect of outliers by splitting each quantizable layer into three mathematically equivalent layers and applying different scaling factors. In particular, weights and biases are clustered into lower, middle and upper clusters for an optimized split. By preprocessing DNNs with SplitQuant, quantization algorithms can achieve better results. SplitQuant was applied on two BERT-Tiny models and improved the accuracy of INT2 quantization by 3.3%p and 2.1%p, achieving accuracies comparable to those of the original FP32 models.
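
A hedged numpy sketch of the layer-splitting idea follows: a weight tensor is decomposed exactly into lower, middle, and upper parts, each quantized with its own scale, so the dequantized parts sum back to an approximation of the original. The quantile-based split below stands in for the clustering described in the abstract.

```python
# Illustrative sketch of the SplitQuant idea: an exact 3-way split of the
# weights, quantized per part. Quantile thresholds replace the paper's clustering.
import numpy as np

def quantize(p, bits):
    """Uniform affine quantization of one part, returning the dequantized values."""
    qmax = 2 ** bits - 1
    scale = max((p.max() - p.min()) / qmax, 1e-12)
    q = np.round((p - p.min()) / scale)
    return q * scale + p.min()

def split_quantize(w, bits=2, q_lo=0.05, q_hi=0.95):
    lo, hi = np.quantile(w, [q_lo, q_hi])
    # Exact decomposition: the three parts always sum back to w.
    parts = [np.clip(w, None, lo),
             np.clip(w, lo, hi) - lo,
             np.clip(w, hi, None) - hi]
    return sum(quantize(p, bits) for p in parts)

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
w[:5] *= 10  # inject outliers that enlarge the value range
err_plain = np.abs(quantize(w, 2) - w).mean()
err_split = np.abs(split_quantize(w) - w).mean()
print(f"plain INT2 error {err_plain:.4f} vs split error {err_split:.4f}")
```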

[AI-49] SafePowerGraph-HIL: Real-Time HIL Validation of Heterogeneous GNNs for Bridging Sim-to-Real Gap in Power Grids

链接: https://arxiv.org/abs/2501.12427
作者: Aoxiang Ma,Salah Ghamizi,Jun Cao,Pedro Rodriguez
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages, 5 figures

点击查看摘要

Abstract:As machine learning (ML) techniques gain prominence in power system research, validating these methods’ effectiveness under real-world conditions requires real-time hardware-in-the-loop (HIL) simulations. HIL simulation platforms enable the integration of computational models with physical devices, allowing rigorous testing across diverse scenarios critical to system resilience and reliability. In this study, we develop a SafePowerGraph-HIL framework that utilizes HIL simulations on the IEEE 9-bus system, modeled in Hypersim, to generate high-fidelity data, which is then transmitted in real-time via SCADA to an AWS cloud database before being input into a Heterogeneous Graph Neural Network (HGNN) model designed for power system state estimation and dynamic analysis. By leveraging Hypersim’s capabilities, we simulate complex grid interactions, providing a robust dataset that captures critical parameters for HGNN training. The trained HGNN is subsequently validated using newly generated data under varied system conditions, demonstrating accuracy and robustness in predicting power system states. The results underscore the potential of integrating HIL with advanced neural network architectures to enhance the real-time operational capabilities of power systems. This approach represents a significant advancement toward the development of intelligent, adaptive control strategies that support the robustness and resilience of evolving power grids.

[AI-50] Multi-Modality Collaborative Learning for Sentiment Analysis

链接: https://arxiv.org/abs/2501.12424
作者: Shanmin Wang,Chengguang Liu,Qingshan Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multimodal sentiment analysis (MSA) identifies individuals’ sentiment states in videos by integrating visual, audio, and text modalities. Despite progress in existing methods, the inherent modality heterogeneity limits the effective capture of interactive sentiment features across modalities. In this paper, by introducing a Multi-Modality Collaborative Learning (MMCL) framework, we facilitate cross-modal interactions and capture enhanced and complementary features from modality-common and modality-specific representations, respectively. Specifically, we design a parameter-free decoupling module and separate uni-modality into modality-common and modality-specific components through semantics assessment of cross-modal elements. For modality-specific representations, inspired by the act-reward mechanism in reinforcement learning, we design policy models to adaptively mine complementary sentiment features under the guidance of a joint reward. For modality-common representations, intra-modal attention is employed to highlight crucial components, playing enhanced roles among modalities. Experimental results, including superiority evaluations on four databases, effectiveness verification of each module, and assessment of complementary features, demonstrate that MMCL successfully learns collaborative features across modalities and significantly improves performance. The code is available at this https URL.

[AI-51] FREYR: A Framework for Recognizing and Executing Your Requests

链接: https://arxiv.org/abs/2501.12423
作者: Roberto Gallotta,Antonios Liapis,Georgios N. Yannakakis
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 15 pages

点击查看摘要

Abstract:Large language models excel as conversational agents, but their capabilities can be further extended through tool usage, i.e., executable code, to enhance response accuracy or address specialized domains. Current approaches to enable tool usage often rely on model-specific prompting or fine-tuning a model for function-calling instructions. Both approaches have notable limitations, including reduced adaptability to unseen tools and high resource requirements. This paper introduces FREYR, a streamlined framework that modularizes the tool usage process into separate steps. Through this decomposition, we show that FREYR achieves superior performance compared to conventional tool usage methods. We evaluate FREYR on a set of real-world test cases specific for video game design and compare it against traditional tool usage as provided by the Ollama API.

[AI-52] Tackling Small Sample Survival Analysis via Transfer Learning: A Study of Colorectal Cancer Prognosis

链接: https://arxiv.org/abs/2501.12421
作者: Yonghao Zhao,Changtao Li,Chi Shu,Qingbin Wu,Hong Li,Chuan Xu,Tianrui Li,Ziqiang Wang,Zhipeng Luo,Yazhou He
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Survival prognosis is crucial for medical informatics. Practitioners often confront small-sized clinical data, especially cancer patient cases, which can be insufficient to induce useful patterns for survival predictions. This study deals with small sample survival analysis by leveraging transfer learning, a useful machine learning technique that can enhance the target analysis with related knowledge pre-learned from other data. We propose and develop various transfer learning methods designed for common survival models. For parametric models such as DeepSurv, Cox-CC (Cox-based neural networks), and DeepHit (end-to-end deep learning model), we apply standard transfer learning techniques like pretraining and fine-tuning. For non-parametric models such as Random Survival Forest, we propose a new transfer survival forest (TSF) model that transfers tree structures from source tasks and fine-tunes them with target data. We evaluated the transfer learning methods on colorectal cancer (CRC) prognosis. The source data are 27,379 SEER CRC stage I patients, and the target data are 728 CRC stage I patients from the West China Hospital. When enhanced by transfer learning, Cox-CC’s C^td value was boosted from 0.7868 to 0.8111, DeepHit’s from 0.8085 to 0.8135, DeepSurv’s from 0.7722 to 0.8043, and RSF’s from 0.7940 to 0.8297 (the highest performance). All models trained with data as small as 50 demonstrated even more significant improvement. Conclusions: Therefore, the current survival models used for cancer prognosis can be enhanced and improved by properly designed transfer learning techniques. The source code used in this study is available at this https URL.

[AI-53] Consolidating TinyML Lifecycle with Large Language Models: Reality, Illusion, or Opportunity?

链接: https://arxiv.org/abs/2501.12420
作者: Guanghan Wu,Sasu Tarkoma,Roberto Morabito
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper is currently under review for publication in an IEEE magazine. If accepted, the copyright will be transferred to IEEE. This work was presented at the TinyML Foundation (now Edge AI Foundation) event, titled Beyond LLMs and Chatbots: The Journey to Generative AI at the Edge. For more details, the presentation is available at this https URL

点击查看摘要

Abstract:The evolving requirements of Internet of Things (IoT) applications are driving an increasing shift toward bringing intelligence to the edge, enabling real-time insights and decision-making within resource-constrained environments. Tiny Machine Learning (TinyML) has emerged as a key enabler of this evolution, facilitating the deployment of ML models on devices such as microcontrollers and embedded systems. However, the complexity of managing the TinyML lifecycle, including stages such as data processing, model optimization and conversion, and device deployment, presents significant challenges and often requires substantial human intervention. Motivated by these challenges, we began exploring whether Large Language Models (LLMs) could help automate and streamline the TinyML lifecycle. We developed a framework that leverages the natural language processing (NLP) and code generation capabilities of LLMs to reduce development time and lower the barriers to entry for TinyML deployment. Through a case study involving a computer vision classification model, we demonstrate the framework’s ability to automate key stages of the TinyML lifecycle. Our findings suggest that LLM-powered automation holds potential for improving the lifecycle development process and adapting to diverse requirements. However, while this approach shows promise, there remain obstacles and limitations, particularly in achieving fully automated solutions. This paper sheds light on both the challenges and opportunities of integrating LLMs into TinyML workflows, providing insights into the path forward for efficient, AI-assisted embedded system development.

[AI-54] Control-ITRA: Controlling the Behavior of a Driving Model

链接: https://arxiv.org/abs/2501.12408
作者: Vasileios Lioutas,Adam Scibior,Matthew Niedoba,Berend Zwartsenberg,Frank Wood
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 16 pages, 2 figures

点击查看摘要

Abstract:Simulating realistic driving behavior is crucial for developing and testing autonomous systems in complex traffic environments. Equally important is the ability to control the behavior of simulated agents to tailor scenarios to specific research needs and safety considerations. This paper extends the general-purpose multi-agent driving behavior model ITRA (Scibior et al., 2021), by introducing a method called Control-ITRA to influence agent behavior through waypoint assignment and target speed modulation. By conditioning agents on these two aspects, we provide a mechanism for them to adhere to specific trajectories and indirectly adjust their aggressiveness. We compare different approaches for integrating these conditions during training and demonstrate that our method can generate controllable, infraction-free trajectories while preserving realism in both seen and unseen locations.

[AI-55] Data re-uploading in Quantum Machine Learning for time series: application to traffic forecasting

链接: https://arxiv.org/abs/2501.12776
作者: Nikolaos Schetakis,Paolo Bonfini,Negin Alisoltani,Konstantinos Blazakis,Symeon I. Tsintzos,Alexis Askitopoulos,Davit Aghamalyan,Panagiotis Fafoutellis,Eleni I. Vlahogianni
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Accurate traffic forecasting plays a crucial role in modern Intelligent Transportation Systems (ITS), as it enables real-time traffic flow management, reduces congestion, and improves the overall efficiency of urban transportation networks. With the rise of Quantum Machine Learning (QML), a new paradigm has emerged with the potential to enhance predictive capabilities beyond what classical machine learning models can achieve. In the present work we pursue a heuristic approach to explore the potential of QML, and focus on a specific transport issue. In particular, as a case study we investigate a traffic forecast task for a major urban area in Athens (Greece), for which we possess high-resolution data. In this endeavor we explore the application of Quantum Neural Networks (QNN), and, notably, we present the first application of quantum data re-uploading in the context of transport forecasting. This technique allows quantum models to better capture complex patterns, such as traffic dynamics, by repeatedly encoding classical data into a quantum state. Aside from providing a prediction model, we spend considerable effort in comparing the performance of our hybrid quantum-classical neural networks with classical deep learning approaches. Our results show that hybrid models achieve competitive accuracy with state-of-the-art classical methods, especially when the number of qubits and re-uploading blocks is increased. While the classical models demonstrate lower computational demands, we provide evidence that increasing the complexity of the quantum model improves predictive accuracy. These findings indicate that QML techniques, and specifically the data re-uploading approach, hold promise for advancing traffic forecasting models and could be instrumental in addressing challenges inherent in ITS environments.
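
For readers unfamiliar with data re-uploading, the numpy sketch below simulates a single-qubit circuit that repeatedly encodes the same classical feature between trainable rotations. It illustrates the general technique (in the style of Pérez-Salinas et al.), not the paper's traffic-forecasting model.

```python
# Illustrative numpy simulation of single-qubit data re-uploading:
# the same classical feature is encoded repeatedly between trainable rotations.
import numpy as np

def ry(theta):
    """Single-qubit Y-rotation gate as a 2x2 real matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def reupload_circuit(x, thetas):
    """x: scalar feature; thetas: trainable angles, one per re-uploading block."""
    state = np.array([1.0, 0.0])       # |0>
    for theta in thetas:
        state = ry(x) @ state          # encode the data (again)
        state = ry(theta) @ state      # trainable processing rotation
    return abs(state[1]) ** 2          # P(|1>) used as the class score

thetas = np.array([0.3, -1.2, 0.7])    # e.g. 3 re-uploading blocks
for x in (0.0, 1.5, 3.0):
    print(x, reupload_circuit(x, thetas))
```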

[AI-56] Practical quantum federated learning and its experimental demonstration

链接: https://arxiv.org/abs/2501.12709
作者: Zhi-Ping Liu,Xiao-Yu Cao,Hao-Wen Liu,Xiao-Ran Sun,Yu Bao,Yu-Shuo Lu,Hua-Lei Yin,Zeng-Bing Chen
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 21 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Federated learning is essential for decentralized, privacy-preserving model training in the data-driven era. Quantum-enhanced federated learning leverages quantum resources to address privacy and scalability challenges, offering security and efficiency advantages beyond classical methods. However, practical and scalable frameworks addressing privacy concerns in the quantum computing era remain undeveloped. Here, we propose a practical quantum federated learning framework on quantum networks, utilizing distributed quantum secret keys to protect local model updates and enable secure aggregation with information-theoretic security. We experimentally validate our framework on a 4-client quantum network with a scalable structure. Extensive numerical experiments on both quantum and classical datasets show that adding a quantum client significantly enhances the trained global model’s ability to classify multipartite entangled and non-stabilizer quantum datasets. Simulations further demonstrate scalability to 200 clients with classical models trained on the MNIST dataset, reducing communication costs by 75% through advanced model compression techniques and achieving rapid training convergence. Our work provides critical insights for building scalable, efficient, and quantum-secure machine learning systems for the coming quantum internet era.

[AI-57] GATE: Adaptive Learning with Working Memory by Information Gating in Multi-lamellar Hippocampal Formation

链接: https://arxiv.org/abs/2501.12615
作者: Yuechen Liu,Zishun Wang,Chen Qiao,Zongben Xu
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Hippocampal formation (HF) can rapidly adapt to varied environments and build flexible working memory (WM). To mirror the HF's mechanisms of generalization and WM, we propose a model named Generalization and Associative Temporary Encoding (GATE), which deploys a 3-D multi-lamellar dorsoventral (DV) architecture and learns to build up internal representations from externally driven information layer by layer. In each lamella, the HF regions EC3-CA1-EC5-EC3 form a re-entrant loop that discriminately maintains information via EC3 persistent activity and selectively reads out the retained information via CA1 neurons. CA3 and EC5 further provide gating functions that control these processes. After learning complex WM tasks, GATE forms neuron representations that align with experimental records, including splitter, lap, evidence, trace, delay-active cells, as well as conventional place cells. Crucially, the DV architecture in GATE also captures information ranging from detailed to abstract, which enables rapid generalization when cues, environments, or tasks change, with learned representations inherited. GATE promises a viable framework for understanding the HF's flexible memory mechanisms and for progressively developing brain-inspired intelligent systems.

机器学习

[LG-0] One-Class Domain Adaptation via Meta-Learning

链接: https://arxiv.org/abs/2501.13052
作者: Stephanie Holly,Thomas Bierweiler,Stefan von Dosky,Ahmed Frikha,Clemens Heitzinger,Jana Eder
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The deployment of IoT (Internet of Things) sensor-based machine learning models in industrial systems for anomaly classification tasks poses significant challenges due to distribution shifts, as the training data acquired in controlled laboratory settings may significantly differ from real-time data in production environments. Furthermore, many real-world applications cannot provide a substantial number of labeled examples for each anomalous class in every new environment. It is therefore crucial to develop adaptable machine learning models that can be effectively transferred from one environment to another, enabling rapid adaptation using normal operational data. We extended this problem setting to an arbitrary classification task and formulated the one-class domain adaptation (OC-DA) problem setting. We took a meta-learning approach to tackle the challenge of OC-DA, and proposed a task sampling strategy to adapt any bi-level meta-learning algorithm to OC-DA. We modified the well-established model-agnostic meta-learning (MAML) algorithm and introduced the OC-DA MAML algorithm. We provided a theoretical analysis showing that OC-DA MAML optimizes for meta-parameters that enable rapid one-class adaptation across domains. The OC-DA MAML algorithm is evaluated on the Rainbow-MNIST meta-learning benchmark and on a real-world dataset of vibration-based sensor readings. The results show that OC-DA MAML significantly improves the performance on the target domains and outperforms MAML using the standard task sampling strategy.
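
A hedged PyTorch sketch of the OC-DA training-loop idea: the inner adaptation step sees only normal-class data from a domain, while the outer meta-update is evaluated on both classes. The toy model, the synthetic tasks, and the losses are assumptions for illustration only, not the authors' implementation.

```python
# MAML-style sketch of one-class domain adaptation (requires torch >= 2.0):
# adapt on normal data only, then meta-update on a two-class query set.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.1

def adapt_on_normals(params, x_normal):
    # One inner gradient step that pulls normal samples toward class 0.
    logits = torch.func.functional_call(model, params, (x_normal,))
    target = torch.zeros(len(x_normal), dtype=torch.long)
    loss = nn.functional.cross_entropy(logits, target)
    grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
    return {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}

for step in range(100):
    params = dict(model.named_parameters())
    x_norm = torch.randn(8, 4)                  # support set: normal data only
    adapted = adapt_on_normals(params, x_norm)
    x_q = torch.randn(16, 4)                    # query set: both classes
    y_q = torch.randint(0, 2, (16,))
    logits = torch.func.functional_call(model, adapted, (x_q,))
    meta_loss = nn.functional.cross_entropy(logits, y_q)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```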

[LG-1] TimeFilter: Patch-Specific Spatial-Temporal Graph Filtration for Time Series Forecasting

链接: https://arxiv.org/abs/2501.13041
作者: Yifan Hu,Guibin Zhang,Peiyuan Liu,Disen Lan,Naiqi Li,Dawei Cheng,Tao Dai,Shu-Tao Xia,Shirui Pan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Current time series forecasting methods can be broadly classified into two categories: Channel Independent (CI) and Channel Dependent (CD) strategies, both aiming to capture the complex dependencies within time series data. However, the CI strategy fails to exploit highly correlated covariate information, while the CD strategy integrates all dependencies, including irrelevant or noisy ones, thus compromising generalization. To mitigate these issues, recent works have introduced the Channel Clustering (CC) strategy by grouping channels with similar characteristics and applying different modeling techniques to each cluster. However, coarse-grained clustering cannot flexibly capture complex, time-varying interactions. Addressing the above challenges, we propose TimeFilter, a graph-based framework for adaptive and fine-grained dependency modeling. Specifically, after constructing the graph with the input sequence, TimeFilter filters out irrelevant correlations and preserves the most critical ones through patch-specific filtering. Extensive experiments on 13 real-world datasets from various application domains demonstrate the state-of-the-art performance of TimeFilter. The code is available at this https URL.

[LG-2] A Probabilistic Model for Self-Supervised Learning

链接: https://arxiv.org/abs/2501.13031
作者: Maximilian Fleissner,Pascal Esser,Debarghya Ghoshdastidar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) aims to find meaningful representations from unlabeled data by encoding semantic similarities through data augmentations. Despite its current popularity, theoretical insights about SSL are still scarce. For example, it is not yet known whether commonly used SSL loss functions can be related to a statistical model, much in the same way as OLS, generalized linear models or PCA naturally emerge as maximum likelihood estimates of an underlying generative process. In this short paper, we consider a latent variable statistical model for SSL that exhibits an interesting property: Depending on the informativeness of the data augmentations, the MLE of the model either reduces to PCA, or approaches a simple non-contrastive loss. We analyze the model and also empirically illustrate our findings.

[LG-3] Multi-Objective Hyperparameter Selection via Hypothesis Testing on Reliability Graphs

链接: https://arxiv.org/abs/2501.13018
作者: Amirmohammad Farzaneh,Osvaldo Simeone
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:In sensitive application domains, multi-objective hyperparameter selection can ensure the reliability of AI models prior to deployment, while optimizing auxiliary performance metrics. The state-of-the-art Pareto Testing (PT) method guarantees statistical reliability constraints by adopting a multiple hypothesis testing framework. In PT, hyperparameters are validated one at a time, following a data-driven order determined by expected reliability levels. This paper introduces a novel framework for multi-objective hyperparameter selection that captures the interdependencies among the reliability levels of different hyperparameter configurations using a directed acyclic graph (DAG), which is termed the reliability graph (RG). The RG is constructed based on prior information and data by using the Bradley-Terry model. The proposed approach, RG-based PT (RG-PT), leverages the RG to enable the efficient, parallel testing of multiple hyperparameters at the same reliability level. By integrating False Discovery Rate (FDR) control, RG-PT ensures robust statistical reliability guarantees and is shown via experiments across diverse domains to consistently yield superior solutions for multi-objective calibration problems.

[LG-4] The regret lower bound for communicating Markov Decision Processes

链接: https://arxiv.org/abs/2501.13013
作者: Victor Boone,Odalric-Ambrym Maillard
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper is devoted to the extension of the regret lower bound beyond ergodic Markov decision processes (MDPs) in the problem-dependent setting. While the regret lower bound for ergodic MDPs is well-known and reached by tractable algorithms, we prove that the regret lower bound becomes significantly more complex in communicating MDPs. Our lower bound revisits the necessary explorative behavior of consistent learning agents and further explains that all optimal regions of the environment must be overvisited compared to sub-optimal ones, a phenomenon that we refer to as co-exploration. In tandem, we show that these two explorative and co-explorative behaviors are intertwined with navigation constraints obtained by scrutinizing the navigation structure at logarithmic scale. The resulting lower bound is expressed as the solution of an optimization problem that, in many standard classes of MDPs, can be specialized to recover existing results. From a computational perspective, it is provably $\Sigma_2^{\mathrm{P}}$-hard in general and, as a matter of fact, even testing membership in the feasible region is coNP-hard. We further provide an algorithm to approximate the lower bound in a constructive way.

[LG-5] An Offline Multi-Agent Reinforcement Learning Framework for Radio Resource Management

链接: https://arxiv.org/abs/2501.12991
作者: Eslam Eldeeb,Hirley Alves
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline multi-agent reinforcement learning (MARL) addresses key limitations of online MARL, such as safety concerns, expensive data collection, extended training intervals, and high signaling overhead caused by online interactions with the environment. In this work, we propose an offline MARL algorithm for radio resource management (RRM), focusing on optimizing scheduling policies for multiple access points (APs) to jointly maximize the sum and tail rates of user equipment (UEs). We evaluate three training paradigms: centralized, independent, and centralized training with decentralized execution (CTDE). Our simulation results demonstrate that the proposed offline MARL framework outperforms conventional baseline approaches, achieving over a 15% improvement in a weighted combination of sum and tail rates. Additionally, the CTDE framework strikes an effective balance, reducing the computational complexity of centralized methods while addressing the inefficiencies of independent training. These results underscore the potential of offline MARL to deliver scalable, robust, and efficient solutions for resource management in dynamic wireless networks.

[LG-6] Correctness Assessment of Code Generated by Large Language Models Using Internal Representations

链接: https://arxiv.org/abs/2501.12934
作者: Tuan-Dung Bui,Thanh Trong Vu,Thu-Trang Nguyen,Son Nguyen,Hieu Dinh Vo
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensuring the correctness of code generated by Large Language Models (LLMs) presents a significant challenge in AI-driven software development. Existing approaches predominantly rely on black-box (closed-box) approaches that evaluate correctness post-generation, failing to utilize the rich insights embedded in the LLMs’ internal states during code generation. In this paper, we introduce OPENIA, a novel white-box (open-box) framework that leverages these internal representations to assess the correctness of LLM-generated code. OPENIA systematically analyzes the intermediate states of representative open-source LLMs specialized for code, including DeepSeek-Coder, CodeLlama, and MagicCoder, across diverse code generation benchmarks. Our empirical analysis reveals that these internal representations encode latent information, which strongly correlates with the correctness of the generated code. Building on these insights, OPENIA uses a white-box/open-box approach to make informed predictions about code correctness, offering significant advantages in adaptability and robustness over traditional classification-based methods and zero-shot approaches. Experimental results demonstrate that OPENIA consistently outperforms baseline models, achieving higher accuracy, precision, recall, and F1-Scores with up to a 2X improvement in standalone code generation and a 46% enhancement in repository-specific scenarios. By unlocking the potential of in-process signals, OPENIA paves the way for more proactive and efficient quality assurance mechanisms in LLM-assisted code generation.
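
The white-box probing idea can be illustrated in a few lines: train a lightweight classifier on an LLM's internal representations to predict whether the generated code is correct. In this hedged sketch the hidden states and pass/fail labels are stubbed with random data; OPENIA's actual features and architecture are richer.

```python
# A minimal probe on (stubbed) LLM hidden states for code-correctness prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 768))   # last-token hidden states, one per generation
y = rng.integers(0, 2, size=500)  # 1 = generated code passed its tests

H_tr, H_te, y_tr, y_te = train_test_split(H, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print("probe accuracy:", probe.score(H_te, y_te))  # ~0.5 on random stubs
```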

[LG-7] Longitudinal Missing Data Imputation for Predicting Disability Stage of Patients with Multiple Sclerosis

链接: https://arxiv.org/abs/2501.12927
作者: Mahin Vazifehdan,Pietro Bosoni,Daniele Pala,Eleonora Tavazzi,Roberto Bergamaschi,Riccardo Bellazzi,Arianna Dagliati
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 tables

点击查看摘要

Abstract:Multiple Sclerosis (MS) is a chronic disease characterized by progressive or alternate impairment of neurological functions (motor, sensory, visual, and cognitive). Predicting disease progression with a probabilistic and time-dependent approach might help in suggesting interventions that can delay the progression of the disease. However, extracting informative knowledge from irregularly collected longitudinal data is difficult, and missing data pose significant challenges. MS progression is measured through the Expanded Disability Status Scale (EDSS), which quantifies and monitors disability in MS over time. EDSS assesses impairment in eight functional systems (FS). Frequently, only the EDSS score assigned by clinicians is reported, while FS sub-scores are missing. Imputing these scores might be useful, especially to stratify patients according to their phenotype assessed over the disease progression. This study aimed at i) exploring different methodologies for imputing missing FS sub-scores, and ii) predicting the EDSS score using complete clinical data. Results show that Exponential Weighted Moving Average achieved the lowest error rate in the missing data imputation task; furthermore, the combination of Classification and Regression Trees for the imputation and SVM for the prediction task obtained the best accuracy.
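
Since Exponential Weighted Moving Average imputation performed best, here is a hedged pandas sketch of how an FS sub-score series might be imputed with an EWMA of previously observed visits; the column names and the halflife are assumptions, not the study's exact configuration.

```python
# EWMA imputation sketch for a longitudinal functional-system sub-score.
import numpy as np
import pandas as pd

visits = pd.DataFrame({
    "patient":  [1] * 6,
    "visit":    range(6),
    "fs_motor": [1.0, np.nan, 2.0, np.nan, np.nan, 3.0],
})

def ewma_impute(s, halflife=2):
    # Fill each gap with the EWMA of the values observed so far.
    est = s.ewm(halflife=halflife, ignore_na=True).mean()
    return s.fillna(est)

visits["fs_motor_imputed"] = (
    visits.groupby("patient")["fs_motor"].transform(ewma_impute)
)
print(visits)
```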

[LG-8] Contrastive Language-Structure Pre-training Driven by Materials Science Literature

链接: https://arxiv.org/abs/2501.12919
作者: Yuta Suzuki,Tatsunori Taniai,Ryo Igarashi,Kotaro Saito,Naoya Chiba,Yoshitaka Ushiku,Kanta Ono
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:Understanding structure-property relationships is an essential yet challenging aspect of materials discovery and development. To facilitate this process, recent studies in materials informatics have sought latent embedding spaces of crystal structures to capture their similarities based on properties and functionalities. However, abstract feature-based embedding spaces are human-unfriendly and prevent intuitive and efficient exploration of the vast materials space. Here we introduce Contrastive Language–Structure Pre-training (CLaSP), a learning paradigm for constructing crossmodal embedding spaces between crystal structures and texts. CLaSP aims to achieve material embeddings that 1) capture property- and functionality-related similarities between crystal structures and 2) allow intuitive retrieval of materials via user-provided description texts as queries. To compensate for the lack of sufficient datasets linking crystal structures with textual descriptions, CLaSP leverages a dataset of over 400,000 published crystal structures and corresponding publication records, including paper titles and abstracts, for training. We demonstrate the effectiveness of CLaSP through text-based crystal structure screening and embedding space visualization.
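
CLaSP's crossmodal training objective is, at a high level, a CLIP-style symmetric contrastive loss between structure and text embeddings. The sketch below shows that generic loss with stubbed embeddings; the actual encoders and training details are described in the paper.

```python
# Generic CLIP-style contrastive loss as a hedged stand-in for CLaSP's objective.
import torch
import torch.nn.functional as F

def clip_loss(struct_emb, text_emb, temperature=0.07):
    s = F.normalize(struct_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(len(s))         # i-th structure <-> i-th text
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

B, d = 32, 128
loss = clip_loss(torch.randn(B, d), torch.randn(B, d))  # stubbed embeddings
print(loss.item())
```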

[LG-9] A Selective Homomorphic Encryption Approach for Faster Privacy-Preserving Federated Learning

链接: https://arxiv.org/abs/2501.12911
作者: Abdulkadir Korkmaz,Praveen Rao
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 23 pages, 32 figures

点击查看摘要

Abstract:Federated learning is a machine learning method that supports training models on decentralized devices or servers, where each holds its local data, removing the need for data exchange. This approach is especially useful in healthcare, as it enables training on sensitive data without needing to share them. The nature of federated learning necessitates robust security precautions due to data leakage concerns during communication. To address this issue, we propose a new approach that employs selective encryption, homomorphic encryption, differential privacy, and bit-wise scrambling to minimize data leakage while achieving good execution performance. Our technique, FAS (fast and secure federated learning), is used to train deep learning models on medical imaging data. We implemented our technique using the Flower framework and compared with a state-of-the-art federated learning approach that also uses selective homomorphic encryption. Our experiments were run in a cluster of eleven physical machines to create a real-world federated learning scenario on different datasets. We observed that our approach is up to 90% faster than applying fully homomorphic encryption on the model weights. In addition, we can avoid the pretraining step that is required by our competitor and can save up to 20% in terms of total execution time. While our approach was faster, it obtained similar security results as the competitor.

[LG-10] Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi

链接: https://arxiv.org/abs/2501.12900
作者: Ella Koresh,Ronit D. Gross,Yuval Meir,Yarden Tzach,Tal Halevi,Ido Kanter
类目: Machine Learning (cs.LG)
*备注: 28 pages, 9 figures

点击查看摘要

Abstract:Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers, whereas vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. Both are designed to solve complex classification tasks but from different perspectives. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism, which quantitatively measures the single-nodal performance (SNP) of each node in feedforward (FF) and multi-head attention (MHA) subblocks. Each node identifies small clusters of possible output labels, with additional noise represented as labels outside these clusters. These features are progressively sharpened along the transformer encoders, enhancing the signal-to-noise ratio. This unified underlying learning mechanism leads to two main findings. First, it enables an efficient applied nodal diagonal connection (ANDC) pruning technique without affecting the accuracy. Second, based on the SNP, spontaneous symmetry breaking occurs among the MHA heads, such that each head focuses its attention on a subset of labels through cooperation among its SNPs. Consequently, each head becomes an expert in recognizing its designated labels, representing a quantitative MHA modus vivendi mechanism. These results are based on a compact convolutional transformer architecture trained on the CIFAR-100 and Flowers-102 datasets and call for their extension to other architectures and applications, such as natural language processing.

[LG-11] Irrational Complex Rotations Empower Low-bit Optimizers

链接: https://arxiv.org/abs/2501.12896
作者: Zhen Tian,Wayne Xin Zhao,Ji-Rong Wen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel optimizer state compression algorithm, namely $\pi$-Quant, which leverages the properties of irrational numbers (e.g., $\pi$) for memory-efficient training. The core idea is based on our mathematical findings, which show that a pair of parameters can be represented by a single rotation angle using the complex rotation scheme. Building on this insight, we map the parameters into a complex space and perform quantization using the corresponding rotation angles. To efficiently integrate it into optimization process, we develop an efficient system of geometric equations that computes the precise rotation angles with linear complexity. We evaluate $\pi$-Quant on a wide range of tasks. Our experiments show that it can reduce the bit-width of parameters to 3.32-bit, achieving a 75% reduction in parameter scale and a 40% decrease in GPU memory usage, all while maintaining full accuracy.
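
The core trick described in the abstract, representing a pair of parameters as a single rotation angle, can be illustrated as follows: map `(a, b)` to a complex number, quantize its angle on a uniform grid, and reconstruct. The grid and the handling of the magnitude are assumptions, not the paper's exact scheme.

```python
# Illustrative sketch of pair-to-angle encoding with angle quantization.
import numpy as np

def pair_to_angle(a, b):
    z = a + 1j * b
    return np.angle(z), np.abs(z)

def quantize_angle(theta, bits=3):
    levels = 2 ** bits
    step = 2 * np.pi / levels
    return np.round(theta / step) * step

a, b = 0.42, -0.17
theta, r = pair_to_angle(a, b)
theta_q = quantize_angle(theta)
a_hat, b_hat = r * np.cos(theta_q), r * np.sin(theta_q)
print((a, b), "->", (round(a_hat, 3), round(b_hat, 3)))
```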

[LG-12] Advanced deep architecture pruning using single filter performance

链接: https://arxiv.org/abs/2501.12880
作者: Yarden Tzach,Yuval Meir,Ronit D. Gross,Ofek Tevet,Ella Koresh,Ido Kanter
类目: Machine Learning (cs.LG)
*备注: 22 pages, 4 figures

点击查看摘要

Abstract:Pruning the parameters and structure of neural networks reduces the computational complexity, energy consumption, and latency during inference. Recently, a novel underlying mechanism for successful deep learning (DL) was presented, based on a method that quantitatively measures the single-filter performance in each layer of a DL architecture, yielding a new comprehensive picture of how deep learning works. Herein, we demonstrate how this understanding paves the path to highly dilute the convolutional layers of deep architectures without affecting their overall accuracy, using applied filter cluster connections (AFCC). AFCC is exemplified on VGG-11 and EfficientNet-B0 architectures trained on CIFAR-100, and its high pruning outperforms other techniques using the same pruning magnitude. Additionally, this technique is broadened to single-nodal performance and to heavy pruning of fully connected layers, suggesting a possible route to considerably reducing the complexity of over-parameterized AI tasks.

[LG-13] HierPromptLM: A Pure PLM-based Framework for Representation Learning on Heterogeneous Text-rich Networks

链接: https://arxiv.org/abs/2501.12857
作者: Qiuyu Zhu,Liang Zhang,Qianxiong Xu,Cheng Long
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Representation learning on heterogeneous text-rich networks (HTRNs), which consist of multiple types of nodes and edges with each node associated with textual information, is essential for various real-world applications. Given the success of pretrained language models (PLMs) in processing text data, recent efforts have focused on integrating PLMs into HTRN representation learning. These methods typically handle textual and structural information separately, using both PLMs and heterogeneous graph neural networks (HGNNs). However, this separation fails to capture the critical interactions between these two types of information within HTRNs. Additionally, it necessitates an extra alignment step, which is challenging due to the fundamental differences between distinct embedding spaces generated by PLMs and HGNNs. To deal with it, we propose HierPromptLM, a novel pure PLM-based framework that seamlessly models both text data and graph structures without the need for separate processing. Firstly, we develop a Hierarchical Prompt module that employs prompt learning to integrate text data and heterogeneous graph structures at both the node and edge levels, within a unified textual space. Building upon this foundation, we further introduce two innovative HTRN-tailored pretraining tasks to fine-tune PLMs for representation learning by emphasizing the inherent heterogeneity and interactions between textual and structural information within HTRNs. Extensive experiments on two real-world HTRN datasets demonstrate HierPromptLM outperforms state-of-the-art methods, achieving significant improvements of up to 6.08% for node classification and 10.84% for link prediction.

[LG-14] Data-and-Semantic Dual-Driven Spectrum Map Construction for 6G Spectrum Management

链接: https://arxiv.org/abs/2501.12853
作者: Jiayu Liu,Fuhui Zhou,Xiaodong Liu,Rui Ding,Lu Yuan,Qihui Wu
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted for presentation at the IEEE Global Communications Conference (GLOBECOM), Cape Town, South Africa, December 2024

点击查看摘要

Abstract:Spectrum maps reflect the utilization and distribution of spectrum resources in the electromagnetic environment, serving as an effective approach to support spectrum management. However, the construction of spectrum maps in urban environments is challenging because of high-density connection and complex terrain. Moreover, the existing spectrum map construction methods are typically applied to a fixed frequency, which cannot cover the entire frequency band. To address the aforementioned challenges, a UNet-based data-and-semantic dual-driven method is proposed by introducing the semantic knowledge of binary city maps and binary sampling location maps to enhance the accuracy of spectrum map construction in complex urban environments with dense communications. Moreover, a joint frequency-space reasoning model is exploited to capture the correlation of spectrum data in terms of space and frequency, enabling the realization of complete spectrum map construction without sampling all frequencies of spectrum data. The simulation results demonstrate that the proposed method can infer the spectrum utilization status of missing frequencies and improve the completeness of the spectrum map construction. Furthermore, the accuracy of spectrum map construction achieved by the proposed data-and-semantic dual-driven method outperforms the benchmark schemes, especially in scenarios with low sampling density.

[LG-15] Certified Guidance for Planning with Deep Generative Models AAMAS25

链接: https://arxiv.org/abs/2501.12815
作者: Francesco Giacomarra,Mehran Hosseini,Nicola Paoletti,Francesca Cairoli
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages, 2 figures, accepted at AAMAS 25 conference

点击查看摘要

Abstract:Deep generative models, such as generative adversarial networks and diffusion models, have recently emerged as powerful tools for planning tasks and behavior synthesis in autonomous systems. Various guidance strategies have been introduced to steer the generative process toward outputs that are more likely to satisfy the planning objectives. These strategies avoid the need for model retraining but do not provide any guarantee that the generated outputs will satisfy the desired planning objectives. To address this limitation, we introduce certified guidance, an approach that modifies a generative model, without retraining it, into a new model guaranteed to satisfy a given specification with probability one. We focus on Signal Temporal Logic specifications, which are rich enough to describe nontrivial planning tasks. Our approach leverages neural network verification techniques to systematically explore the latent spaces of the generative models, identifying latent regions that are certifiably correct with respect to the STL property of interest. We evaluate the effectiveness of our method on four planning benchmarks using GANs and diffusion models. Our results confirm that certified guidance produces generative models that are always correct, unlike existing guidance methods that are not certified.

[LG-16] Hybrid Losses for Hierarchical Embedding Learning ICASSP2025

链接: https://arxiv.org/abs/2501.12796
作者: Haokun Tian,Stefan Lattner,Brian McFee,Charalampos Saitis
类目: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to ICASSP 2025

点击查看摘要

Abstract:In traditional supervised learning, the cross-entropy loss treats all incorrect predictions equally, ignoring the relevance or proximity of wrong labels to the correct answer. By leveraging a tree hierarchy for fine-grained labels, we investigate hybrid losses, such as generalised triplet and cross-entropy losses, to enforce similarity between labels within a multi-task learning framework. We propose metrics to evaluate the embedding space structure and assess the model’s ability to generalise to unseen classes, that is, to infer similar classes for data belonging to unseen categories. Our experiments on OrchideaSOL, a four-level hierarchical instrument sound dataset with nearly 200 detailed categories, demonstrate that the proposed hybrid losses outperform previous works in classification, retrieval, embedding space structure, and generalisation.
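
A hedged PyTorch sketch of a hybrid objective in this spirit: cross-entropy plus a triplet term whose positives and negatives are chosen by a one-level label hierarchy (`parent`). The mining rule and the weighting `w` are illustrative assumptions, not the paper's exact recipe.

```python
# Hybrid cross-entropy + hierarchy-aware triplet loss, as a hedged sketch.
import torch
import torch.nn.functional as F

def hybrid_loss(emb, logits, labels, parent, margin=0.2, w=0.5):
    ce = F.cross_entropy(logits, labels)
    # Positive: same parent class in the hierarchy; negative: different parent.
    d = torch.cdist(emb, emb)
    same_parent = parent[labels][:, None] == parent[labels][None, :]
    pos = (d * same_parent).max(dim=1).values         # hardest in-family example
    neg = (d + 1e9 * same_parent).min(dim=1).values   # hardest out-of-family example
    triplet = F.relu(pos - neg + margin).mean()
    return ce + w * triplet

B, d, n_cls = 16, 32, 8
parent = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])  # 2 coarse families of 4 classes
emb, logits = torch.randn(B, d), torch.randn(B, n_cls)
labels = torch.randint(0, n_cls, (B,))
print(hybrid_loss(emb, logits, labels, parent).item())
```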

[LG-17] Non-adaptive Learning of Random Hypergraphs with Queries

链接: https://arxiv.org/abs/2501.12771
作者: Bethany Austhof,Lev Reyzin,Erasmo Tani
类目: Information Theory (cs.IT); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of learning a hidden hypergraph $G=(V,E)$ by making a single batch of queries (non-adaptively). We consider the hyperedge detection model, in which every query must be of the form: ``Does this set $S\subseteq V$ contain at least one full hyperedge?‘’ In this model, it is known that no algorithm can non-adaptively learn arbitrary hypergraphs by making fewer than $\Omega(\min\{m^2\log n, n^2\})$ queries, even when the hypergraph is constrained to be 2-uniform (i.e., the hypergraph is simply a graph). Recently, Li et al. overcame this lower bound in the setting in which $G$ is a graph by assuming that the graph learned is sampled from an Erdős-Rényi model. We generalize the result of Li et al. to the setting of random $k$-uniform hypergraphs. To achieve this result, we leverage a novel equivalence between the problem of learning a single hyperedge and the standard group testing problem. This latter result may also be of independent interest.

[LG-18] Multiscale Training of Convolutional Neural Networks

链接: https://arxiv.org/abs/2501.12739
作者: Niloufar Zakariaei,Shadab Ahamed,Eldad Haber,Moshe Eliasof
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) are the backbone of many deep learning methods, but optimizing them remains computationally expensive. To address this, we explore multiscale training frameworks and mathematically identify key challenges, particularly when dealing with noisy inputs. Our analysis reveals that in the presence of noise, the gradient of standard CNNs in multiscale training may fail to converge as the mesh size approaches zero, undermining the optimization process. This insight drives the development of Mesh-Free Convolutions (MFCs), which are independent of input scale and avoid the pitfalls of traditional convolution kernels. We demonstrate that MFCs, with their robust gradient behavior, ensure convergence even with noisy inputs, enabling more efficient neural network optimization in multiscale settings. To validate the generality and effectiveness of our multiscale training approach, we show that (i) MFCs can theoretically deliver substantial computational speedups without sacrificing performance in practice, and (ii) standard convolutions benefit from our multiscale training framework in practice.

[LG-19] Stability and Generalization of Quantum Neural Networks

链接: https://arxiv.org/abs/2501.12737
作者: Jiaqi Yang,Wei Xie,Xiaohua Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Quantum neural networks (QNNs) play an important role as an emerging technology in the rapidly growing field of quantum machine learning. While their empirical success is evident, the theoretical explorations of QNNs, particularly their generalization properties, are less developed and primarily focus on the uniform convergence approach. In this paper, we exploit an advanced tool in statistical learning theory, i.e., algorithmic stability, to study the generalization of QNNs. We first establish high-probability generalization bounds for QNNs via uniform stability. Our bounds shed light on the key factors influencing the generalization performance of QNNs and provide practical insights into both the design and training processes. We next explore the generalization of QNNs on near-term noisy intermediate-scale quantum (NISQ) devices, highlighting the potential benefits of quantum noise. Moreover, we argue that previous analysis characterizes worst-case generalization guarantees, and we establish a refined optimization-dependent generalization bound for QNNs via on-average stability. Numerical experiments on various real-world datasets support our theoretical findings.

[LG-20] Online Preference Alignment for Language Models via Count-based Exploration ICLR2025

链接: https://arxiv.org/abs/2501.12735
作者: Chenjia Bai,Yang Zhang,Shuang Qiu,Qiaosheng Zhang,Kang Xu,Xuelong Li
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR 2025

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage, and the resulting reward model is hard to generalize in out-of-distribution responses. Thus, online RLHF is more desirable to empower the LLM to explore outside the support of the initial dataset by iteratively collecting the prompt-response pairs. In this paper, we study the fundamental problem in online RLHF, i.e., how to explore for LLMs. We give a theoretical motivation in linear reward assumption to show that an optimistic reward with an upper confidence bound (UCB) term leads to a provably efficient RLHF policy. Then, we reformulate our objective to direct preference optimization with an exploration term, where the UCB-term can be converted to a count-based exploration bonus. We further propose a practical algorithm, named Count-based Online Preference Optimization (COPO), which leverages a simple coin-flip counting module to estimate the pseudo-count of a prompt-response pair in previously collected data. COPO encourages LLMs to balance exploration and preference optimization in an iterative manner, which enlarges the exploration space and the entire data coverage of iterative LLM policies. We conduct online RLHF experiments on Zephyr and Llama-3 models. The results on instruction-following and standard academic benchmarks show that COPO significantly increases performance.
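
Per the abstract's description, the coin-flip counting module can be approximated by hashing prompt-response embeddings with random sign projections and granting a UCB-style bonus that decays with the visit count. The sketch below shows only this counting mechanism; its integration into the preference-optimization objective is omitted.

```python
# Hedged sketch of a count-based exploration bonus via random sign hashing.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
proj = rng.choice([-1.0, 1.0], size=(16, 128))  # fixed random sign ("coin-flip") matrix
counts = Counter()

def count_bonus(embedding, beta=1.0):
    # 16-bit signature of the (prompt, response) embedding.
    key = tuple((proj @ embedding > 0).astype(int))
    counts[key] += 1
    return beta / np.sqrt(counts[key])          # UCB-style 1/sqrt(N) bonus

for _ in range(5):
    e = rng.normal(size=128)
    print(round(count_bonus(e), 3))             # rare pairs get larger bonuses
```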

[LG-21] GRAMA: Adaptive Graph Autoregressive Moving Average Models

链接: https://arxiv.org/abs/2501.12732
作者: Moshe Eliasof,Alessio Gravina,Andrea Ceni,Claudio Gallicchio,Davide Bacciu,Carola-Bibiane Schönlieb
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph State Space Models (SSMs) have recently been introduced to enhance Graph Neural Networks (GNNs) in modeling long-range interactions. Despite their success, existing methods either compromise on permutation equivariance or limit their focus to pairwise interactions rather than sequences. Building on the connection between Autoregressive Moving Average (ARMA) and SSM, in this paper, we introduce GRAMA, a Graph Adaptive method based on a learnable Autoregressive Moving Average (ARMA) framework that addresses these limitations. By transforming from static to sequential graph data, GRAMA leverages the strengths of the ARMA framework, while preserving permutation equivariance. Moreover, GRAMA incorporates a selective attention mechanism for dynamic learning of ARMA coefficients, enabling efficient and flexible long-range information propagation. We also establish theoretical connections between GRAMA and Selective SSMs, providing insights into its ability to capture long-range dependencies. Extensive experiments on 14 synthetic and real-world datasets demonstrate that GRAMA consistently outperforms backbone models and performs competitively with state-of-the-art methods.
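
To see what an ARMA recurrence over graph data looks like, here is a hedged PyTorch sketch where the autoregressive terms aggregate over neighbors and the moving-average terms mix past inputs; GRAMA's learnable, selectively attended coefficients are replaced by fixed scalars.

```python
# Illustrative ARMA-style recurrence over sequential graph node features.
import torch

def arma_graph_step(h_hist, x_hist, A_norm, phi, theta):
    """h_hist/x_hist: lists of (N, d) tensors, most recent last."""
    ar = sum(p * (A_norm @ h) for p, h in zip(phi, reversed(h_hist)))
    ma = sum(t * x for t, x in zip(theta, reversed(x_hist)))
    return ar + ma

N, d = 6, 4
A = torch.rand(N, N)
A_norm = A / A.sum(dim=1, keepdim=True)          # row-normalized adjacency
x_hist = [torch.randn(N, d) for _ in range(2)]
h_hist = [torch.zeros(N, d) for _ in range(2)]
for step in range(3):
    h = arma_graph_step(h_hist, x_hist, A_norm, phi=(0.5, 0.2), theta=(0.8, 0.1))
    h_hist = h_hist[1:] + [h]
    x_hist = x_hist[1:] + [torch.randn(N, d)]    # next graph "frame"
print(h.shape)  # torch.Size([6, 4])
```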

[LG-22] Anomaly Detection in Double-entry Bookkeeping Data by Federated Learning System with Non-model Sharing Approach

链接: https://arxiv.org/abs/2501.12723
作者: Sota Mashiko,Yuji Kawamata,Tomoru Nakayama,Tetsuya Sakurai,Yukihiko Okada
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection is crucial in financial auditing, and effective detection often requires obtaining large volumes of data from multiple organizations. However, confidentiality concerns hinder data sharing among audit firms. Although the federated learning (FL)-based approach, FedAvg, has been proposed to address this challenge, its use of multiple communication rounds increases its overhead, limiting its practicality. In this study, we propose a novel framework employing Data Collaboration (DC) analysis – a non-model-sharing FL method – to streamline model training into a single communication round. Our method first encodes journal entry data via dimensionality reduction to obtain secure intermediate representations, then transforms them into collaboration representations for building an autoencoder that detects anomalies. We evaluate our approach on a synthetic dataset and real journal entry data from multiple organizations. The results show that our method not only outperforms single-organization baselines but also exceeds FedAvg in non-i.i.d. experiments on real journal entry data that closely mirror real-world conditions. By preserving data confidentiality and reducing iterative communication, this study addresses a key auditing challenge – ensuring data confidentiality while integrating knowledge from multiple audit firms. Our findings represent a significant advance in artificial intelligence-driven auditing and underscore the potential of FL methods in high-security domains.

[LG-23] REX: Causal Discovery based on Machine Learning and Explainability techniques

链接: https://arxiv.org/abs/2501.12706
作者: Jesus Renero,Idoia Ochoa,Roberto Maestre
类目: Machine Learning (cs.LG)
*备注: 22 pages, 30 figures, Submitted to Elsevier’s Pattern Recognition

点击查看摘要

Abstract:Explainability techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce REX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables. Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that REX outperforms state-of-the-art causal discovery methods across diverse data generation processes, including non-linear and additive noise models. Moreover, REX was tested on the Sachs single-cell protein-signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taken together, these results showcase REX’s effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real-world problems. By combining ML and explainability techniques with causal discovery, REX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures. REX is publicly available at this https URL.
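
The pipeline can be caricatured in a few lines: fit a predictive model for each variable and score candidate edges by the mean absolute Shapley values of the remaining variables. This hedged sketch uses the `shap` library on a two-variable toy system; REX's thresholding and orientation logic are omitted.

```python
# Hedged sketch of SHAP-based edge scoring for causal discovery.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.1, size=500)   # ground truth: x -> y
data = np.column_stack([x, y])
names = ["x", "y"]

for j, target in enumerate(names):
    X = np.delete(data, j, axis=1)            # all variables except the target
    model = GradientBoostingRegressor().fit(X, data[:, j])
    sv = shap.TreeExplainer(model).shap_values(X)
    score = np.abs(sv).mean(axis=0)           # mean |SHAP| per candidate parent
    sources = [n for i, n in enumerate(names) if i != j]
    for k, src in enumerate(sources):
        print(f"candidate edge {src} -> {target}: importance {score[k]:.3f}")
```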

[LG-24] EchoLM: Accelerating LLM Serving with Real-time Knowledge Distillation

链接: https://arxiv.org/abs/2501.12689
作者: Yifan Yu,Yu Gan,Lily Tasi,Nikhil Sarda,Jiaming Shen,Yanqi Zhou,Arvind Krishnamurthy,Fan Lai,Henry M. Levy,David Culler
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have excelled in various applications, yet serving them at scale is challenging due to their substantial resource demands and high latency. Our real-world studies reveal that over 60% of user requests to LLMs have semantically similar counterparts, suggesting the potential for knowledge sharing among requests. However, naively caching and reusing past responses leads to large quality degradation. In this paper, we introduce EchoLM, an in-context caching system that leverages historical requests as examples to guide response generation, enabling selective offloading of requests to more efficient LLMs. However, enabling this real-time knowledge transfer leads to intricate tradeoffs between response quality, latency, and system throughput at scale. For a new request, EchoLM identifies similar, high-utility examples and efficiently prepends them to the input for better response. At scale, EchoLM adaptively routes requests to LLMs of varying capabilities, accounting for response quality and serving loads. EchoLM employs a cost-aware cache replay mechanism to improve example quality and coverage offline, maximizing cache utility and runtime efficiency. Evaluations on millions of open-source requests demonstrate that EchoLM has a throughput improvement of 1.4-5.9x while reducing latency by 28-71% without hurting response quality on average.
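
The sketch below illustrates the in-context caching idea with a stand-in hashed-bag-of-words embedding; a real deployment would use a learned sentence embedding and route high-similarity hits to a cheaper model. All names and the 0.8 threshold are hypothetical.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: hashed bag of words (a real system would use a
    # sentence-embedding model).
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

history = []  # list of (embedding, request, response)

def serve(request: str, llm, threshold: float = 0.8) -> str:
    q = embed(request)
    best, best_sim = None, -1.0
    for e, req, resp in history:
        sim = float(q @ e)  # cosine similarity (embeddings are unit norm)
        if sim > best_sim:
            best, best_sim = (req, resp), sim
    if best is not None and best_sim >= threshold:
        # Prepend the similar, high-utility example; a cheaper model can
        # now be routed to, guided by the retrieved response.
        prompt = f"Example:\nQ: {best[0]}\nA: {best[1]}\n\nQ: {request}\nA:"
    else:
        prompt = request
    response = llm(prompt)
    history.append((q, request, response))
    return response

print(serve("What is the capital of France?", llm=lambda p: "Paris"))
```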

[LG-25] Manifold learning and optimization using tangent space proxies

链接: https://arxiv.org/abs/2501.12678
作者: Ryan A. Robinett,Lorenzo Orecchia,Samantha J. Riesenfeld
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 37 pages, 9 figures

点击查看摘要

Abstract:We present a framework for efficiently approximating differential-geometric primitives on arbitrary manifolds via construction of an atlas graph representation, which leverages the canonical characterization of a manifold as a finite collection, or atlas, of overlapping coordinate charts. We first show the utility of this framework in a setting where the manifold is expressed in closed form, specifically, a runtime advantage, compared with state-of-the-art approaches, for first-order optimization over the Grassmann manifold. Moreover, using point cloud data for which a complex manifold structure was previously established, i.e., high-contrast image patches, we show that an atlas graph with the correct geometry can be directly learned from the point cloud. Finally, we demonstrate that learning an atlas graph enables downstream key machine learning tasks. In particular, we implement a Riemannian generalization of support vector machines that uses the learned atlas graph to approximate complex differential-geometric primitives, including Riemannian logarithms and vector transports. These settings suggest the potential of this framework for even more complex settings, where ambient dimension and noise levels may be much higher.

[LG-26] Learning Versatile Optimizers on a Compute Diet

链接: https://arxiv.org/abs/2501.12670
作者: Abhinav Moudgil,Boris Knyazev,Guillaume Lajoie,Eugene Belilovsky
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learned optimization has emerged as a promising alternative to hand-crafted optimizers, with the potential to discover stronger learned update rules that enable faster, hyperparameter-free training of neural networks. A critical element for practically useful learned optimizers, which can be used off-the-shelf after meta-training, is strong meta-generalization: the ability to apply the optimizers to new tasks. Recent state-of-the-art work in learned optimizers, VeLO (Metz et al., 2022), requires a large number of highly diverse meta-training tasks along with massive computational resources, 4000 TPU months, to achieve meta-generalization. This makes further improvements to such learned optimizers impractical. In this work, we identify several key elements in learned optimizer architectures and meta-training procedures that can lead to strong meta-generalization. We also propose evaluation metrics to reliably assess quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers and also outperforms tuned state-of-the-art optimizers on a diverse set of out-of-distribution tasks, despite being meta-trained for just 24 GPU hours.

[LG-27] PPO-Based Vehicle Control for Ramp Merging Scheme Assisted by Enhanced C-V2X

链接: https://arxiv.org/abs/2501.12656
作者: Qiong Wu,Maoxin Ji,Pingyi Fan,Kezhi Wang,Nan Cheng,Wen Chen,Khaled B. Letaief
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: This paper has been submitted to IEEE Journal. The source code has been released at: this https URL

点击查看摘要

Abstract:On-ramp merging presents a critical challenge in autonomous driving, as vehicles from merging lanes need to dynamically adjust their positions and speeds while monitoring traffic on the main road to prevent collisions. To address this challenge, we propose a novel merging control scheme based on reinforcement learning, which integrates lateral control mechanisms. This approach ensures the smooth integration of vehicles from the merging lane onto the main road, optimizing both fuel efficiency and passenger comfort. Furthermore, we recognize the impact of vehicle-to-vehicle (V2V) communication on control strategies and introduce an enhanced protocol leveraging Cellular Vehicle-to-Everything (C-V2X) Mode 4. This protocol aims to reduce the Age of Information (AoI) and improve communication reliability. In our simulations, we employ two AoI-based metrics to rigorously assess the protocol’s effectiveness in autonomous driving scenarios. By combining the NS3 network simulator with Python, we simulate V2V communication and vehicle control simultaneously. The results demonstrate that the enhanced C-V2X Mode 4 outperforms the standard version, while the proposed control scheme ensures safe and reliable vehicle operation during on-ramp merging.

[LG-28] Current Opinions on Memristor-Accelerated Machine Learning Hardware

链接: https://arxiv.org/abs/2501.12644
作者: Mingrui Jiang,Yichun Xu,Zefan Li,Can Li
类目: Emerging Technologies (cs.ET); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Signal Processing (eess.SP); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

Abstract:The unprecedented advancement of artificial intelligence has placed immense demands on computing hardware, but traditional silicon-based semiconductor technologies are approaching their physical and economic limits, prompting the exploration of novel computing paradigms. Memristors offer a promising solution, enabling in-memory analog computation and massive parallelism, which leads to low latency and power consumption. This manuscript reviews the current status of memristor-based machine learning accelerators, highlighting the milestones achieved in developing prototype chips that not only accelerate neural network inference but also tackle other machine learning tasks. More importantly, it discusses our opinion on current key challenges that remain in this field, such as device variation, the need for efficient peripheral circuitry, and systematic co-design and optimization. We also share our perspective on potential future directions, some of which address existing challenges while others explore untouched territories. By addressing these challenges through interdisciplinary efforts spanning device engineering, circuit design, and systems architecture, memristor-based accelerators could significantly advance the capabilities of AI hardware, particularly for edge applications where power efficiency is paramount.

[LG-29] Deep Reinforcement Learning with Hybrid Intrinsic Reward Model

链接: https://arxiv.org/abs/2501.12627
作者: Mingqi Yuan,Bo Li,Xin Jin,Wenjun Zeng
类目: Machine Learning (cs.LG)
*备注: 18 pages, 14 figures

点击查看摘要

Abstract:Intrinsic reward shaping has emerged as a prevalent approach to solving hard-exploration and sparse-rewards environments in reinforcement learning (RL). While single intrinsic rewards, such as curiosity-driven or novelty-based methods, have shown effectiveness, they often limit the diversity and efficiency of exploration. Moreover, the potential and principle of combining multiple intrinsic rewards remains insufficiently explored. To address this gap, we introduce HIRE (Hybrid Intrinsic REward), a flexible and elegant framework for creating hybrid intrinsic rewards through deliberate fusion strategies. With HIRE, we conduct a systematic analysis of the application of hybrid intrinsic rewards in both general and unsupervised RL across multiple benchmarks. Extensive experiments demonstrate that HIRE can significantly enhance exploration efficiency and diversity, as well as skill acquisition in complex and dynamic settings.
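
A minimal sketch of one possible fusion strategy: a weighted sum of a count-based novelty bonus and a forward-model curiosity term. The specific signals, weights, and discretization are assumptions, not the paper's exact design.

```python
import numpy as np
from collections import defaultdict

class HybridIntrinsicReward:
    """Toy fusion of two intrinsic signals (weighted-sum strategy)."""

    def __init__(self, w_novelty=0.5, w_curiosity=0.5):
        self.counts = defaultdict(int)
        self.w = (w_novelty, w_curiosity)

    def __call__(self, state, predicted_next, actual_next):
        key = tuple(np.round(state, 1))  # discretize the state for counting
        self.counts[key] += 1
        novelty = 1.0 / np.sqrt(self.counts[key])                 # count-based bonus
        curiosity = float(np.sum((predicted_next - actual_next) ** 2))  # model error
        return self.w[0] * novelty + self.w[1] * curiosity

r_int = HybridIntrinsicReward()
s = np.array([0.2, -0.1])
print(r_int(s, predicted_next=s + 0.1, actual_next=s + 0.15))
```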

[LG-30] oward Model-centric Heterogeneous Federated Graph Learning: A Knowledge-driven Approach

链接: https://arxiv.org/abs/2501.12624
作者: Huilin Lai,Guang Zeng,Xunkai Li,Xudong Shen,Yinlin Zhu,Ye Luo,Jianwei Lu,Lei Zhu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated graph learning (FGL) has emerged as a promising paradigm for collaborative machine learning, enabling multiple parties to jointly train models while preserving the privacy of raw graph data. However, existing FGL methods often overlook the model-centric heterogeneous FGL (MHtFGL) problem, which arises in real-world applications, such as the aggregation of models from different companies with varying scales and architectures. MHtFGL presents an additional challenge: the diversity of client model architectures hampers common learning and integration of graph representations. To address this issue, we propose the Federated Graph Knowledge Collaboration (FedGKC) framework, comprising two key components: Client-side Self-Mutual Knowledge Distillation, which fosters effective knowledge sharing among clients through copilot models; and Server-side Knowledge-Aware Model Aggregation, which enhances model integration by accounting for the knowledge acquired by clients. Experiments on eight benchmark datasets demonstrate that FedGKC achieves an average accuracy improvement of 3.74% over baseline models in MHtFGL scenarios, while also maintaining excellent performance in homogeneous settings.
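
As a hedged illustration of the client-side self-mutual distillation component, the snippet below computes a symmetric KL loss between the softened predictions of a local model and a copilot model; the temperature and logits are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mutual_kd_loss(logits_a, logits_b, T=2.0):
    """Symmetric KL divergence between softened predictions of two models."""
    p, q = softmax(logits_a, T), softmax(logits_b, T)
    kl_pq = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    kl_qp = np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12)), axis=-1)
    return float(np.mean(kl_pq + kl_qp)) * T * T  # standard T^2 scaling

local_logits = np.array([[2.0, 0.5, -1.0]])    # local model on a node batch
copilot_logits = np.array([[1.5, 0.8, -0.5]])  # copilot model, same batch
print(mutual_kd_loss(local_logits, copilot_logits))
```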

[LG-31] Low-Dimensional Representation-Driven TSK Fuzzy System for Feature Selection

链接: https://arxiv.org/abs/2501.12607
作者: Qiong Liu,Mingjie Cai,Qingguo Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Feature selection can select important features to address the curse of dimensionality. Subspace learning, a widely used dimensionality reduction method, can project the original data into a low-dimensional space. However, the low-dimensional representation is often transformed back into the original space, resulting in information loss. Additionally, gate function-based methods in the Takagi-Sugeno-Kang fuzzy system (TSK-FS) are commonly less discriminative. To address these issues, this paper proposes a novel feature selection method that integrates subspace learning with TSK-FS. Specifically, a projection matrix is used to fit the intrinsic low-dimensional representation. Subsequently, the low-dimensional representation is fed to TSK-FS to measure its availability. The firing strength is relaxed so that TSK-FS is not limited by numerical underflow. Finally, the $\ell_{2,1}$-norm is introduced to select significant features, and the connection to related works is discussed. The proposed method is evaluated against six state-of-the-art methods on eighteen datasets, and the results demonstrate the superiority of the proposed method.
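
The following sketch shows the $\ell_{2,1}$ mechanism in isolation: a linear projection is learned with row-wise (feature-wise) shrinkage via proximal gradient, and features are ranked by the row norms of the learned matrix. It omits the TSK-FS component and uses assumed hyperparameters.

```python
import numpy as np

def l21_feature_scores(X, Y, lam=0.1, lr=0.01, iters=500):
    """Learn W minimizing ||XW - Y||_F^2 / n + lam * ||W||_{2,1} by proximal
    gradient; the row norms of W rank feature importance."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    for _ in range(iters):
        grad = 2 * X.T @ (X @ W - Y) / n
        W -= lr * grad
        # Proximal step for the l2,1 norm: shrink whole rows toward zero.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W *= np.maximum(0, 1 - lr * lam / (norms + 1e-12))
    return np.linalg.norm(W, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = X[:, :2] @ rng.normal(size=(2, 3))  # only the first 2 features matter
print(np.argsort(l21_feature_scores(X, Y))[::-1][:2])  # -> features 0 and 1
```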

[LG-32] On Accelerating Deep Neural Network Mutation Analysis by Neuron and Mutant Clustering

链接: https://arxiv.org/abs/2501.12598
作者: Lauren Lyons,Ali Ghanbari
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 18th IEEE International Conference on Software Testing, Verification and Validation (ICST) 2025

点击查看摘要

Abstract:Mutation analysis of deep neural networks (DNNs) is a promising method for effective evaluation of test data quality and model robustness, but it can be computationally expensive, especially for large models. To alleviate this, we present DEEPMAACC, a technique and a tool that speeds up DNN mutation analysis through neuron and mutant clustering. DEEPMAACC implements two methods: (1) neuron clustering to reduce the number of generated mutants and (2) mutant clustering to reduce the number of mutants to be tested by selecting representative mutants for testing. Both use hierarchical agglomerative clustering to group neurons and mutants with similar weights, with the goal of improving efficiency while maintaining mutation score. DEEPMAACC has been evaluated on 8 DNN models across 4 popular classification datasets and two DNN architectures. When compared to exhaustive, or vanilla, mutation analysis, the results provide empirical evidence that the neuron clustering approach, on average, accelerates mutation analysis by 69.77%, with an average -26.84% error in mutation score. Meanwhile, the mutant clustering approach, on average, accelerates mutation analysis by 35.31%, with an average 1.96% error in mutation score. Our results demonstrate that a trade-off can be made between mutation testing speed and mutation score error.
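
A minimal sketch of the neuron-clustering step, assuming hierarchical agglomerative clustering over a layer's incoming weight vectors (as in the paper) and one representative neuron per cluster; the layer size and cluster count are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))  # hypothetical layer: 64 neurons, 32 inputs each

# Group neurons with similar incoming weights; mutating one representative
# per cluster stands in for mutating every neuron in the cluster.
clustering = AgglomerativeClustering(n_clusters=8).fit(W)
representatives = [np.where(clustering.labels_ == c)[0][0] for c in range(8)]
print("mutate only neurons:", representatives)  # 8 mutants instead of 64
```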

[LG-33] Multi-Instance Partial-Label Learning with Margin Adjustment NEURIPS2024

链接: https://arxiv.org/abs/2501.12597
作者: Wei Tang,Yin-Fang Yang,Zhaofei Wang,Weijia Zhang,Min-Ling Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024. The code can be found at this https URL

点击查看摘要

Abstract:Multi-instance partial-label learning (MIPL) is an emerging learning framework where each training sample is represented as a multi-instance bag associated with a candidate label set. Existing MIPL algorithms often overlook the margins for attention scores and predicted probabilities, leading to suboptimal generalization performance. A critical issue with these algorithms is that the highest prediction probability of the classifier may appear on a non-candidate label. In this paper, we propose an algorithm named MIPLMA, i.e., Multi-Instance Partial-Label learning with Margin Adjustment, which adjusts the margins for attention scores and predicted probabilities. We introduce a margin-aware attention mechanism to dynamically adjust the margins for attention scores and propose a margin distribution loss to constrain the margins between the predicted probabilities on candidate and non-candidate label sets. Experimental results demonstrate the superior performance of MIPLMA over existing MIPL algorithms, as well as other well-established multi-instance learning algorithms and partial-label learning algorithms.
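
To make the margin idea concrete, here is a hedged sketch of a hinge-style penalty on the gap between the best candidate-label probability and the best non-candidate probability; MIPLMA's actual margin distribution loss and margin-aware attention mechanism are more involved.

```python
import numpy as np

def margin_loss(probs, candidate_mask, target_margin=0.3):
    """Penalize small margins between the best candidate-label probability
    and the best non-candidate probability (hinge on the margin)."""
    p_cand = np.where(candidate_mask, probs, -np.inf).max(axis=1)
    p_non = np.where(~candidate_mask, probs, -np.inf).max(axis=1)
    margins = p_cand - p_non
    return float(np.mean(np.maximum(0.0, target_margin - margins)))

probs = np.array([[0.5, 0.3, 0.2], [0.2, 0.1, 0.7]])
cand = np.array([[True, True, False], [True, False, False]])
print(margin_loss(probs, cand))  # sample 2 violates: its best label is non-candidate
```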

[LG-34] Generalization Performance of Hypergraph Neural Networks

链接: https://arxiv.org/abs/2501.12554
作者: Yifan Wang,Gonzalo R. Arce,Guangmo Tong
类目: Machine Learning (cs.LG)
*备注: The Web Conference 2025

点击查看摘要

Abstract:Hypergraph neural networks have been promising tools for handling learning tasks involving higher-order data, with notable applications in web graphs, such as modeling multi-way hyperlink structures and complex user interactions. Yet, their theoretical generalization abilities remain unclear. In this paper, we seek to develop margin-based generalization bounds for four representative classes of hypergraph neural networks, including convolutional-based methods (UniGCN), set-based aggregation (AllDeepSets), invariant and equivariant transformations (M-IGN), and tensor-based approaches (T-MPHN). Through the PAC-Bayes framework, our results reveal the manner in which hypergraph structure and spectral norms of the learned weights can affect the generalization bounds. The key technical challenge lies in developing a new perturbation analysis for hypergraph neural networks, which offers a rigorous understanding of how variations in the model’s weights and hypergraph structure impact its generalization behavior. Our empirical study examines the relationship between the practical performance and theoretical bounds of the models over synthetic and real-world datasets. One of our primary observations is the strong correlation between the theoretical bounds and empirical loss, with statistically significant consistency in most cases.

[LG-35] Federated Discrete Denoising Diffusion Model for Molecular Generation with OpenFL

链接: https://arxiv.org/abs/2501.12523
作者: Kevin Ta,Patrick Foley,Mattson Thieme,Abhishek Pandey,Prashant Shah
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Generating unique molecules with biochemically desired properties to serve as viable drug candidates is a difficult task that requires specialized domain expertise. In recent years, diffusion models have shown promising results in accelerating the drug design process through AI-driven molecular generation. However, training these models requires massive amounts of data, which are often isolated in proprietary silos. OpenFL is a federated learning framework that enables privacy-preserving collaborative training across these decentralized data sites. In this work, we present a federated discrete denoising diffusion model that was trained using OpenFL. The federated model achieves comparable performance with a model trained on centralized data when evaluating the uniqueness and validity of the generated molecules. This demonstrates the utility of federated learning in the drug design process. OpenFL is available at: this https URL.

[LG-36] Topology of Out-of-Distribution Examples in Deep Neural Networks

链接: https://arxiv.org/abs/2501.12522
作者: Esha Datta,Johanna Hennig,Eva Domschot,Connor Mattes,Michael R. Smith
类目: Machine Learning (cs.LG)
*备注: 24 pages, 13 figures

点击查看摘要

Abstract:As deep neural networks (DNNs) become increasingly common, concerns about their robustness do as well. A longstanding problem for deployed DNNs is their behavior in the face of unfamiliar inputs; specifically, these models tend to be overconfident and incorrect when encountering out-of-distribution (OOD) examples. In this work, we present a topological approach to characterizing OOD examples using latent layer embeddings from DNNs. Our goal is to identify topological features, referred to as landmarks, that indicate OOD examples. We conduct extensive experiments on benchmark datasets and a realistic DNN model, revealing a key insight for OOD detection. Well-trained DNNs have been shown to induce a topological simplification on training data for simple models and datasets; we show that this property holds for realistic, large-scale test and training data, but does not hold for OOD examples. More specifically, we find that the average lifetime (or persistence) of OOD examples is statistically longer than that of training or test examples. This indicates that DNNs struggle to induce topological simplification on unfamiliar inputs. Our empirical results provide novel evidence of topological simplification in realistic DNNs and lay the groundwork for topologically-informed OOD detection strategies.
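
A rough sketch of the lifetime statistic, assuming the ripser package for persistent homology: average the finite persistence of H0/H1 features over a point cloud of latent embeddings, and compare in-distribution versus OOD inputs. The synthetic embeddings are stand-ins for real latent-layer activations.

```python
import numpy as np
from ripser import ripser  # pip install ripser

def avg_lifetime(points: np.ndarray) -> float:
    """Average finite persistence (death - birth) over H0/H1 features."""
    dgms = ripser(points, maxdim=1)["dgms"]
    lifetimes = np.concatenate([d[:, 1] - d[:, 0] for d in dgms])
    return float(np.mean(lifetimes[np.isfinite(lifetimes)]))

rng = np.random.default_rng(0)
in_dist = rng.normal(size=(100, 8))                    # stand-in latent embeddings
ood = rng.normal(loc=3.0, scale=2.0, size=(100, 8))    # stand-in OOD embeddings
print("ID :", avg_lifetime(in_dist))
print("OOD:", avg_lifetime(ood))  # expected to be larger, per the paper's finding
```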

[LG-37] Robustness of Selected Learning Models under Label-Flipping Attack

链接: https://arxiv.org/abs/2501.12516
作者: Sarvagya Bhargava,Mark Stamp
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:In this paper we compare traditional machine learning and deep learning models trained on a malware dataset when subjected to an adversarial attack based on label-flipping. Specifically, we investigate the robustness of Support Vector Machines (SVM), Random Forest, Gaussian Naive Bayes (GNB), Gradient Boosting Machine (GBM), LightGBM, XGBoost, Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), MobileNet, and DenseNet models when facing varying percentages of misleading labels. We empirically assess the accuracy of each of these models under such an adversarial attack on the training data. This research aims to provide insights into which models are inherently more robust, in the sense of being better able to resist intentional disruptions to the training data. We find wide variation in the robustness of the models tested to adversarial attack, with our MLP model achieving the best combination of initial accuracy and robustness.
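
The experimental setup is easy to reproduce in miniature; the sketch below flips a fraction of binary training labels and measures clean test accuracy of an MLP (the paper's dataset, model zoo, and flip strategy differ).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for frac in [0.0, 0.1, 0.3, 0.5]:
    y_poison = y_tr.copy()
    idx = rng.choice(len(y_tr), size=int(frac * len(y_tr)), replace=False)
    y_poison[idx] = 1 - y_poison[idx]  # flip the selected binary labels
    clf = MLPClassifier(max_iter=500, random_state=0).fit(X_tr, y_poison)
    print(f"flip {frac:.0%}: test accuracy {clf.score(X_te, y_te):.3f}")
```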

[LG-38] Sequence Spreading-Based Semantic Communication Under High RF Interference

链接: https://arxiv.org/abs/2501.12502
作者: Hazem Barka,Georges Kaddoum,Mehdi Bennis,Md Sahabul Alam,Minh Au
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Accepted in IEEE International Conference on Communications

点击查看摘要

Abstract:In the evolving landscape of wireless communications, semantic communication (SemCom) has recently emerged as a 6G enabler that prioritizes the transmission of meaning and contextual relevance over conventional bit-centric metrics. However, the deployment of SemCom systems in industrial settings presents considerable challenges, such as high radio frequency interference (RFI), which can adversely affect system performance. To address this problem, in this work, we propose a novel approach based on integrating sequence spreading techniques with SemCom to enhance system robustness against such adverse conditions and enable scalable multi-user (MU) SemCom. In addition, we propose a novel signal refining network (SRN) to refine the received signal after despreading and equalization. The proposed network eliminates the need for computationally intensive end-to-end (E2E) training while improving performance metrics, achieving a 25% gain in BLEU score and a 12% increase in semantic similarity compared to E2E training using the same bandwidth.

[LG-39] Identification of Nonparametric Dynamic Causal Structure and Latent Process in Climate System

链接: https://arxiv.org/abs/2501.12500
作者: Minghao Fu,Biwei Huang,Zijian Li,Yujia Zheng,Ignavier Ng,Yingyao Hu,Kun Zhang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The study of learning causal structure with latent variables has advanced the understanding of the world by uncovering causal relationships and latent factors, e.g., Causal Representation Learning (CRL). However, in real-world scenarios, such as those in climate systems, causal relationships are often nonparametric, dynamic, and exist among both observed variables and latent variables. These challenges motivate us to consider a general setting in which causal relations are nonparametric and unrestricted in their occurrence, which is unconventional to current methods. To solve this problem, with the aid of 3-measurement in temporal structure, we theoretically show that both latent variables and processes can be identified up to minor indeterminacy under mild assumptions. Moreover, we tackle the general nonlinear Causal Discovery (CD) from observations, e.g., temperature, as a specific task of learning independent representation, through the principle of functional equivalence. Based on these insights, we develop an estimation approach simultaneously recovering both the observed causal structure and latent causal process in a nontrivial manner. Simulation studies validate the theoretical foundations and demonstrate the effectiveness of the proposed methodology. In the experiments involving climate data, this approach offers a powerful and in-depth understanding of the climate system.

[LG-40] The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution

链接: https://arxiv.org/abs/2501.12407
作者: Frank Sifei Luan,Ziming Mao,Ron Yifeng Wang,Charlotte Lin,Amog Kamsetty,Hao Chen,Cheng Su,Balaji Veeramani,Scott Lee,SangBin Cho,Clark Zinzow,Eric Liang,Ion Stoica,Stephanie Wang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While ML model training and inference are both GPU-intensive, CPU-based data processing is often the bottleneck. Distributed data processing systems based on the batch or stream processing models assume homogeneous resource requirements. They excel at CPU-based computation but either under-utilize heterogeneous resources or impose high overheads on failure and reconfiguration. We introduce the streaming batch model, a hybrid of the two models that enables efficient and fault-tolerant heterogeneous execution. The key idea is to execute one partition at a time to allow lineage-based recovery with dynamic resource allocation. This enables memory-efficient pipelining across heterogeneous resources, similar to stream processing, but also offers the elasticity and fault tolerance properties of batch processing. We present Ray Data, an implementation of the streaming batch model that improves throughput on heterogeneous batch inference pipelines by 3–8× compared to traditional batch and stream processing systems. When training Stable Diffusion, Ray Data matches the throughput of single-node ML data loaders while additionally leveraging distributed heterogeneous clusters to further improve training throughput by 31%.
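
The streaming batch idea, processing one partition at a time so stages pipeline while memory stays bounded, can be illustrated with plain Python generators (Ray Data is the actual implementation; the timings here merely simulate work):

```python
import time

def read_partitions(n_parts):          # stage 1: I/O
    for i in range(n_parts):
        yield list(range(i * 4, i * 4 + 4))

def preprocess(parts):                 # stage 2: CPU-bound transform
    for batch in parts:
        time.sleep(0.01)               # simulate CPU work
        yield [x * 2 for x in batch]

def infer(parts):                      # stage 3: accelerator-bound step
    for batch in parts:
        time.sleep(0.01)               # simulate GPU work
        yield sum(batch)

# Because each stage pulls one partition at a time, memory stays bounded and
# a failed partition can be recomputed from its lineage (re-run its stages).
for result in infer(preprocess(read_partitions(5))):
    print(result)
```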

[LG-41] The ELEVATE-AI LLMs Framework: An Evaluation Framework for Use of Large Language Models in HEOR: an ISPOR Working Group Report

链接: https://arxiv.org/abs/2501.12394
作者: Rachael L. Fleurence,Dalia Dawoud,Jiang Bian,Mitchell K. Higashi,Xiaoyan Wang,Hua Xu,Jagpreet Chhatwal,Turgay Ayer
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 4 Tables, 1 Figure, Supplemental Material

点击查看摘要

Abstract:Introduction. Generative Artificial Intelligence, particularly large language models (LLMs), offers transformative potential for Health Economics and Outcomes Research (HEOR). However, evaluating the quality, transparency, and rigor of LLM-assisted research lacks standardized guidance. This article introduces the ELEVATE AI LLMs framework and checklist, designed to support researchers and reviewers in assessing LLM use in HEOR. Methods. The ELEVATE AI LLMs framework was developed through a targeted review of existing guidelines and evaluation frameworks. The framework comprises ten evaluation domains, including model characteristics, accuracy, comprehensiveness, and fairness. The accompanying checklist operationalizes the framework. To validate the framework, we applied it to two published studies, demonstrating its usability across different HEOR tasks. Results. The ELEVATE AI LLMs framework provides a comprehensive structure for evaluating LLM-assisted research, while the checklist facilitates practical application. Validation of the framework and checklist on studies of systematic literature reviews and health economic modeling highlighted their ability to identify strengths and gaps in reporting. Limitations. While the ELEVATE AI LLMs framework provides robust guidance, its broader generalizability and applicability to diverse HEOR tasks require further empirical testing. Additionally, several metrics adapted from computer science need further validation in HEOR contexts. Conclusion. The ELEVATE AI LLMs framework and checklist fill a critical gap in HEOR by offering structured guidance for evaluating LLM-assisted research. By promoting transparency, accuracy, and reproducibility, they aim to standardize and improve the integration of LLMs into HEOR, ensuring their outputs meet the field’s rigorous standards.

[LG-42] Low-dimensional adaptation of diffusion models: Convergence in total variation

链接: https://arxiv.org/abs/2501.12982
作者: Jiadong Liang,Zhihan Huang,Yuxin Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates how diffusion generative models leverage (unknown) low-dimensional structure to accelerate sampling. Focusing on two mainstream samplers – the denoising diffusion implicit model (DDIM) and the denoising diffusion probabilistic model (DDPM) – and assuming accurate score estimates, we prove that their iteration complexities are no greater than the order of $k/\varepsilon$ (up to some log factor), where $\varepsilon$ is the precision in total variation distance and $k$ is some intrinsic dimension of the target distribution. Our results are applicable to a broad family of target distributions without requiring smoothness or log-concavity assumptions. Further, we develop a lower bound that suggests the (near) necessity of the coefficients introduced by Ho et al.(2020) and Song et al.(2020) in facilitating low-dimensional adaptation. Our findings provide the first rigorous evidence for the adaptivity of the DDIM-type samplers to unknown low-dimensional structure, and improve over the state-of-the-art DDPM theory regarding total variation convergence.

[LG-43] Fixed-Budget Change Point Identification in Piecewise Constant Bandits

链接: https://arxiv.org/abs/2501.12957
作者: Joseph Lazzaro,Ciara Pike-Burke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 44 pages, 7 figures

点击查看摘要

Abstract:We study the piecewise constant bandit problem where the expected reward is a piecewise constant function with one change point (discontinuity) across the action space [0,1] and the learner’s aim is to locate the change point. Under the assumption of a fixed exploration budget, we provide the first non-asymptotic analysis of policies designed to locate abrupt changes in the mean reward function under bandit feedback. We study the problem under a large and small budget regime, and for both settings establish lower bounds on the error probability and provide algorithms with near matching upper bounds. Interestingly, our results show a separation in the complexity of the two regimes. We then propose a regime adaptive algorithm which is near optimal for both small and large budgets simultaneously. We complement our theoretical analysis with experimental results in simulated environments to support our findings.

[LG-44] On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration

链接: https://arxiv.org/abs/2501.12785
作者: Yirui Zhou,Xiaowei Liu,Xiaofeng Zhang,Yangchun Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper tackles the efficiency and stability issues in learning from observations (LfO). We commence by investigating how reward functions and policies generalize in LfO. Subsequently, the built-in reinforcement learning (RL) approach in generative adversarial imitation from observation (GAIfO) is replaced with distributional soft actor-critic (DSAC). This change results in a novel algorithm called Mimicking Observations through Distributional Update Learning with adequate Exploration (MODULE), which combines soft actor-critic’s superior efficiency with distributional RL’s robust stability.

[LG-45] Singular learning coefficients and efficiency in learning theory

链接: https://arxiv.org/abs/2501.12747
作者: Miki Aoyagi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Algebraic Geometry (math.AG); Statistics Theory (math.ST)
*备注: 12 pages

点击查看摘要

Abstract:Singular learning models with non-positive Fisher information matrices include neural networks, reduced-rank regression, Boltzmann machines, normal mixture models, and others. These models have been widely used in the development of learning machines. However, theoretical analysis is still in its early stages. In this paper, we examine learning coefficients, which indicate the general learning efficiency of deep linear learning models and three-layer neural network models with ReLU units. Finally, we extend the results to include the case of the Softmax function.

[LG-46] The Marginal Importance of Distortions and Alignment in CASSI systems

链接: https://arxiv.org/abs/2501.12705
作者: Léo Paillet(LAAS-PHOTO, LAAS-RIS, IRAP),Antoine Rouxel(LAAS-PHOTO),Hervé Carfantan(IRAP),Simon Lacroix(LAAS-RIS),Antoine Monmayrant(LAAS-PHOTO)
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:This paper introduces a differentiable ray-tracing based model that incorporates aberrations and distortions to render realistic coded hyperspectral acquisitions using Coded-Aperture Spectral Snapshot Imagers (CASSI). CASSI systems can now be optimized in order to fulfill simultaneously several optical design constraints as well as processing constraints. Four comparable CASSI systems with varying degree of optical aberrations have been designed and modeled. The resulting rendered hyperspectral acquisitions from each of these systems are combined with five state-of-the-art hyperspectral cube reconstruction processes. These reconstruction processes encompass a mapping function created from each system’s propagation model to account for distortions and aberrations during the reconstruction process. Our analyses show that if properly modeled, the effects of geometric distortions of the system and misalignments of the dispersive elements have a marginal impact on the overall quality of the reconstructed hyperspectral data cubes. Therefore, relaxing traditional constraints on measurement conformity and fidelity to the scene enables the development of novel imaging instruments, guided by performance metrics applied to the design or the processing of acquisitions. By providing a complete framework for design, simulation and evaluation, this work contributes to the optimization and exploration of new CASSI systems, and more generally to the computational imaging community.

[LG-47] Sequential Change Point Detection via Denoising Score Matching

链接: https://arxiv.org/abs/2501.12667
作者: Wenbin Zhou,Liyan Xie,Zhigang Peng,Shixiang Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequential change-point detection plays a critical role in numerous real-world applications, where timely identification of distributional shifts can greatly mitigate adverse outcomes. Classical methods commonly rely on parametric density assumptions of pre- and post-change distributions, limiting their effectiveness for high-dimensional, complex data streams. This paper proposes a score-based CUSUM change-point detection method, in which the score functions of the data distribution are estimated by injecting noise and applying denoising score matching. We consider both offline and online versions of score estimation. Through theoretical analysis, we demonstrate that denoising score matching can enhance detection power by effectively controlling the injected noise scale. Finally, we validate the practical efficacy of our method through numerical experiments on two synthetic datasets and a real-world earthquake precursor detection task, demonstrating its effectiveness in challenging scenarios.
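
For orientation, here is the classic CUSUM recursion with parametric Gaussian log-likelihood-ratio increments; the paper's contribution is to replace these parametric increments with ones built from score functions estimated via denoising score matching.

```python
import numpy as np

def cusum(stream, mu0=0.0, mu1=1.0, sigma=1.0, threshold=8.0):
    """Classic CUSUM for a Gaussian mean shift mu0 -> mu1."""
    s = 0.0
    for t, x in enumerate(stream):
        # Log-likelihood ratio log[N(x; mu1) / N(x; mu0)].
        llr = ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        s = max(0.0, s + llr)
        if s > threshold:
            return t  # alarm time
    return None

rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0, 1, 200), rng.normal(1, 1, 200)])
print("change at t=200, alarm at t =", cusum(stream))
```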

[LG-48] Ultralow-dimensionality reduction for identifying critical transitions by spatial-temporal PCA

链接: https://arxiv.org/abs/2501.12582
作者: Pei Chen,Yaofang Suo,Rui Liu,Luonan Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discovering dominant patterns and exploring dynamic behaviors, especially critical state transitions and tipping points, in high-dimensional time-series data are challenging tasks in the study of real-world complex systems, which demand interpretable data representations to facilitate comprehension of both spatial and temporal information within the original data space. Here, we propose a general and analytical ultralow-dimensionality reduction method for dynamical systems named spatial-temporal principal component analysis (stPCA) to fully represent the dynamics of a high-dimensional time-series by only a single latent variable without distortion, which transforms high-dimensional spatial information into one-dimensional temporal information based on nonlinear delay-embedding theory. The dynamics of this single variable is analytically solved and theoretically preserves the temporal property of the original high-dimensional time-series, thereby accurately and reliably identifying the tipping point before an upcoming critical transition. Its applications to real-world datasets such as individual-specific heterogeneous ICU records demonstrated the effectiveness of stPCA, which quantitatively and robustly provides the early-warning signals of the critical/tipping state for each patient.
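
A crude sketch of the spatial-to-temporal idea under strong simplifications: reduce the spatial dimension, delay-embed the resulting series (nonlinear delay-embedding theory motivates this step), extract a single latent variable by PCA, and watch its rolling variance as an early-warning proxy. stPCA's analytical construction is not reproduced here; all data and window sizes are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def delay_embed(x, L):
    """Stack L delayed copies of a 1-D series (Hankel matrix)."""
    return np.column_stack([x[i:len(x) - L + i + 1] for i in range(L)])

rng = np.random.default_rng(0)
T, d = 500, 20
drift = np.linspace(0, 3, T) ** 2 / 9            # system approaching a transition
X = rng.normal(size=(T, d)) + drift[:, None]     # hypothetical observations

spatial = PCA(n_components=1).fit_transform(X).ravel()  # spatial reduction
H = delay_embed(spatial, L=10)                          # temporal embedding
latent = PCA(n_components=1).fit_transform(H).ravel()   # single latent variable

# A rising rolling variance of the latent variable serves as a crude
# early-warning signal for the tipping point.
w = 50
roll_var = np.array([latent[i:i + w].var() for i in range(len(latent) - w)])
print("variance early vs late:", roll_var[:5].mean(), roll_var[-5:].mean())
```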

[LG-49] Structural and mechanical properties of W-Cu compounds characterized by a neural-network-based potential

链接: https://arxiv.org/abs/2501.12558
作者: Jianchuan Liu,Tao Chen,Sheng Mao,Mohan Chen
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tungsten-copper (W-Cu) compounds are widely utilized in various industrial fields due to their exceptional mechanical properties. In this study, we have developed a neural-network-based deep potential (DP) model that covers a wide range of temperatures, ranging from 0 to 3,000 K, and pressures, varying from 0 to 10 GPa. This study presents a model trained using density functional theory data for full-concentration Cu$_x$W$_{100-x}$ compounds. Through this model, we systematically investigate the structural and mechanical properties of W-Cu alloys and have the following findings. First, the bulk modulus (B) and Young’s modulus (E) of W-Cu alloys exhibit a linear decline as the Cu content increases, indicating a softening trend in the Cu$_x$W$_{100-x}$ compounds as the Cu concentration rises. Second, a higher Cu content results in higher critical strain and lower critical stress for these compounds. A brittle-to-ductile transition in the deformation mode is predicted at around 37.5 at. % Cu content. Third, tensile loading tests in the W-Cu gradient structure reveal that the Cu-poor region serves as a barrier, hindering shear band propagation while promoting new shear band formation in the Cu-rich region. The above results from the DP model are anticipated to aid in exploring the physical mechanisms underlying the complex phenomena of W-Cu systems and contribute to the advancement of methodologies for materials simulation.

[LG-50] Ensemble score filter with image inpainting for data assimilation in tracking surface quasi-geostrophic dynamics with partial observations

链接: https://arxiv.org/abs/2501.12419
作者: Siming Liang,Hoang Tran,Feng Bao,Hristo G. Chipilski,Peter Jan van Leeuwen,Guannan Zhang
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Data assimilation plays a pivotal role in understanding and predicting turbulent systems within geoscience and weather forecasting, where data assimilation is used to address three fundamental challenges, i.e., high-dimensionality, nonlinearity, and partial observations. Recent advances in machine learning (ML)-based data assimilation methods have demonstrated encouraging results. In this work, we develop an ensemble score filter (EnSF) that integrates image inpainting to solve the data assimilation problems with partial observations. The EnSF method exploits exclusively designed training-free diffusion models to solve high-dimensional nonlinear data assimilation problems. Its performance has been successfully demonstrated in the context of having full observations, i.e., all the state variables are directly or indirectly observed. However, because the EnSF does not use a covariance matrix to capture the dependence between the observed and unobserved state variables, it is nontrivial to extend the original EnSF method to the partial observation scenario. In this work, we incorporate various image inpainting techniques into the EnSF to predict the unobserved states during data assimilation. At each filtering step, we first use the diffusion model to estimate the observed states by integrating the likelihood information into the score function. Then, we use image inpainting methods to predict the unobserved state variables. We demonstrate the performance of the EnSF with inpainting by tracking the Surface Quasi-Geostrophic (SQG) model dynamics under a variety of scenarios. The successful proof of concept paves the way to more in-depth investigations on exploiting modern image inpainting techniques to advance data assimilation methodology for practical geoscience and weather forecasting problems.

[LG-51] Interpolation pour l'augmentation de données : Application à la gestion des adventices de la canne à sucre à La Réunion

链接: https://arxiv.org/abs/2501.12400
作者: Frederick Fabre Ferber,Dominique Gay,Jean-Christophe Soulie,Jean Diatta,Odalric-Ambrym Maillard
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: Preprint. in French language

点击查看摘要

Abstract:Data augmentation is a crucial step in the development of robust supervised learning models, especially when dealing with limited datasets. This study explores interpolation techniques for the augmentation of geo-referenced data, with the aim of predicting the presence of Commelina benghalensis L. in sugarcane plots in La Réunion. Given the spatial nature of the data and the high cost of data collection, we evaluated two interpolation approaches: Gaussian processes (GPs) with different kernels and kriging with various variograms. The objectives of this work are threefold: (i) to identify which interpolation methods offer the best predictive performance for various regression algorithms, (ii) to analyze the evolution of performance as a function of the number of observations added, and (iii) to assess the spatial consistency of augmented datasets. The results show that GP-based methods, in particular with combined kernels (GP-COMB), significantly improve the performance of regression algorithms while requiring less additional data. Although kriging shows slightly lower performance, it is distinguished by a more homogeneous spatial coverage, a potential advantage in certain contexts.
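
A minimal sklearn sketch of GP-based augmentation with a combined kernel in the spirit of GP-COMB; the kernel mix, length scales, and toy presence signal are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(40, 2))            # observed plot locations
presence = np.sin(coords[:, 0] / 20) + 0.1 * rng.normal(size=40)  # toy signal

# Combined kernel: smooth trend + rougher local structure + observation noise.
kernel = RBF(length_scale=20.0) + Matern(length_scale=10.0, nu=1.5) \
         + WhiteKernel(noise_level=0.05)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(coords, presence)

# Augment the dataset by interpolating at new, unobserved locations.
new_coords = rng.uniform(0, 100, size=(200, 2))
mean, std = gp.predict(new_coords, return_std=True)
augmented_X = np.vstack([coords, new_coords])
augmented_y = np.concatenate([presence, mean])
print(augmented_X.shape, augmented_y.shape)  # (240, 2) (240,)
```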

信息检索

[IR-0] OLS4: A new Ontology Lookup Service for a growing interdisciplinary knowledge ecosystem

链接: https://arxiv.org/abs/2501.13034
作者: James McLaughlin,Josh Lagrimas,Haider Iqbal,Helen Parkinson,Henriette Harmse
类目: Information Retrieval (cs.IR)
*备注: 4 pages plus references

点击查看摘要

Abstract:The Ontology Lookup Service (OLS) is an open source search engine for ontologies which is used extensively in the bioinformatics and chemistry communities to annotate biological and biomedical data with ontology terms. Recently there has been a significant increase in the size and complexity of ontologies due to new scales of biological knowledge, such as spatial transcriptomics, new ontology development methodologies, and curation on an increased scale. Existing Web-based tools for ontology browsing such as BioPortal and OntoBee do not support the full range of definitions used by today’s ontologies. In order to support the community going forward, we have developed OLS4, implementing the complete OWL2 specification, internationalization support for multiple languages, and a new user interface with UX enhancements such as links out to external databases. OLS4 has replaced OLS3 in production at EMBL-EBI, and has a backwards-compatible API that supports OLS3 users in transitioning.
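
A hedged example of querying the service; the endpoint path and response fields below follow the public OLS search API at EMBL-EBI but should be treated as assumptions rather than a guaranteed contract.

```python
import requests

# Assumed endpoint: the OLS4 deployment of the OLS3-compatible search API.
resp = requests.get(
    "https://www.ebi.ac.uk/ols4/api/search",
    params={"q": "diabetes mellitus", "rows": 3},
    timeout=30,
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:  # assumed Solr-style payload
    print(doc.get("obo_id"), "-", doc.get("label"))
```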

[IR-1] Designing and Evaluating an Educational Recommender System with Different Levels of User Control

链接: https://arxiv.org/abs/2501.12894
作者: Qurat Ul Ain,Mohamed Amine Chatti,William Kana Tsoplefack,Rawaa Alatrash,Shoeb Joarder
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: Published in IntRS’24: Joint Workshop on Interfaces and Human Decision Making for Recommender Systems, October 18, 2024, Bari (Italy)

点击查看摘要

Abstract:Educational recommender systems (ERSs) play a crucial role in personalizing learning experiences and enhancing educational outcomes by providing recommendations of personalized resources and activities to learners, tailored to their individual learning needs. However, their effectiveness is often diminished by insufficient user control and limited transparency. To address these challenges, in this paper, we present the systematic design and evaluation of an interactive ERS, in which we introduce different levels of user control. Concretely, we introduce user control around the input (i.e., user profile), process (i.e., recommendation algorithm), and output (i.e., recommendations) of the ERS. To evaluate our system, we conducted an online user study (N=30) to explore the impact of user control on users’ perceptions of the ERS in terms of several important user-centric aspects. Moreover, we investigated the effects of user control on multiple recommendation goals, namely transparency, trust, and satisfaction, as well as the interactions between these goals. Our results demonstrate the positive impact of user control on user perceived benefits of the ERS. Moreover, our study shows that user control strongly correlates with transparency and moderately correlates with trust and satisfaction. In terms of interaction between these goals, our results reveal that transparency moderately correlates and trust strongly correlates with satisfaction. Whereas, transparency and trust stand out as less correlated with each other.

[IR-2] A systematic data characteristic understanding framework towards physical-sensor big data challenges

链接: https://arxiv.org/abs/2501.12720
作者: Zhipeng Ma,Bo Nørregaard Jørgensen,Zheng Grace Ma
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Big data present new opportunities for modern society while posing challenges for data scientists. Recent advancements in sensor networks and the widespread adoption of IoT have led to the collection of physical-sensor data on an enormous scale. However, significant challenges arise in high-quality big data analytics. To uncover big data challenges and enhance data quality, it is essential to quantitatively unveil data characteristics. Furthermore, the existing studies lack analysis of the specific time-related characteristics. Enhancing the efficiency and precision of data analytics through the big data lifecycle requires a comprehensive understanding of data characteristics to address the hidden big data challenges. To fill in the research gap, this paper proposes a systematic data characteristic framework based on a 6Vs model. The framework aims to unveil the data characteristics in terms of data volume, variety, velocity, veracity, value, and variability through a set of statistical indicators. This model improves the objectivity of data characteristic understanding by relying solely on data-driven indicators. The indicators related to time-related characteristics in physical-sensor data are also included. Furthermore, the big data challenges are linked to each dimension of the 6Vs model to gain a quantitative understanding of the data challenges. Finally, a pipeline is developed to implement the proposed framework, and two case studies are conducted to illustrate the process of understanding the physical-sensor data characteristics and making recommendations for data preprocessing to address the big data challenges. The proposed framework is able to analyze the characteristics of all physical-sensor data, therefore, identifying potential challenges in subsequent analytics, and providing recommendations for data preprocessing.

[IR-3] Exploring Wikipedia Gender Diversity Over Time – The Wikipedia Gender Dashboard (WGD)

链接: https://arxiv.org/abs/2501.12610
作者: Yahya Yunus,Tianwa Chen,Gianluca Demartini
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: 4 pages, 5 figures

点击查看摘要

Abstract:The Wikipedia editors’ community has been actively pursuing the intent of achieving gender equality. To that end, it is important to explore the historical evolution of underlying gender disparities in Wikipedia articles. This paper presents the Wikipedia Gender Dashboard (WGD), a tool designed to enable the interaction with gender distribution data, including the average age in every subclass of individuals (i.e. Astronauts, Politicians, etc.) over the years. Wikipedia APIs, DBpedia, and Wikidata endpoints were used to query the data to ensure persistent data collection. The WGD was then created with Microsoft Power BI before being embedded on a public website. The analysis of the data available in the WGD found that female articles only represent around 17% of English Wikipedia, but it has been growing steadily over the last 20 years. Meanwhile, the average age across genders decreased over time. WGD also shows that most subclasses of 'Person' are male-dominated. Wikipedia editors can make use of WGD to locate areas with marginalized genders in Wikipedia, and increase their efforts to produce more content providing coverage for those genders to achieve better gender equality in Wikipedia.
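
As a hedged sketch of the kind of query behind such a dashboard, the snippet below counts humans by gender via the public Wikidata SPARQL endpoint (P31/Q5/P21 are standard Wikidata identifiers; a full scan of all humans may hit endpoint timeouts and would be narrowed by subclass in practice).

```python
import requests

query = """
SELECT ?genderLabel (COUNT(?person) AS ?n) WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P21 ?gender .      # sex or gender
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?genderLabel
ORDER BY DESC(?n)
LIMIT 5
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "wgd-sketch/0.1"},  # polite identification
    timeout=120,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["genderLabel"]["value"], row["n"]["value"])
```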

附件下载

点击下载今日全部论文列表