This blog post contains the latest paper list retrieved from Arxiv.org on 2025-01-20. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a regular schedule, please leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2025-01-20)

A total of 302 papers are updated today, including:

  • Natural Language Processing: 37 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 76 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 60 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 78 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation SIGDIAL2024

[Quick Read]: This paper addresses the inadequacy of the standard language modeling (LM) loss for dialogue modeling in open-domain dialogue systems. Although the LM loss works well for general language generation, it often fails to capture the central thought of a dialogue, so generated responses can lack meaning or interpretability. The paper therefore proposes a new auxiliary loss, the Bag-of-Keywords (BoK) loss, which captures the core idea of the dialogue by predicting the keywords of the next utterance, improving the quality and interpretability of generated responses. BoK loss upgrades the traditional Bag-of-Words (BoW) loss: whereas BoW loss predicts all words of the next utterance, BoK loss predicts only the keywords, focusing on the dialogue's core content. The authors train T5 (encoder-decoder) and DialoGPT (decoder-only) models to minimize the weighted sum of the BoK and LM losses (BoK-LM loss). Experiments show that adding the BoK loss significantly improves dialogue generation quality and enables post-hoc interpretability, and that the BoK-LM loss also performs comparably to state-of-the-art metrics when used as a reference-free evaluation metric.

Link: https://arxiv.org/abs/2501.10328
Authors: Suvodip Dey, Maunendra Sankar Desarkar
Affiliations: Indian Institute of Technology Hyderabad
Subjects: Computation and Language (cs.CL)
Comments: Accepted at SIGDIAL 2024

Abstract:The standard language modeling (LM) loss by itself has been shown to be inadequate for effective dialogue modeling. As a result, various training approaches, such as auxiliary loss functions and leveraging human feedback, are being adopted to enrich open-domain dialogue systems. One such auxiliary loss function is Bag-of-Words (BoW) loss, defined as the cross-entropy loss for predicting all the words/tokens of the next utterance. In this work, we propose a novel auxiliary loss named Bag-of-Keywords (BoK) loss to capture the central thought of the response through keyword prediction and leverage it to enhance the generation of meaningful and interpretable responses in open-domain dialogue systems. BoK loss upgrades the BoW loss by predicting only the keywords or critical words/tokens of the next utterance, intending to estimate the core idea rather than the entire response. We incorporate BoK loss in both encoder-decoder (T5) and decoder-only (DialoGPT) architecture and train the models to minimize the weighted sum of BoK and LM (BoK-LM) loss. We perform our experiments on two popular open-domain dialogue datasets, DailyDialog and Persona-Chat. We show that the inclusion of BoK loss improves the dialogue generation of backbone models while also enabling post-hoc interpretability. We also study the effectiveness of BoK-LM loss as a reference-free metric and observe comparable performance to the state-of-the-art metrics on various dialogue evaluation datasets.
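
As a minimal sketch of the weighted BoK-LM objective described above (the multi-hot keyword encoding, `bok_head` projection, and loss weight are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def bok_lm_loss(lm_logits, lm_targets, pooled_context, keyword_multi_hot,
                bok_head, bok_weight=0.5):
    """Weighted sum of the LM loss and a Bag-of-Keywords auxiliary loss.

    lm_logits:         (batch, seq_len, vocab) next-token logits
    lm_targets:        (batch, seq_len) gold token ids
    pooled_context:    (batch, hidden) pooled dialogue-context representation
    keyword_multi_hot: (batch, vocab) 1.0 at keyword tokens of the next utterance
    bok_head:          nn.Linear(hidden, vocab) scoring keyword presence
    """
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())
    # One common bag-of-words-style objective: multi-label cross-entropy
    # over the vocabulary, here restricted to keywords only.
    bok_logits = bok_head(pooled_context)
    bok_loss = F.binary_cross_entropy_with_logits(bok_logits, keyword_multi_hot)
    return lm_loss + bok_weight * bok_loss
```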

[NLP-1] Large language models for automated scholarly paper review: A survey

[Quick Read]: This paper surveys the use of large language models (LLMs) for automated scholarly paper review (ASPR) and the challenges this brings. It first identifies which LLMs are currently used for ASPR and reviews how LLM technology has addressed ASPR-related technical bottlenecks. It then explores the new methods, datasets, source code, and online systems that have emerged alongside LLMs, summarizes the performance and problems of LLMs in ASPR, and surveys the attitudes and reactions of publishers and academia. Finally, it discusses the challenges facing the development of LLMs for ASPR. The contribution is a holistic survey and analysis intended to serve as an inspirational reference for researchers and to push ASPR toward actual implementation.

Link: https://arxiv.org/abs/2501.10326
Authors: Zhenzhen Zhuang, Jiandong Chen, Hongfeng Xu, Yuwen Jiang, Jialiang Lin
Affiliations: School of Computer Science and Engineering, Guangzhou Institute of Science and Technology, Guangzhou, China; School of Economics and Management, Guizhou Normal University, Guiyang, China; School of Artificial Intelligence, Guangzhou Institute of Science and Technology, Guangzhou, China
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: Work in progress

Abstract:Large language models (LLMs) have significantly impacted human society, influencing various domains. Among them, academia is not simply a domain affected by LLMs, but it is also the pivotal force in the development of LLMs. In academic publications, this phenomenon is represented during the incorporation of LLMs into the peer review mechanism for reviewing manuscripts. We proposed the concept of automated scholarly paper review (ASPR) in our previous paper. As the incorporation grows, it now enters the coexistence phase of ASPR and peer review, which is described in that paper. LLMs hold transformative potential for the full-scale implementation of ASPR, but they also pose new issues and challenges that need to be addressed. In this survey paper, we aim to provide a holistic view of ASPR in the era of LLMs. We begin with a survey to find out which LLMs are used to conduct ASPR. Then, we review what ASPR-related technological bottlenecks have been solved with the incorporation of LLM technology. After that, we move on to explore new methods, new datasets, new source code, and new online systems that come with LLMs for ASPR. Furthermore, we summarize the performance and issues of LLMs in ASPR, and investigate the attitudes and reactions of publishers and academia to ASPR. Lastly, we discuss the challenges associated with the development of LLMs for ASPR. We hope this survey can serve as an inspirational reference for the researchers and promote the progress of ASPR for its actual implementation.

[NLP-2] Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models

[Quick Read]: This paper tackles the challenges of conventional subword tokenizers in NLP, including large vocabularies, limited adaptability to new domains or languages, and sensitivity to spelling errors and variations. It proposes a hierarchical autoregressive language model that combines character-level and word-level processing. The key idea is a lightweight character-level encoder that converts character sequences into word embeddings, which are processed by a word-level backbone model and then decoded back into characters by a compact character-level decoder. This retains the sequence-compression benefit of word-level tokenization without relying on a rigid, predefined vocabulary. Experiments at scales up to 7 billion parameters show that the model matches the downstream task performance of subword-tokenizer-based models while being significantly more robust to input perturbations. During continued pretraining on an out-of-domain language, it also trains almost twice as fast, achieves superior performance on the target language, and retains more previously learned knowledge. Hierarchical Transformers thus pave the way for NLP systems that are more robust, flexible, and generalizable across languages and domains.

Link: https://arxiv.org/abs/2501.10322
Authors: Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, Lukas Balles
Affiliations: Aleph Alpha Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large vocabularies, limited adaptability to new domains or languages, and sensitivity to spelling errors and variations. To overcome these limitations, we investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing. It employs a lightweight character-level encoder to convert character sequences into word embeddings, which are then processed by a word-level backbone model and decoded back into characters via a compact character-level decoder. This method retains the sequence compression benefits of word-level tokenization without relying on a rigid, predefined vocabulary. We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models while exhibiting significantly greater robustness to input perturbations. Additionally, during continued pretraining on an out-of-domain language, our model trains almost twice as fast, achieves superior performance on the target language, and retains more of its previously learned knowledge. Hierarchical transformers pave the way for NLP systems that are more robust, flexible, and generalizable across languages and domains.
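
A toy sketch of the hierarchical layout described above (character encoder → word-level backbone → character decoder); the module sizes, mean-pooling, and non-causal layers are illustrative simplifications, not the paper's configuration:

```python
import torch
import torch.nn as nn

class HierarchicalLM(nn.Module):
    """Toy hierarchical autoregressive LM: chars -> word embeddings -> chars."""
    def __init__(self, n_chars=256, char_dim=64, word_dim=512, max_word_len=16):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Lightweight character-level encoder producing one vector per word
        self.encoder = nn.TransformerEncoderLayer(char_dim, nhead=4, batch_first=True)
        self.to_word = nn.Linear(char_dim, word_dim)
        # Word-level backbone over the sequence of word embeddings
        self.backbone = nn.TransformerEncoderLayer(word_dim, nhead=8, batch_first=True)
        # Compact character-level decoder head
        self.decoder = nn.Linear(word_dim, max_word_len * n_chars)
        self.max_word_len, self.n_chars = max_word_len, n_chars

    def forward(self, char_ids):
        # char_ids: (batch, n_words, max_word_len) character ids per word
        b, w, l = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, l))
        x = self.encoder(x).mean(dim=1)        # pool characters into a word vector
        words = self.to_word(x).view(b, w, -1)
        h = self.backbone(words)               # contextualize word embeddings
        logits = self.decoder(h)               # predict characters from word states
        return logits.view(b, w, self.max_word_len, self.n_chars)
```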

[NLP-3] Natural Language Processing of Privacy Policies: A Survey

[Quick Read]: This paper addresses the challenges privacy policies pose for user notice and disclosure, in particular how Natural Language Processing (NLP) can improve the usability and effectiveness of privacy notices. Through a systematic analysis of 109 papers at the intersection of NLP and privacy policies, it examines the current state of NLP applications, the effectiveness of existing approaches, and directions for future work. The key elements are: 1) using NLP to annotate and classify privacy texts so that privacy policies become easier to understand; 2) identifying methodologies that can be further enhanced to provide more robust privacy policies; and 3) identifying gaps in the current state of the art, notably research opportunities in summarization, corpus generation, contextualized word embeddings, and fine-grained classification.

Link: https://arxiv.org/abs/2501.10319
Authors: Andrick Adhikari, Sanchari Das, Rinku Dewri
Affiliations: University of Denver; George Mason University
Subjects: Computation and Language (cs.CL)
Comments: 27 pages

Abstract:Natural Language Processing (NLP) is an essential subset of artificial intelligence. It has become effective in several domains, such as healthcare, finance, and media, to identify perceptions, opinions, and misuse, among others. Privacy is no exception, and initiatives have been taken to address the challenges of usable privacy notifications to users with the help of NLP. To this aid, we conduct a literature review by analyzing 109 papers at the intersection of NLP and privacy policies. First, we provide a brief introduction to privacy policies and discuss various facets of associated problems, which necessitate the application of NLP to elevate the current state of privacy notices and disclosures to users. Subsequently, we a) provide an overview of the implementation and effectiveness of NLP approaches for better privacy policy communication; b) identify the methodologies that can be further enhanced to provide robust privacy policies; and c) identify the gaps in the current state-of-the-art research. Our systematic analysis reveals that several research papers focus on annotating and classifying privacy texts for analysis but need to adequately dwell on other aspects of NLP applications, such as summarization. More specifically, ample research opportunities exist in this domain, covering aspects such as corpus generation, summarization vectors, contextualized word embedding, identification of privacy-relevant statement categories, fine-grained classification, and domain-specific model tuning.

[NLP-4] Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling

[Quick Read]: This paper addresses two problems with LLM-based task-oriented dialogue agents: hallucination, i.e., responses that seem plausible but are factually incorrect, and users' overreliance on these agents, accepting AI suggestions even when they are wrong. The key to the solution is an accountability model that adds friction turns in cases of model uncertainty and dialogue state tracking (DST) errors, reducing overreliance. Concretely, the accountability model is an augmented LLM with an additional accountability head that acts as a binary classifier predicting the slots of the dialogue state. Experiments with three backbone LLMs (Llama, Mistral, Gemma) on two task-oriented datasets (MultiWOZ and Snips) show that the approach not only reliably estimates AI agent errors but also guides the LLM decoder toward more accurate actions. Incorporating the accountability head yields roughly a 3% absolute improvement in joint goal accuracy on MultiWOZ, and the method also lets the agent self-correct its actions, boosting performance by a further 3%.

Link: https://arxiv.org/abs/2501.10316
Authors: Suvodip Dey, Yi-Jyun Sun, Gokhan Tur, Dilek Hakkani-Tur
Affiliations: University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent LLMs have enabled significant advancements for conversational agents. However, they are also well-known to hallucinate, i.e., they often produce responses that seem plausible but are not factually correct. On the other hand, users tend to over-rely on LLM-based AI agents; they accept the AI’s suggestion even when it is wrong. Adding good friction, such as explanations or getting user confirmations, has been proposed as a mitigation in AI-supported decision-making systems. In this paper, we propose an accountability model for LLM-based task-oriented dialogue agents to address user overreliance via friction turns in cases of model uncertainty and errors associated with dialogue state tracking (DST). The accountability model is an augmented LLM with an additional accountability head, which functions as a binary classifier to predict the slots of the dialogue states. We perform our experiments with three backbone LLMs (Llama, Mistral, Gemma) on two established task-oriented datasets (MultiWOZ and Snips). Our empirical findings demonstrate that this approach not only enables reliable estimation of AI agent errors but also guides the LLM decoder in generating more accurate actions. We observe around 3% absolute improvement in joint goal accuracy by incorporating accountability heads in modern LLMs for the MultiWOZ dataset. We also show that this method enables the agent to self-correct its actions, further boosting its performance by 3%. Finally, we discuss the application of accountability modeling to prevent user overreliance by introducing friction.
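
A minimal sketch of the "extra binary-classifier head" idea (the pooling scheme, slot inventory, and threshold below are illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class AccountabilityHead(nn.Module):
    """Binary classifier over dialogue-state slots, attached to an LLM."""
    def __init__(self, hidden_size: int, num_slots: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_slots)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, seq_len, hidden) from the backbone LLM.
        # Pool the final token's representation and predict, per slot,
        # a probability that the tracked slot value is correct.
        pooled = last_hidden[:, -1, :]
        return torch.sigmoid(self.classifier(pooled))  # (batch, num_slots)

# Usage sketch: flag low-confidence slots and trigger a friction turn
# (e.g., ask the user to confirm) instead of acting directly.
def needs_friction(slot_probs: torch.Tensor, threshold: float = 0.5) -> bool:
    return bool((slot_probs < threshold).any())
```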

[NLP-5] Computational Protein Science in the Era of Large Language Models (LLMs)

[Quick Read]: This paper systematically surveys the application of large language models (LLMs) in computational protein science, particularly how protein language models (pLMs) address problems across the protein sequence-structure-function paradigm. It first categorizes existing pLMs by the protein knowledge they capture: underlying sequence patterns, explicit structural and functional information, and external scientific languages. It then reviews the achievements of pLMs in protein structure prediction, protein function prediction, and protein design, and discusses their practical applications in antibody design, enzyme design, and drug discovery, before outlining promising future directions. The key is leveraging the generalization capability of LLMs to build pLMs that grasp the foundational knowledge of proteins and can be applied across a broad range of protein modeling tasks, advancing computational protein science as a whole.

Link: https://arxiv.org/abs/2501.10282
Authors: Wenqi Fan, Yi Zhou, Shijie Wang, Yuyao Yan, Hui Liu, Qian Zhao, Le Song, Qing Li
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Comments:

Abstract:Considering the significance of proteins, computational protein science has always been a critical scientific field, dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm. In the last few decades, Artificial Intelligence (AI) has made significant impacts in computational protein science, leading to notable successes in specific protein modeling tasks. However, those previous AI models still meet limitations, such as the difficulty in comprehending the semantics of protein sequences, and the inability to generalize across a wide range of protein modeling tasks. Recently, LLMs have emerged as a milestone in AI due to their unprecedented language processing generalization capability. They can promote comprehensive progress in fields rather than solving individual tasks. As a result, researchers have actively introduced LLM techniques in computational protein science, developing protein Language Models (pLMs) that skillfully grasp the foundational knowledge of proteins and can be effectively generalized to solve a diversity of sequence-structure-function reasoning problems. While witnessing prosperous developments, it’s necessary to present a systematic overview of computational protein science empowered by LLM techniques. First, we summarize existing pLMs into categories based on their mastered protein knowledge, i.e., underlying sequence patterns, explicit structural and functional information, and external scientific languages. Second, we introduce the utilization and adaptation of pLMs, highlighting their remarkable achievements in promoting protein structure prediction, protein function prediction, and protein design studies. Then, we describe the practical application of pLMs in antibody design, enzyme design, and drug discovery. Finally, we specifically discuss the promising future directions in this fast-growing field.

[NLP-6] A Simple but Effective Closed-form Solution for Extreme Multi-label Learning ECIR25

[Quick Read]: This paper addresses the tuning complexity and reimplementation difficulty caused by the many hyperparameters of extreme multi-label learning (XML) models. The key is a simple method based on ridge regression: it has a closed-form solution and only a single hyperparameter, which greatly simplifies tuning. The authors additionally introduce a frequency-based weighting scheme that substantially improves prediction of low-frequency labels, which carry informative content but have limited data. Experiments on various XML benchmark datasets show performance comparable to, or even exceeding, models with numerous hyperparameters, while the frequency-based weighting requires almost no changes in implementation.

Link: https://arxiv.org/abs/2501.10179
Authors: Kazuma Onishi, Katsuhiko Hayashi
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, Accepted at ECIR25

Abstract:Extreme multi-label learning (XML) is a task of assigning multiple labels from an extremely large set of labels to each data instance. Many current high-performance XML models are composed of a lot of hyperparameters, which complicates the tuning process. Additionally, the models themselves are adapted specifically to XML, which complicates their reimplementation. To remedy this problem, we propose a simple method based on ridge regression for XML. The proposed method not only has a closed-form solution but also is composed of a single hyperparameter. Since there are no precedents on applying ridge regression to XML, this paper verified the performance of the method by using various XML benchmark datasets. Furthermore, we enhanced the prediction of low-frequency labels in XML, which hold informative content. This prediction is essential yet challenging because of the limited amount of data. Here, we employed a simple frequency-based weighting. This approach greatly simplifies the process compared with existing techniques. Experimental results revealed that it can achieve levels of performance comparable to, or even exceeding, those of models with numerous hyperparameters. Additionally, we found that the frequency-based weighting significantly improved the predictive performance for low-frequency labels, while requiring almost no changes in implementation. The source code for the proposed method is available on github at this https URL.
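
For reference, the ridge-regression closed form the abstract alludes to is W = (XᵀX + λI)⁻¹XᵀY, with λ as the single hyperparameter. A minimal sketch with a simple frequency-based label weighting (the paper's exact weighting scheme may differ):

```python
import numpy as np

def ridge_xml_fit(X, Y, lam=1.0, freq_weighting=True):
    """Closed-form ridge regression for extreme multi-label learning.

    X: (n_samples, n_features) feature matrix
    Y: (n_samples, n_labels)  binary label matrix
    lam: the single regularization hyperparameter
    """
    if freq_weighting:
        # Illustrative choice: inverse-frequency scaling so that rare
        # (low-frequency) labels are not drowned out by frequent ones.
        freq = Y.sum(axis=0) + 1.0
        Y = Y / freq
    d = X.shape[1]
    # W = (X^T X + lam * I)^{-1} X^T Y, solved without explicit inversion
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    return W  # score new instances with X_new @ W

# Example: 100 samples, 20 features, 500 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = (rng.random(size=(100, 500)) < 0.01).astype(float)
W = ridge_xml_fit(X, Y, lam=0.1)
scores = X @ W  # rank labels per instance by score
```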

[NLP-7] Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval

[Quick Read]: This paper addresses the challenges of applying Natural Language Processing (NLP) in the Islamic domain, where substantial in-domain corpora exist only in Arabic while other languages, including English, are poorly resourced. Building on the robust XLM-R model, the authors use a language reduction technique to create a lightweight bilingual large language model (LLM) and a multi-stage training process to improve retrieval performance. Key elements include training the retrieval model on a large retrieval dataset such as MS MARCO together with smaller in-domain datasets, and curating an English in-domain retrieval dataset via data augmentation and a reliable Islamic source. Combining domain adaptation with the multi-stage training method enables the bilingual Islamic neural retrieval model to outperform monolingual models on downstream retrieval tasks.

Link: https://arxiv.org/abs/2501.10175
Authors: Vera Pavlova
Affiliations: rttl labs, UAE
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This study examines the use of Natural Language Processing (NLP) technology within the Islamic domain, focusing on developing an Islamic neural retrieval model. By leveraging the robust XLM-R model, the research employs a language reduction technique to create a lightweight bilingual large language model (LLM). Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic while limited in other languages, including English. The work utilizes a multi-stage training process for retrieval models, incorporating large retrieval datasets, such as MS MARCO, and smaller, in-domain datasets to improve retrieval performance. Additionally, we have curated an in-domain retrieval dataset in English by employing data augmentation techniques and involving a reliable Islamic source. This approach enhances the domain-specific dataset for retrieval, leading to further performance gains. The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.

[NLP-8] Dual Debiasing: Remove Stereotypes and Keep Factual Gender for Fair Language Modeling and Translation

[Quick Read]: This paper addresses gender-stereotype bias in language models while ensuring that the models preserve their versatile capabilities and represent genders equitably. The authors introduce a streamlined Dual Debiasing Algorithm through Model Adaptation (2DAMA). Its key innovation is robustly reducing stereotypical bias while preserving the desired factual gender information encoded by language models, so that performance on natural language processing tasks is not compromised. 2DAMA effectively reduces gender bias in English and is one of the first approaches to facilitate the mitigation of stereotypical tendencies in translation.

Link: https://arxiv.org/abs/2501.10150
Authors: Tomasz Limisiewicz, David Mareček, Tomáš Musil
Affiliations: Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mitigation of biases, such as language models' reliance on gender stereotypes, is a crucial endeavor required for the creation of reliable and useful language technology. The crucial aspect of debiasing is to ensure that the models preserve their versatile capabilities, including their ability to solve language tasks and equitably represent various genders. To address this issue, we introduce a streamlined Dual Debiasing Algorithm through Model Adaptation (2DAMA). Novel Dual Debiasing enables robust reduction of stereotypical bias while preserving desired factual gender information encoded by language models. We show that 2DAMA effectively reduces gender bias in English and is one of the first approaches facilitating the mitigation of stereotypical tendencies in translation. The proposed method's key advantage is the preservation of factual gender cues, which are useful in a wide range of natural language processing tasks.

[NLP-9] ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario

[Quick Read]: This paper addresses the under-explored evaluation of LLMs' function calling abilities in real-world scenarios. Existing evaluations struggle with the complexity of multi-step and constrained function calling, which requires long-parameter filling, parameter-value reasoning, and 128k-long contexts. The authors introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios, and ComplexEval, an automatic framework for quantitatively evaluating complex function-calling tasks. Comprehensive experiments expose the deficiencies of state-of-the-art LLMs in function calling and suggest future directions for optimizing these capabilities.

Link: https://arxiv.org/abs/2501.10132
Authors: Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang
Affiliations: Zhipu AI; Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Enhancing large language models (LLMs) with real-time APIs can help generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored due to the complexity of data collection and evaluation. In this work, we introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench encompasses multi-step and constrained function calling, which requires long-parameter filing, parameter value reasoning, and 128k long context. Additionally, we propose an automatic framework, ComplexEval, for quantitatively evaluating complex function calling tasks. Through comprehensive experiments, we demonstrate the deficiencies of state-of-the-art LLMs in function calling and suggest future directions for optimizing these capabilities. The data and code are available at this https URL.

[NLP-10] BBPOS: BERT-based Part-of-Speech Tagging for Uzbek

[Quick Read]: This paper advances NLP for the low-resource Uzbek language by targeting part-of-speech (POS) tagging. It evaluates two previously untested monolingual Uzbek BERT models and introduces the first publicly available UPOS-tagged benchmark dataset for Uzbek. The key result is that the fine-tuned monolingual models achieve 91% average accuracy, outperforming both the baseline multilingual BERT and a rule-based tagger. Notably, the models capture intermediate POS changes through affixes and demonstrate a context sensitivity that existing rule-based taggers lack.

Link: https://arxiv.org/abs/2501.10107
Authors: Latofat Bobojonova, Arofat Akhundjanova, Phil Ostheimer, Sophie Fellenz
Affiliations: RPTU Kaiserslautern-Landau; Saarland University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper advances NLP research for the low-resource Uzbek language by evaluating two previously untested monolingual Uzbek BERT models on the part-of-speech (POS) tagging task and introducing the first publicly available UPOS-tagged benchmark dataset for Uzbek. Our fine-tuned models achieve 91% average accuracy, outperforming the baseline multi-lingual BERT as well as the rule-based tagger. Notably, these models capture intermediate POS changes through affixes and demonstrate context sensitivity, unlike existing rule-based taggers.
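
A minimal sketch of POS tagging as token classification with a BERT-style encoder (the checkpoint name, tag subset, and example sentence are placeholders, not the paper's monolingual Uzbek models or data):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder checkpoint; the paper fine-tunes monolingual Uzbek BERT models.
checkpoint = "bert-base-multilingual-cased"
upos_tags = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "ADP", "PUNCT"]  # UPOS subset

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=len(upos_tags))

def encode_with_labels(words, word_label_ids):
    """Tokenize pre-split words; label only the first subword of each word."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for word_idx in enc.word_ids():
        if word_idx is None or word_idx == prev:
            labels.append(-100)              # -100 is ignored by the loss
        else:
            labels.append(word_label_ids[word_idx])
        prev = word_idx
    enc["labels"] = labels
    return enc

# "Men kitob o'qidim." ~ "I read a book." -> PRON NOUN VERB PUNCT
batch = encode_with_labels(["Men", "kitob", "o'qidim", "."], [4, 0, 1, 6])
```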

[NLP-11] Author-Specific Linguistic Patterns Unveiled: A Deep Learning Study on Word Class Distributions

[Quick Read]: This study uses computational linguistics to uncover author-specific word-class distribution patterns via part-of-speech (POS) tagging and bigram analysis. The key is classifying literary authors with deep neural networks trained on POS tag vectors and bigram frequency matrices derived from their works, using both fully connected and convolutional architectures to compare unigram- and bigram-based representations. While unigram features achieve moderate classification accuracy, bigram-based models significantly improve performance, suggesting that sequential word-class patterns are more distinctive of authorial style. Multi-dimensional scaling (MDS) visualizations reveal meaningful clustering of the authors' works, supporting the hypothesis that stylistic nuances can be captured through computational methods. The findings highlight the potential of deep learning and linguistic feature analysis for author profiling and literary studies.

Link: https://arxiv.org/abs/2501.10072
Authors: Patrick Krauss, Achim Schilling
Affiliations: University Hospital Erlangen; FAU Erlangen-Nürnberg
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Deep learning methods have been increasingly applied to computational linguistics to uncover patterns in text data. This study investigates author-specific word class distributions using part-of-speech (POS) tagging and bigram analysis. By leveraging deep neural networks, we classify literary authors based on POS tag vectors and bigram frequency matrices derived from their works. We employ fully connected and convolutional neural network architectures to explore the efficacy of unigram and bigram-based representations. Our results demonstrate that while unigram features achieve moderate classification accuracy, bigram-based models significantly improve performance, suggesting that sequential word class patterns are more distinctive of authorial style. Multi-dimensional scaling (MDS) visualizations reveal meaningful clustering of authors’ works, supporting the hypothesis that stylistic nuances can be captured through computational methods. These findings highlight the potential of deep learning and linguistic feature analysis for author profiling and literary studies.
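
A minimal sketch of building a POS-bigram frequency vector of the kind fed to the classifiers (the toy tag sequence and tagset are illustrative; in practice the tags come from a POS tagger run over an author's works):

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

def pos_bigram_vector(pos_tags, tagset):
    """Normalized bigram frequency vector over a fixed POS tagset."""
    counts = Counter(pairwise(pos_tags))
    pairs = [(a, b) for a in tagset for b in tagset]  # flattened bigram matrix
    total = max(sum(counts.values()), 1)
    return [counts[p] / total for p in pairs]

# Toy example: one short tag sequence instead of a full literary corpus
tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
vec = pos_bigram_vector(tags, ["DET", "NOUN", "VERB", "ADJ"])
```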

[NLP-12] OMoE: Diversifying Mixture of Low-Rank Adaptation by Orthogonal Finetuning

[Quick Read]: This paper addresses the lack of diversity among experts when building mixture-of-experts (MoE) architectures for low-rank adaptation (LoRA): in vanilla MoE, experts collapse to similar representations, limiting the capacity of the modular design and computational efficiency. The key of the proposed Orthogonal Mixture-of-Experts (OMoE) is a Gram-Schmidt process that enforces the experts' representations to lie within the Stiefel manifold, promoting diversity. By applying the orthogonality constraint directly to the architecture, OMoE keeps the learning objective unchanged without compromising optimality. The method is simple, alleviates memory bottlenecks, and requires minimal experts compared to vanilla MoE models. Experiments on diverse commonsense reasoning benchmarks show that OMoE consistently achieves stable and efficient performance improvements over state-of-the-art methods while significantly reducing the number of required experts.

Link: https://arxiv.org/abs/2501.10062
Authors: Jinyuan Feng, Zhiqiang Pu, Tianyi Hu, Dongmin Li, Xiaolin Ai, Huimu Wang
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; JD.COM
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Building mixture-of-experts (MoE) architecture for Low-rank adaptation (LoRA) is emerging as a potential direction in parameter-efficient fine-tuning (PEFT) for its modular design and remarkable performance. However, simply stacking the number of experts cannot guarantee significant improvement. In this work, we first conduct qualitative analysis to indicate that experts collapse to similar representations in vanilla MoE, limiting the capacity of modular design and computational efficiency. Ulteriorly, Our analysis reveals that the performance of previous MoE variants maybe limited by a lack of diversity among experts. Motivated by these findings, we propose Orthogonal Mixture-of-Experts (OMoE), a resource-efficient MoE variant that trains experts in an orthogonal manner to promote diversity. In OMoE, a Gram-Schmidt process is leveraged to enforce that the experts’ representations lie within the Stiefel manifold. By applying orthogonal constraints directly to the architecture, OMoE keeps the learning objective unchanged, without compromising optimality. Our method is simple and alleviates memory bottlenecks, as it incurs minimal experts compared to vanilla MoE models. Experiments on diverse commonsense reasoning benchmarks demonstrate that OMoE can consistently achieve stable and efficient performance improvement when compared with the state-of-the-art methods while significantly reducing the number of required experts.
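
A minimal sketch of the Gram-Schmidt orthogonalization step the abstract describes, applied to a stack of expert representation vectors (how OMoE wires this into the MoE forward pass is not reproduced here):

```python
import torch

def gram_schmidt(vectors: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Orthonormalize the rows of `vectors` (num_experts, dim) via Gram-Schmidt.

    The resulting rows are mutually orthogonal unit vectors, i.e. a point
    on the Stiefel manifold, keeping expert representations diverse.
    """
    basis = []
    for v in vectors:
        w = v.clone()
        for b in basis:
            w = w - (w @ b) * b     # remove components along earlier experts
        norm = w.norm()
        if norm > eps:              # skip (near-)linearly-dependent vectors
            basis.append(w / norm)
    return torch.stack(basis)

experts = torch.randn(4, 16)        # 4 expert representations, dim 16
ortho = gram_schmidt(experts)
print(ortho @ ortho.T)              # ~ identity matrix
```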

[NLP-13] MSTS: A Multimodal Safety Test Suite for Vision-Language Models

[Quick Read]: This paper addresses the safety of vision-language models (VLMs) as they are integrated into chat assistants and other consumer AI applications. Without proper safeguards, VLMs may give harmful advice (e.g., how to self-harm) or encourage unsafe behaviours (e.g., drug use), yet little work has evaluated VLM safety or the novel risks created by multimodal inputs. The authors introduce MSTS, a Multimodal Safety Test Suite comprising 400 test prompts across 40 fine-grained hazard categories, where each prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, they find clear safety issues in several open VLMs and observe that some VLMs are safe by accident, i.e., safe only because they fail to understand even simple test prompts. Translating MSTS into ten languages shows that non-English prompts increase the rate of unsafe model responses, and models are safer when tested with text only rather than multimodal prompts. Finally, an exploration of automated VLM safety assessment finds even the best safety classifiers to be lacking.

Link: https://arxiv.org/abs/2501.10057
Authors: Paul Röttger, Giuseppe Attanasio, Felix Friedrich, Janis Goldzycher, Alicia Parrish, Rishabh Bhardwaj, Chiara Di Bonaventura, Roman Eng, Gaia El Khoury Geagea, Sujata Goswami, Jieun Han, Dirk Hovy, Seogyeong Jeong, Paloma Jeretič, Flor Miriam Plaza-del-Arco, Donya Rooein, Patrick Schramowski, Anastassia Shaitarova, Xudong Shen, Richard Willats, Andrea Zugarini, Bertie Vidgen
Affiliations: Bocconi University; Instituto de Telecomunicações; TU Darmstadt; Hessian.AI; University of Zurich; Google DeepMind; Walled AI; King's College London; Imperial College London; Clarkson University; Lawrence Berkeley National Laboratory; KAIST; University of Pennsylvania; DFKI; CERTAIN; National University of Singapore; Contextual AI; Expert.ai
Subjects: Computation and Language (cs.CL)
Comments: under review

Abstract:Vision-language models (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe because they fail to understand even simple test prompts. We translate MSTS into ten languages, showing non-English prompts to increase the rate of unsafe model responses. We also show models to be safer when tested with text only rather than multimodal prompts. Finally, we explore the automation of VLM safety assessments, finding even the best safety classifiers to be lacking.

[NLP-14] Automatic Speech Recognition for Sanskrit with Transfer Learning

[Quick Read]: This paper addresses the profound scarcity of digital Sanskrit content (audio and text), which limits the training of AI systems, and the difficulty of developing robust NLP tools for wider accessibility given Sanskrit's intricate linguistics. The authors develop an automatic speech recognition (ASR) model for Sanskrit by applying transfer learning to OpenAI's Whisper model. After careful hyperparameter optimization, the transfer-learned model achieves a word error rate (WER) of 15.42% on the Vaksancayah dataset. An online demo is made publicly available so the model's performance can be evaluated firsthand, paving the way for improved accessibility and technological support for Sanskrit learning in the modern era. The key is using transfer learning to adapt Whisper under Sanskrit's limited data resources while achieving high recognition accuracy.

Link: https://arxiv.org/abs/2501.10024
Authors: Bidit Sadhukhan, Swami Punyeshwarananda
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper has been accepted at the 4th International Conference on Computer, Communication, Control and Information Technology (C3IT), Hooghly, India, 2024, pp. 1-5

Abstract:Sanskrit, one of humanity’s most ancient languages, has a vast collection of books and manuscripts on diverse topics that have been accumulated over millennia. However, its digital content (audio and text), which is vital for the training of AI systems, is profoundly limited. Furthermore, its intricate linguistics make it hard to develop robust NLP tools for wider accessibility. Given these constraints, we have developed an automatic speech recognition model for Sanskrit by employing transfer learning mechanism on OpenAI’s Whisper model. After carefully optimising the hyper-parameters, we obtained promising results with our transfer-learned model achieving a word error rate of 15.42% on Vaksancayah dataset. An online demo of our model is made available for the use of public and to evaluate its performance firsthand thereby paving the way for improved accessibility and technological support for Sanskrit learning in the modern era.
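
For reference, word error rate is WER = (S + D + I) / N, the word-level edit distance (substitutions, deletions, insertions) divided by the reference length. A minimal check using the `jiwer` library, with made-up transcripts in place of real ASR output:

```python
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# wer = (substitutions + deletions + insertions) / reference word count
print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 substitutions / 9 words
```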

[NLP-15] Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models

[Quick Read]: This paper tackles hallucination in large language models (LLMs), i.e., generated content that is inconsistent with the input or fabricated. The authors introduce a novel Attention-Guided SElf-Reflection (AGSER) approach for zero-shot hallucination detection. The key is to use attention contributions to split the input query into attentive and non-attentive queries, process each separately through the LLM, and compute consistency scores between the generated responses and the original answer; the difference between the two consistency scores serves as the hallucination estimator. Besides its efficacy, AGSER notably reduces computational complexity, requiring only three passes through the LLM and two sets of tokens. Extensive experiments with four widely used LLMs across three hallucination benchmarks show that the approach significantly outperforms existing zero-shot hallucination detection methods.

Link: https://arxiv.org/abs/2501.09997
Authors: Qiang Liu, Xinlong Chen, Yue Ding, Shizhen Xu, Shu Wu, Liang Wang
Affiliations: New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA); RealAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hallucination has emerged as a significant barrier to the effective application of Large Language Models (LLMs). In this work, we introduce a novel Attention-Guided SElf-Reflection (AGSER) approach for zero-shot hallucination detection in LLMs. The AGSER method utilizes attention contributions to categorize the input query into attentive and non-attentive queries. Each query is then processed separately through the LLMs, allowing us to compute consistency scores between the generated responses and the original answer. The difference between the two consistency scores serves as a hallucination estimator. In addition to its efficacy in detecting hallucinations, AGSER notably reduces computational complexity, requiring only three passes through the LLM and utilizing two sets of tokens. We have conducted extensive experiments with four widely-used LLMs across three different hallucination benchmarks, demonstrating that our approach significantly outperforms existing methods in zero-shot hallucination detection.
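
A schematic of the AGSER flow as described in the abstract; `split_by_attention`, `generate`, and `consistency` are hypothetical helpers standing in for the paper's components, and the sign convention of the final score is an assumption:

```python
def agser_score(query: str, original_answer: str,
                split_by_attention, generate, consistency) -> float:
    """Attention-guided self-reflection hallucination estimate.

    The gap between the consistency scores of the attentive and
    non-attentive sub-queries serves as the hallucination estimator.
    """
    # 1) Split the query into attentive / non-attentive parts using the
    #    model's attention contributions (one LLM pass to get attentions).
    attentive_q, non_attentive_q = split_by_attention(query)

    # 2) Answer each sub-query separately (two more LLM passes:
    #    three passes through the LLM in total).
    attentive_resp = generate(attentive_q)
    non_attentive_resp = generate(non_attentive_q)

    # 3) Consistency of each response with the original answer.
    return consistency(attentive_resp, original_answer) - \
           consistency(non_attentive_resp, original_answer)
```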

[NLP-16] Agent-as-Judge for Factual Summarization of Long Narratives

[Quick Read]: This paper addresses the inadequacy of traditional metrics (such as ROUGE and BERTScore) for evaluating LLM-generated summaries of long narratives (100K tokens), particularly with respect to factual accuracy. LLM-as-a-Judge approaches partially remedy the limitations of lexical-similarity metrics but still exhibit factual inconsistencies, especially in understanding character relationships and states. The authors introduce NarrativeFactScore, an "Agent-as-a-Judge" framework for evaluating and refining summaries. Its key component is a Character Knowledge Graph (CKG) extracted from the input text and generated summaries, against which NarrativeFactScore assesses factual consistency and provides actionable refinement guidance, such as identifying missing or erroneous facts. A detailed workflow illustration and extensive validation on widely adopted benchmarks show superior performance over competitive methods, highlighting the potential of agent-driven evaluation systems to improve the factual reliability of LLM-generated summaries.

Link: https://arxiv.org/abs/2501.09993
Authors: Yeonseok Jeong, Minsoo Kim, Seung-won Hwang, Byung-Hak Kim
Affiliations: IPAI, Seoul National University; Seoul National University; CJ Group
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated near-human performance in summarization tasks based on traditional metrics such as ROUGE and BERTScore. However, these metrics do not adequately capture critical aspects of summarization quality, such as factual accuracy, particularly for long narratives (100K tokens). Recent advances, such as LLM-as-a-Judge, address the limitations of metrics based on lexical similarity but still exhibit factual inconsistencies, especially in understanding character relationships and states. In this work, we introduce NarrativeFactScore, a novel “Agent-as-a-Judge” framework for evaluating and refining summaries. By leveraging a Character Knowledge Graph (CKG) extracted from input and generated summaries, NarrativeFactScore assesses the factual consistency and provides actionable guidance for refinement, such as identifying missing or erroneous facts. We demonstrate the effectiveness of NarrativeFactScore through a detailed workflow illustration and extensive validation on widely adopted benchmarks, achieving superior performance compared to competitive methods. Our results highlight the potential of agent-driven evaluation systems to improve the factual reliability of LLM-generated summaries.
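
As a schematic reduction of the scoring idea only (not the paper's actual implementation): the score can be viewed as the fraction of summary facts supported by the character knowledge graph, with unsupported facts returned as refinement feedback:

```python
def narrative_fact_score(summary_facts, kg_supports):
    """Toy fact score: fraction of summary facts the CKG supports.

    summary_facts: list of (subject, relation, object) triples from the summary
    kg_supports:   callable(triple) -> bool, True if the knowledge graph
                   extracted from the source supports the triple
    Returns the score and the unsupported facts (feedback for refinement).
    """
    if not summary_facts:
        return 1.0, []
    errors = [t for t in summary_facts if not kg_supports(t)]
    score = 1.0 - len(errors) / len(summary_facts)
    return score, errors
```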

[NLP-17] RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation

[Quick Read]: This paper addresses the difficulty text-to-video generation models have in producing videos with complex features, a limitation that often arises from the text encoder's inability to produce accurate embeddings, which hinders the video generation model. The key to the solution is selecting the optimal text embedding through interpolation in the embedding space, which enables the video generation model to produce the desired videos. The authors also introduce a simple algorithm that uses perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. The findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.

Link: https://arxiv.org/abs/2501.09982
Authors: Yuefan Cao, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Affiliations: Zhejiang University; The University of Texas at Austin; Independent Researcher; The University of Hong Kong; University of Wisconsin-Madison; Tsinghua University; Simons Institute for the Theory of Computing, University of California, Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Text-to-video generation models have made impressive progress, but they still struggle with generating videos with complex features. This limitation often arises from the inability of the text encoder to produce accurate embeddings, which hinders the video generation model. In this work, we propose a novel approach to overcome this challenge by selecting the optimal text embedding through interpolation in the embedding space. We demonstrate that this method enables the video generation model to produce the desired videos. Additionally, we introduce a simple algorithm using perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. Our findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.
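
A minimal sketch of the two ingredients named in the abstract: linear interpolation between two prompt embeddings, and a perpendicular-foot projection combined with cosine similarity. The target vector and candidate grid are illustrative; the paper's end-to-end selection procedure is not reproduced:

```python
import numpy as np

def interpolate(a: np.ndarray, b: np.ndarray, alpha: float) -> np.ndarray:
    """Linear interpolation between embeddings a and b."""
    return (1 - alpha) * a + alpha * b

def perpendicular_foot(a: np.ndarray, b: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Foot of the perpendicular from target t onto the line through a and b."""
    d = b - a
    alpha = float((t - a) @ d) / float(d @ d)
    return a + np.clip(alpha, 0.0, 1.0) * d   # clamp onto the segment

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pick the interpolated embedding most similar to a reference direction t.
a, b, t = np.random.randn(3, 768)
candidates = [interpolate(a, b, al) for al in np.linspace(0, 1, 11)]
best = max(candidates, key=lambda c: cosine(c, t))
```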

[NLP-18] A Survey on Multi-Turn Interaction Capabilities of Large Language Models

[Quick Read]: This paper reviews the multi-turn interaction capabilities of large language models (LLMs) in dialogue systems, i.e., the ability to maintain context across multiple dialogue turns and generate coherent, contextually relevant responses. The survey focuses on four key aspects: (1) the core model capabilities behind effective multi-turn interaction, (2) how multi-turn interaction is evaluated in current practice, (3) the general algorithms used to enhance it, and (4) potential future research directions. The aim is to deepen understanding of these core capabilities and improve evaluation methods and algorithms, strengthening downstream applications such as conversational search and recommendation, consultation services, and interactive tutoring.

Link: https://arxiv.org/abs/2501.09959
Authors: Chen Zhang, Xinyi Dai, Yaxiong Wu, Qu Yang, Yasheng Wang, Ruiming Tang, Yong Liu
Affiliations: Huawei Noah's Ark Lab
Subjects: Computation and Language (cs.CL)
Comments: Draft Version, 14 pages, Ongoing refinement over time

Abstract:Multi-turn interaction in the dialogue system research refers to a system’s ability to maintain context across multiple dialogue turns, enabling it to generate coherent and contextually relevant responses. Recent advancements in large language models (LLMs) have significantly expanded the scope of multi-turn interaction, moving beyond chatbots to enable more dynamic agentic interactions with users or environments. In this paper, we provide a focused review of the multi-turn capabilities of LLMs, which are critical for a wide range of downstream applications, including conversational search and recommendation, consultation services, and interactive tutoring. This survey explores four key aspects: (1) the core model capabilities that contribute to effective multi-turn interaction, (2) how multi-turn interaction is evaluated in current practice, (3) the general algorithms used to enhance multi-turn interaction, and (4) potential future directions for research in this field.

[NLP-19] FRAG : A Flexible Modular Framework for Retrieval-Augmented Generation based on Knowledge Graphs

[Quick Read]: This paper addresses hallucination and knowledge deficiency in large language models (LLMs). Existing Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) approaches face a trade-off between flexibility and retrieval quality: some methods preserve flexibility by avoiding KG-fine-tuned models at retrieval time, at the cost of fixed retrieval strategies and suboptimal quality, while others embed KG information into the model to improve retrieval quality at the expense of flexibility. The key of the proposed modular KG-RAG framework, FRAG (Flexible Retrieval-Augmented Generation), is to estimate the hop range of reasoning paths based solely on the query and classify it as simple or complex, then apply a pipeline tailored to the query's complexity to retrieve reasoning paths efficiently and accurately. By inferring the structural information of reasoning paths from the query text rather than the KG and employing adaptable retrieval strategies, FRAG improves retrieval quality while maintaining flexibility. Moreover, FRAG requires no extra LLM fine-tuning or calls, markedly improving efficiency and conserving resources. Experiments show that FRAG achieves state-of-the-art performance with high efficiency and low resource consumption.

Link: https://arxiv.org/abs/2501.09957
Authors: Zengyi Gao, Yukun Cao, Hairu Wang, Ao Ke, Yuan Feng, Xike Xie, S Kevin Zhou
Affiliations: University of Science and Technology of China; Data Darkness Lab, MIRACLE Center, USTC; MIRACLE Center, USTC
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:To mitigate the hallucination and knowledge deficiency in large language models (LLMs), Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) has shown promising potential by utilizing KGs as an external resource to enhance LLM reasoning. However, existing KG-RAG approaches struggle with a trade-off between flexibility and retrieval quality. Modular methods prioritize flexibility by avoiding the use of KG-fine-tuned models during retrieval, leading to fixed retrieval strategies and suboptimal retrieval quality. Conversely, coupled methods embed KG information within models to improve retrieval quality, but at the expense of flexibility. In this paper, we propose a novel flexible modular KG-RAG framework, termed FRAG, which synergizes the advantages of both approaches. FRAG estimates the hop range of reasoning paths based solely on the query and classifies it as either simple or complex. To match the complexity of the query, tailored pipelines are applied to ensure efficient and accurate reasoning path retrieval, thus fostering the final reasoning process. By using the query text instead of the KG to infer the structural information of reasoning paths and employing adaptable retrieval strategies, FRAG improves retrieval quality while maintaining flexibility. Moreover, FRAG does not require extra LLMs fine-tuning or calls, significantly boosting efficiency and conserving resources. Extensive experiments show that FRAG achieves state-of-the-art performance with high efficiency and low resource consumption.

[NLP-20] Sympathy over Polarization: A Computational Discourse Analysis of Social Media Posts about the July 2024 Trump Assassination Attempt

[Quick Read]: This paper studies the short-term effects on public opinion and discussion topics of the July 13, 2024 assassination attempt on Republican Presidential Candidate Donald Trump at his rally in Pennsylvania. The study addresses three key questions: how public sentiment toward Trump shifted over time and across regions (RQ1); whether the assassination attempt itself significantly affected public attitudes, independent of existing political alignments (RQ2); and what the major themes of online conversation were before and after the crisis, illustrating how discussion topics evolved in response to this politically charged event (RQ3). By integrating LLM-based sentiment analysis, difference-in-differences modeling, and topic modeling, the authors find that, despite baseline ideological and regional disparities, the public response following the attempt was broadly sympathetic to Trump rather than polarizing.

Link: https://arxiv.org/abs/2501.09950
Authors: Qingcheng Zeng, Guanhong Liu, Zhaoqian Xue, Diego Ford, Rob Voigt, Loni Hagen, Lingyao Li
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments:

Abstract:On July 13, 2024, at the Trump rally in Pennsylvania, someone attempted to assassinate Republican Presidential Candidate Donald Trump. This attempt sparked a large-scale discussion on social media. We collected posts from X (formerly known as Twitter) one week before and after the assassination attempt and aimed to model the short-term effects of such a ``shock’’ on public opinions and discussion topics. Specifically, our study addresses three key questions: first, we investigate how public sentiment toward Donald Trump shifts over time and across regions (RQ1) and examine whether the assassination attempt itself significantly affects public attitudes, independent of the existing political alignments (RQ2). Finally, we explore the major themes in online conversations before and after the crisis, illustrating how discussion topics evolved in response to this politically charged event (RQ3). By integrating large language model-based sentiment analysis, difference-in-differences modeling, and topic modeling techniques, we find that following the attempt the public response was broadly sympathetic to Trump rather than polarizing, despite baseline ideological and regional disparities.

[NLP-21] Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources COLING

[Quick Read]: This paper addresses the risk of disappearance facing Argentina's Indigenous languages, whose loss would mean a significant loss of world heritage and cultural knowledge; no unified information on speakers or computational tools is currently available for these languages. The key contributions are a systematization of the Indigenous languages spoken in Argentina, classified into seven families (Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon) together with national demographic data on the country's Indigenous population, and an introductory survey of the computational resources available for these languages, whether or not they were developed specifically for Argentine varieties.

Link: https://arxiv.org/abs/2501.09943
Authors: Belu Ticona, Fernando Carranza, Viviana Cotik
Affiliations: George Mason University, United States; Departamento de Computación, FCEyN, Universidad de Buenos Aires (UBA), Argentina; Departamento de Letras, FFyL, UBA, Argentina; Instituto de Filología y Literaturas Hispánicas “Dr. Amado Alonso”, UBA, Argentina; Instituto de Investigación en Ciencias de la Computación (ICC), CONICET-UBA, Argentina
Subjects: Computation and Language (cs.CL)
Comments: Accepted to COLING Main 2025

Abstract:Argentina has a diverse, yet little-known, Indigenous language heritage. Most of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, no unified information on speakers and computational tools is available for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, along with national demographic data on the country’s Indigenous population. The languages are classified into seven families: Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon. We also provide an introductory survey of the computational resources available for these languages, whether or not they are specifically developed for Argentine varieties.

[NLP-22] Passage Segmentation of Documents for Extractive Question Answering

[Quick Read]: This paper argues that the chunking step in Retrieval-Augmented Generation (RAG) pipelines for open-domain question answering receives insufficient attention relative to the retrieval and synthesis components, even though it is critical to the performance of both dense passage retrieval and the end-to-end RAG pipeline. The authors introduce the Logits-Guided Multi-Granular Chunker (LGMGC), a novel framework that splits long documents into contextualized, self-contained chunks of varied granularity. Evaluated on two benchmark datasets, LGMGC not only improves the retrieval step but also outperforms existing chunking methods when integrated into a RAG pipeline.

Link: https://arxiv.org/abs/2501.09940
Authors: Zuhong Liu, Charles-Elie Simon, Fabien Caspani
Affiliations: BNP Paribas CIB; Unknown; Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has proven effective in open-domain question answering. However, the chunking process, which is essential to this pipeline, often receives insufficient attention relative to retrieval and synthesis components. This study emphasizes the critical role of chunking in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline. We then introduce the Logits-Guided Multi-Granular Chunker (LGMGC), a novel framework that splits long documents into contextualized, self-contained chunks of varied granularity. Our experimental results, evaluated on two benchmark datasets, demonstrate that LGMGC not only improves the retrieval step but also outperforms existing chunking methods when integrated into a RAG pipeline.
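
A minimal sketch of multi-granularity chunking (coarse parent chunks split into finer child chunks that keep a pointer to their surrounding context); the character-count splitting below is a placeholder for LGMGC's logits-guided splitting criterion, which is not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    parent: str  # surrounding context the chunk came from

def multi_granular_chunks(doc: str, parent_size=1024, child_size=256):
    """Split a document into coarse parent chunks and finer child chunks."""
    chunks = []
    for i in range(0, len(doc), parent_size):
        parent = doc[i:i + parent_size]
        for j in range(0, len(parent), child_size):
            chunks.append(Chunk(text=parent[j:j + child_size], parent=parent))
    return chunks

# Child chunks are indexed for retrieval; the parent supplies context to the LLM.
doc = "word " * 1000
for c in multi_granular_chunks(doc)[:3]:
    print(len(c.text), len(c.parent))
```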

[NLP-23] Steering Large Language Models with Feature Guided Activation Additions

[Quick Read]: This paper targets effective and reliable control over large language model (LLM) behavior. Activation steering methods, which add steering vectors to a model's hidden states, are promising, but existing techniques often lack precision and interpretability in how they influence model outputs. The authors introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that combines insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). FGAA operates in the latent space of a Sparse Autoencoder (SAE) and employs optimization techniques to select desired SAE features, constructing precise steering vectors that provide better steering effects while maintaining the coherence of steered model outputs. Evaluations on Gemma-2-2B and Gemma-2-9B across various steering tasks show that FGAA outperforms CAA, SAE decoder steering, and SAE-TS. The results also highlight an important trade-off between steering scale and general model capabilities that is consistent across all tested steering methods.

Link: https://arxiv.org/abs/2501.09929
Authors: Samuel Soo, Wesley Teng, Chandrasekaran Balaganesh
Affiliations: Raffles Institution
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 main-text pages, 14 appendix pages

Abstract:Effective and reliable control over large language model (LLM) behavior is a significant challenge. While activation steering methods, which add steering vectors to a model’s hidden states, are a promising approach, existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that leverages insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise steering vectors that provide better steering effects while maintaining coherence of steered model outputs. In this regard, evaluations on Gemma-2-2B and Gemma-2-9B models across various steering tasks demonstrate that FGAA outperforms existing steering methods of CAA, SAE decoder steering, and SAE-TS. Our results also highlight important trade-offs between steering scale and general model capabilities that are consistent across all tested steering methods.
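
A minimal sketch of the underlying activation-addition mechanism (adding a steering vector to a chosen layer's hidden states via a forward hook); how FGAA constructs the vector from SAE features is not shown, and the toy linear layer stands in for a transformer block:

```python
import torch

def add_steering_hook(layer: torch.nn.Module, vector: torch.Tensor, scale: float):
    """Register a forward hook that adds `scale * vector` to hidden states."""
    def hook(module, inputs, output):
        # Assumes the module outputs a (batch, seq, hidden) tensor;
        # returning a value from the hook replaces the module's output.
        return output + scale * vector
    return layer.register_forward_hook(hook)

layer = torch.nn.Linear(16, 16)          # stand-in for a transformer block
steering_vector = torch.randn(16)
handle = add_steering_hook(layer, steering_vector, scale=4.0)
out = layer(torch.randn(2, 5, 16))       # hidden states shifted by the vector
handle.remove()                          # stop steering
```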

[NLP-24] Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs SIGMOD2025

[Quick Read]: This paper addresses the fact that dialogue benchmarks have traditionally been created manually from documents, neglecting the potential of knowledge graphs (KGs) to automate the process; some question-answering benchmarks are generated automatically from KGs with extensive preprocessing, but they do not support dialogue generation. The proposed Chatty-Gen is a multi-stage retrieval-augmented generation platform that automatically generates high-quality dialogue benchmarks tailored to a specific domain using a KG. Its key ideas are decomposing generation into manageable stages with assertion rules for automatic validation between stages, which controls intermediate results and prevents time-consuming restarts due to hallucinations while reducing reliance on costly, more powerful commercial LLMs, and an efficient query-based retrieval that finds representative subgraphs based on the dialogue context instead of preprocessing the entire KG upfront. Experiments with several real and large KGs show that Chatty-Gen significantly outperforms state-of-the-art systems and maintains consistent model and system performance across LLMs of diverse capabilities, such as GPT-4o, Gemini 1.5, Llama 3, and Mistral.

Link: https://arxiv.org/abs/2501.09928
Authors: Reham Omar, Omij Mangukiya, Essam Mansour
Affiliations: Concordia University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The paper is published in SIGMOD 2025

Abstract:Dialogue benchmarks are crucial in training and evaluating chatbots engaging in domain-specific conversations. Knowledge graphs (KGs) represent semantically rich and well-organized data spanning various domains, such as DBLP, DBpedia, and YAGO. Traditionally, dialogue benchmarks have been manually created from documents, neglecting the potential of KGs in automating this process. Some question-answering benchmarks are automatically generated using extensive preprocessing from KGs, but they do not support dialogue generation. This paper introduces Chatty-Gen, a novel multi-stage retrieval-augmented generation platform for automatically generating high-quality dialogue benchmarks tailored to a specific domain using a KG. Chatty-Gen decomposes the generation process into manageable stages and uses assertion rules for automatic validation between stages. Our approach enables control over intermediate results to prevent time-consuming restarts due to hallucinations. It also reduces reliance on costly and more powerful commercial LLMs. Chatty-Gen eliminates upfront processing of the entire KG using efficient query-based retrieval to find representative subgraphs based on the dialogue context. Our experiments with several real and large KGs demonstrate that Chatty-Gen significantly outperforms state-of-the-art systems and ensures consistent model and system performance across multiple LLMs of diverse capabilities, such as GPT-4o, Gemini 1.5, Llama 3, and Mistral.

[NLP-25] Bridging Language Barriers in Healthcare: A Study on Arabic LLMs

[Quick Read]: This paper investigates the challenges of developing large language models (LLMs) proficient in both multilingual understanding and medical knowledge. It shows that simply translating medical data does not guarantee strong performance on clinical tasks in the target language, and that the optimal language mix in training data varies significantly across medical tasks. Larger models with carefully calibrated language ratios achieve superior performance on native-language clinical tasks. The results further suggest that relying solely on fine-tuning may not be the most effective way to incorporate new language knowledge into LLMs; data- and compute-intensive pretraining methods may still be necessary to reach optimal performance in multilingual medical settings. These findings provide valuable guidance for building effective and inclusive medical AI systems for diverse linguistic communities.

Link: https://arxiv.org/abs/2501.09825
Authors: Nada Saadi, Tathagata Raha, Clément Christophe, Marco AF Pimentel, Ronnie Rajan, Praveen K Kanithi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper investigates the challenges of developing large language models (LLMs) proficient in both multilingual understanding and medical knowledge. We demonstrate that simply translating medical data does not guarantee strong performance on clinical tasks in the target language. Our experiments reveal that the optimal language mix in training data varies significantly across different medical tasks. We find that larger models with carefully calibrated language ratios achieve superior performance on native-language clinical tasks. Furthermore, our results suggest that relying solely on fine-tuning may not be the most effective approach for incorporating new language knowledge into LLMs. Instead, data and computationally intensive pretraining methods may still be necessary to achieve optimal performance in multilingual medical settings. These findings provide valuable guidance for building effective and inclusive medical AI systems for diverse linguistic communities.

[NLP-26] Qwen it detect machine-generated text?

[Quick Read]: This paper describes the Unibuc - NLP team's approach to Task 1 of the Coling 2025 GenAI Workshop: Binary Multilingual Machine-Generated Text Detection. The team explored both masked language models and causal models. On Subtask A, their best model placed first out of 36 teams on F1 Micro (auxiliary score) at 0.8333 and second on F1 Macro (main score) at 0.8301. The key was comparing and tuning the two model families to detect machine-generated text effectively in a multilingual setting.

Link: https://arxiv.org/abs/2501.09813
Authors: Teodor-George Marchitan, Claudiu Creanga, Liviu P. Dinu
Affiliations: Faculty of Mathematics and Computer Science, University of Bucharest; Interdisciplinary School of Doctoral Studies, University of Bucharest; HLT Research Center, University of Bucharest
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper describes the approach of the Unibuc - NLP team in tackling the Coling 2025 GenAI Workshop, Task 1: Binary Multilingual Machine-Generated Text Detection. We explored both masked language models and causal models. For Subtask A, our best model achieved first place out of 36 teams on F1 Micro (Auxiliary Score) with 0.8333, and second place on F1 Macro (Main Score) with 0.8301.

[NLP-27] Enhancing Generalization in Chain of Thought Reasoning for Smaller Models

[Quick Read]: This paper addresses the challenge of Chain-of-Thought (CoT) reasoning in smaller language models, which is highly desirable for real-world applications that demand strong generalization. Existing CoT knowledge distillation methods often suffer from overly conservative memorization in smaller LLMs, leading to low generalization confidence. Since fully preserving the teacher model's CoT ability is impossible, the authors hypothesize that adversarial CoT fine-tuning is crucial for developing smaller LLMs with robust CoT generalization. They propose PRompt-Assisted Domain-Adversarial fine-tuning (PRADA), a principled framework that integrates diverse CoT domains and pioneers two improvements: (1) recovering the domain-invariant features typically lost during distillation via domain-adversarial fine-tuning, and (2) enhancing the domain adaptability of CoT prompt engineering through domain-adversarial approaches. Theoretical analysis and experiments show that PRADA significantly outperforms the state of the art across a wide range of tasks, and that smaller LLMs using PRADA align closely with domain knowledge, improving the explainability of the approach.

Link: https://arxiv.org/abs/2501.09804
Authors: Maxwell J. Yin, Dingyi Jiang, Yongbing Chen, Boyu Wang, Charles Ling
Affiliations: Western University, London, ON, Canada; Wenzhou Academy of Agricultural Sciences, Wenzhou, China
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-Thought (CoT) reasoning in smaller language models is a challenging natural language process problem yet highly desirable in many real-life applications. Existing CoT knowledge distillation methods often suffer from overly conservative memorization in smaller LLMs, leading to low generalization confidence. As fully preserving the CoT ability of teacher model is impossible, we hypothesize that adversarial CoT fine-tuning is crucial for developing smaller LLM with robust CoT generalization. To this end, we propose PRompt-Assisted Domain-Adversarial fine-tuning (PRADA), a principled fine-tuning framework that integrates diverse CoT domains. Specifically, PRADA pioneers two CoT improvements in smaller LLM: (1) Recovering the domain-invariant feature insight which typically lost during distillation with domain adversarial fine-tuning; (2) Enhancing the domain adaptability of CoT prompt engineering by employing domain-adversarial approaches. We theoretically demonstrate the effectiveness of our approach and empirically show that it significantly outperforms the state of the arts in a wide range of tasks. Moreover, our empirical findings reveal that the smaller LLM, when leveraging PRADA, aligns closely with domain knowledge, thereby improving the explainability of our approach.
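
Domain-adversarial training of the kind the abstract names is commonly implemented with a gradient reversal layer (as in DANN); a minimal sketch of that generic mechanism, not necessarily PRADA's exact construction:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Features pass unchanged to a domain classifier, but gradients are flipped,
# pushing the feature extractor toward domain-invariant representations.
features = torch.randn(8, 32, requires_grad=True)
domain_logits = torch.nn.Linear(32, 4)(grad_reverse(features))
```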

[NLP-28] Conversational Text Extraction with Large Language Models Using Retrieval-Augmented Systems

[Quick Read]: This paper presents a system that leverages Large Language Models (LLMs) to extract text from PDF documents and enhance user interaction through a conversational interface, built on Retrieval-Augmented Generation (RAG). Upon upload, the system processes the PDF and uses sentence embeddings to create a document-specific vector store, enabling efficient retrieval of relevant passages in response to user queries. The LLM then engages in a conversational exchange, using the retrieved information to extract text and generate comprehensive, context-aware answers while highlighting relevant passages within the PDF. The approach achieves ROUGE values competitive with existing state-of-the-art text extraction and summarization techniques, offering researchers, students, and anyone seeking document insights a tool to efficiently extract knowledge through an intuitive question-answering interface.

Link: https://arxiv.org/abs/2501.09801
Authors: Soham Roy, Mitul Goswami, Nisharg Nargund, Suneeta Mohanty, Prasant Kumar Pattnaik
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:This study introduces a system leveraging Large Language Models (LLMs) to extract text and enhance user interaction with PDF documents via a conversational interface. Utilizing Retrieval-Augmented Generation (RAG), the system provides informative responses to user inquiries while highlighting relevant passages within the PDF. Upon user upload, the system processes the PDF, employing sentence embeddings to create a document-specific vector store. This vector store enables efficient retrieval of pertinent sections in response to user queries. The LLM then engages in a conversational exchange, using the retrieved information to extract text and generate comprehensive, contextually aware answers. While our approach demonstrates competitive ROUGE values compared to existing state-of-the-art techniques for text extraction and summarization, we acknowledge that further qualitative evaluation is necessary to fully assess its effectiveness in real-world applications. The proposed system gives competitive ROUGE values as compared to existing state-of-the-art techniques for text extraction and summarization, thus offering a valuable tool for researchers, students, and anyone seeking to efficiently extract knowledge and gain insights from documents through an intuitive question-answering interface.
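
A minimal sketch of the retrieval core described above (sentence embeddings plus cosine similarity over document chunks); the embedding model name and in-memory "vector store" are illustrative stand-ins for the system's actual components:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

chunks = [
    "RAG combines retrieval with generation.",
    "Sentence embeddings map text to dense vectors.",
    "ROUGE measures n-gram overlap with references.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    """Return the top-k chunks most similar to the query (cosine similarity)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q            # cosine, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [(chunks[i], float(scores[i])) for i in top]

# The retrieved passages are then placed into the LLM prompt to answer the query.
print(retrieve("What do sentence embeddings do?"))
```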
zh

[NLP-29] Computing Optimization-Based Prompt Injections Against Closed-Weights Models By Misusing a Fine-Tuning API

【速读】: 该论文探讨了封闭权重大语言模型(LLMs)面临的新威胁,即攻击者可以通过优化提示注入(optimization-based prompt injections)来利用远程微调接口返回的损失信息(loss-like information)来指导对抗性提示的搜索。关键解决方案在于,攻击者利用LLM供应商提供的微调接口,通过贪婪搜索算法(greedy search algorithm)对对抗性提示进行离散优化,从而成功实施攻击。实验分析表明,Gemini微调API返回的损失信息为这种优化提供了有效信号,攻击成功率在Google的Gemini系列LLMs上达到65%至82%。这一研究揭示了实用性与安全性之间的经典权衡:微调接口虽然为开发者提供了实用功能,但也使LLMs暴露于强大的攻击之下。
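
这种利用远程损失信号做离散优化的思路可以用下面的概念草图示意(非论文官方实现;query_loss 是对"调用一次微调接口并读取类损失值"的假设性封装,替换位置与候选采样策略亦为示例):

```python
# 贪心坐标搜索对抗性提示的示意:每轮随机挑一个位置,
# 在若干候选 token 中保留使远程"类损失值"最小的替换。
import random

def greedy_prompt_search(base_tokens, vocab, query_loss,
                         rounds=50, n_candidates=16):
    best = list(base_tokens)
    best_loss = query_loss(" ".join(best))
    for _ in range(rounds):
        pos = random.randrange(len(best))            # 待替换的位置
        for tok in random.sample(vocab, n_candidates):
            cand = best.copy()
            cand[pos] = tok
            loss = query_loss(" ".join(cand))        # 一次远程查询
            if loss < best_loss:                     # 只接受更优替换
                best, best_loss = cand, loss
    return best, best_loss
```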

链接: https://arxiv.org/abs/2501.09798
作者: Andrey Labunets,Nishit V. Pandya,Ashish Hooda,Xiaohan Fu,Earlence Fernandes
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned from the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor and allows developers to fine-tune LLMs for their tasks, thus providing utility, but also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google’s Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff - the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
zh

[NLP-30] Sentiment Analysis in Twitter Social Network Centered on Cryptocurrencies Using Machine Learning

【速读】: 该论文试图解决的问题是分析伊朗用户在Twitter社交网络上对加密货币(Cryptocurrency)的情绪和观点,并提供一个最佳的情绪分类模型。由于加密货币的去中心化特性对传统货币体系和资本市场可能产生重大影响,理解公众意见尤为重要。Twitter作为一个主要的讨论平台,能够快速且低成本地反映社区意见。论文的关键解决方案是使用自然语言处理(NLP)技术,包括词袋模型(Bag of Words, BOW)和FastText进行文本向量化,并结合经典机器学习算法(如KNN、SVM和Adaboost)以及深度学习模型(如LSTM和BERT)进行情绪分类。最终,BERT模型在情绪分类任务中表现最佳,准确率达到83.50%。这一解决方案为经济领域的管理者和决策者提供了从公众角度理解加密货币现象的工具,并有助于更好地管理这一新兴技术。
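
文中经典机器学习路线可用如下最小示例说明(scikit-learn 的"词袋 + 线性 SVM"流水线;texts 与 labels 为假设的已预处理推文及标签,仅示意其中一种组合):

```python
# 词袋向量化 + 线性 SVM 的情绪分类流水线示意。
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_bow_svm(texts, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, random_state=0)
    clf = make_pipeline(CountVectorizer(), LinearSVC())
    clf.fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    return clf
```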

链接: https://arxiv.org/abs/2501.09777
作者: Vahid Amiri,Mahmood Ahmadi
机构: Computer and Information Technology Department, Razi University (拉齐大学计算机与信息技术系); Razi University (拉齐大学)
类目: Computation and Language (cs.CL)
备注: 6 pages and 5 figures

点击查看摘要

Abstract:Cryptocurrency is a digital currency that uses blockchain technology with secure encryption. Due to the decentralization of these currencies, they can influence traditional monetary systems and the capital markets of a society; given the importance of the issue, the need to understand public opinion and analyze people’s views in this regard is growing. Social networks are a rich source of opinions on different topics, and the Twitter social network is one of the main platforms where users discuss various subjects, so community opinion can be measured there in the shortest time and at the lowest cost. Twitter Sentiment Analysis (TSA) is a field that analyzes the sentiment expressed in tweets. Considering that most of TSA’s research efforts on cryptocurrencies are focused on the English language, the purpose of this paper is to investigate the opinions of Iranian users on the Twitter social network about cryptocurrencies and provide the best model for classifying tweets based on sentiment. With automatic analysis of tweets, managers and officials in the field of economy can learn the general public’s point of view about this issue and use the information obtained to properly manage this phenomenon. For this purpose, to build sentiment classification models, this paper uses natural language processing techniques such as bag of words (BOW) and FastText for text vectorization, together with classical machine learning algorithms (KNN, SVM, and AdaBoost) and deep learning methods (LSTM and BERT) for classification; the BERT language model achieved the best accuracy at 83.50%.
zh

[NLP-31] Multiple Choice Questions: Reasoning Makes Large Language Models (LLM s) More Self-Confident Even When They Are Wrong

【速读】: 该论文旨在研究大语言模型(LLMs)在回答多选题(MCQ)时的置信度(confidence)如何受到回答方式的影响,特别是直接回答问题与提供推理链(chain of thought)后再回答的区别。论文通过评估七个不同模型在广泛主题上的表现,发现LLMs在提供推理后再回答时,无论答案是否正确,其置信度都更高。这一现象可能与推理过程修改了答案选择的概率有关,因为LLMs基于输入问题和推理支持来预测答案。因此,论文指出LLMs的估计概率存在内在局限性,需在评估过程中加以理解。这一发现与人类行为相似,即解释答案会增加对其正确性的信心。
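
论文所考察的"所选答案的置信度"的一种常见计算方式,是把各选项 token 的概率归一化后取所选项的占比,如下小例所示(option_logprobs 的数值仅为演示):

```python
# 由选项 token 的对数概率计算对所选答案的归一化置信度。
import math

def answer_confidence(option_logprobs, chosen):
    probs = {k: math.exp(v) for k, v in option_logprobs.items()}
    return probs[chosen] / sum(probs.values())

# 对"直接作答"与"先推理再作答"两种设置,分别在答案位置读取
# 选项分布并比较置信度,即可复现论文的对比方式。
print(answer_confidence({"A": -0.2, "B": -2.0, "C": -3.0, "D": -3.5}, "A"))
```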

链接: https://arxiv.org/abs/2501.09775
作者: Tairan Fu,Javier Conde,Gonzalo Martínez,María Grandury,Pedro Reviriego
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Universidad Politécnica de Madrid (马德里理工大学); Universidad Carlos III de Madrid (马德里卡洛斯三世大学); SomosNLP/Universidad Politécnica de Madrid (SomosNLP/马德里理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:One of the most widely used methods to evaluate LLMs are Multiple Choice Question (MCQ) tests. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale as the results can be processed automatically. To help the LLM answer, a few examples called few shots can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how the LLM confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of the evaluation of questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning that modifies the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
zh

[NLP-32] Can Large Language Models Predict the Outcome of Judicial Decisions?

【速读】: 该论文试图解决在低资源语言(如阿拉伯语)中,特别是在法律判决预测(Legal Judgment Prediction, LJP)任务中,大语言模型(Large Language Models, LLMs)应用不足的问题。解决方案的关键在于开发了一个阿拉伯语LJP数据集,该数据集从沙特商业法庭判决中收集并预处理。研究团队对包括LLaMA-3.2-3B和LLaMA-3.1-8B在内的开源大语言模型进行了基准测试,测试了零样本(zero-shot)、单样本(one-shot)和使用QLoRA进行微调(fine-tuning)的不同配置。此外,研究采用了结合定量指标(如BLEU和ROUGE)和定性评估(如连贯性、法律语言和清晰度)的综合评估框架。结果表明,在任务特定情境下,经过微调的较小模型能够达到与较大模型相当的性能,同时显著提高了资源效率。研究还探讨了提示工程(prompt engineering)和微调对模型输出的影响,为性能变异性和指令敏感性提供了见解。通过公开数据集、实现代码和模型,该研究为未来阿拉伯语法律自然语言处理(NLP)研究奠定了坚实基础。
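
论文使用的 QLoRA 微调可用下面的配置草图示意(transformers/peft/bitsandbytes 的常见用法;模型名与各超参数均为示例,并非论文的原始设置):

```python
# QLoRA 示意:4-bit 量化加载基座模型,仅训练低秩适配器参数。
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb)  # 模型名仅作占位
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # 可训练参数仅占极小比例
```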

链接: https://arxiv.org/abs/2501.09768
作者: Mohamed Bayan Kmainasi,Ali Ezzat Shahroor,Amani Al-Ghraibah
机构: Qatar University(卡塔尔大学); OUC in partnership with LJMU(OUC与利物浦约翰摩尔斯大学合作); Al-Ahliyya Amman University(阿赫利亚安曼大学); Qatar Computing Research Institute (QCRI)(卡塔尔计算研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown exceptional capabilities in Natural Language Processing (NLP) across diverse domains. However, their application in specialized tasks such as Legal Judgment Prediction (LJP) for low-resource languages like Arabic remains underexplored. In this work, we address this gap by developing an Arabic LJP dataset, collected and preprocessed from Saudi commercial court judgments. We benchmark state-of-the-art open-source LLMs, including LLaMA-3.2-3B and LLaMA-3.1-8B, under varying configurations such as zero-shot, one-shot, and fine-tuning using QLoRA. Additionally, we used a comprehensive evaluation framework combining quantitative metrics (BLEU and ROUGE) and qualitative assessments (Coherence, legal language, clarity). Our results demonstrate that fine-tuned smaller models achieve comparable performance to larger models in task-specific contexts while offering significant resource efficiency. Furthermore, we investigate the effects of prompt engineering and fine-tuning on model outputs, providing insights into performance variability and instruction sensitivity. By making the dataset, implementation code, and models publicly available, we establish a robust foundation for future research in Arabic legal NLP.
zh

[NLP-33] LeMo: Enabling LEss Token Involvement for MOre Context Fine-tuning

【速读】: 该论文旨在解决大语言模型(LLM)在处理长上下文应用时面临的内存消耗过高的问题,特别是激活内存(activation memory)的瓶颈。尽管现有的参数高效微调方法能够扩展上下文长度,但它们未能有效解决激活内存的限制。论文提出了一种名为LeMo的微调系统,通过探索和利用长上下文场景中固有的“上下文令牌稀疏性”(Contextual Token Sparsity)机制,减少冗余令牌的参与,从而优化内存使用。LeMo的关键技术包括:(1)令牌消除(Token Elimination),动态识别并排除冗余令牌;(2)模式预测(Pattern Prediction),利用训练好的预测器以最小开销近似令牌稀疏模式;(3)内核优化(Kernel Optimization),采用无置换和分段策略提升系统性能。实验表明,LeMo在减少内存消耗和加速模型训练方面显著优于现有技术。
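
"令牌消除"这一步可用下面的朴素草图说明(以嵌入范数充当信息量的粗略代理;论文实际使用训练好的预测器来近似稀疏模式,此处仅作概念演示):

```python
# 按信息量打分并仅保留 top-k 令牌参与后续层计算的示意。
import torch

def eliminate_tokens(hidden, keep_ratio=0.5):
    """hidden: (seq_len, dim) 的令牌嵌入;返回保留的子序列及其原始索引。"""
    scores = hidden.norm(dim=-1)                    # 信息量的粗略代理
    k = max(1, int(hidden.size(0) * keep_ratio))
    idx = scores.topk(k).indices.sort().values      # 保持原有顺序
    return hidden[idx], idx
```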

链接: https://arxiv.org/abs/2501.09767
作者: Tuowei Wang,Xingyu Chen,Kun Li,Ting Cao,Ju Ren,Yaoxue Zhang
机构: Tsinghua University(清华大学); HUST(华中科技大学); Microsoft Research(微软研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The escalating demand for long-context applications has intensified the necessity of extending the LLM context windows. Despite recent fine-tuning approaches successfully expanding context lengths, their high memory footprints, especially for activations, present a critical practical limitation. Current parameter-efficient fine-tuning methods prioritize reducing parameter update overhead over addressing activation memory constraints. Similarly, existing sparsity mechanisms improve computational efficiency but overlook activation memory optimization due to the phenomenon of Shadowy Activation. In this paper, we propose LeMo, the first LLM fine-tuning system that explores and exploits a new token-level sparsity mechanism inherent in long-context scenarios, termed Contextual Token Sparsity. LeMo minimizes redundant token involvement by assessing the informativeness of token embeddings while preserving model accuracy. Specifically, LeMo introduces three key techniques: (1) Token Elimination, dynamically identifying and excluding redundant tokens across varying inputs and layers. (2) Pattern Prediction, utilizing well-trained predictors to approximate token sparsity patterns with minimal overhead. (3) Kernel Optimization, employing permutation-free and segment-based strategies to boost system performance. We implement LeMo as an end-to-end fine-tuning system compatible with various LLM architectures and other optimization techniques. Comprehensive evaluations demonstrate that LeMo reduces memory consumption by up to 1.93x and achieves up to 1.36x speedups, outperforming state-of-the-art fine-tuning systems.
zh

[NLP-34] Boosting Tool Use of Large Language Models via Iterative Reinforced Fine-Tuning

【速读】: 该论文试图解决大语言模型(LLMs)在复杂任务中使用外部工具时表现不佳的问题。具体来说,尽管通过模拟真实世界生成工具使用数据可以有效提升模型能力,但随着数据规模的增加,训练效果的提升显著衰减。主要原因是模型在复杂场景中的表现不足,阻碍了通过监督微调(SFT)从数据中学习的能力。为解决这一问题,论文提出了一种迭代强化微调策略,通过反馈机制识别模型在复杂场景中的不足,并利用蒙特卡洛树搜索(Monte Carlo Tree Search)收集细粒度的偏好对来精确定位这些不足。随后,通过偏好优化更新策略模型,使其与真实情况对齐并避免与不足对齐。此外,论文还提出了一种从易到难的预热SFT策略,以帮助模型更好地从具有挑战性的数据中学习。实验结果表明,该方法在复杂工具使用场景中显著提升了训练效果,超越了同参数规模的模型以及许多更大的开源和闭源模型。
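
文中"与真实情况对齐并与不足错开"的偏好优化,可以用 DPO 风格的损失来示意(DPO 是一种常见的偏好优化形式,论文是否采用该确切形式未在摘要中说明;logp_*/ref_* 为策略模型与参考模型对"优选/劣选"回复的对数似然,均为假设输入):

```python
# DPO 风格偏好损失的示意:放大策略模型相对参考模型
# 在"优选回复"上的优势、压低其在"劣选回复"上的优势。
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```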

链接: https://arxiv.org/abs/2501.09766
作者: Yirong Zeng,Xiao Ding,Yuxian Wang,Weiwen Liu,Wu Ning,Yutai Hou,Xu Huang,Bing Qin,Ting Liu
机构: Harbin Institute of Technology SCIR Lab(哈尔滨工业大学社会计算与信息检索实验室); Huawei Technologies Co., Ltd(华为技术有限公司); Huawei Noah’s Ark Lab(华为诺亚方舟实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities. Effectively leveraging this potential for complex tasks hinges crucially on improving their ability to use tools. Synthesizing tool use data by simulating the real world is an effective approach. Nevertheless, our investigation reveals that training gains significantly decay as the scale of these data increases. The primary factor is the model’s poor performance (a.k.a deficiency) in complex scenarios, which hinders learning from data using SFT. Driven by this objective, we propose an iterative reinforced fine-tuning strategy to continually guide the model to alleviate it. Specifically, we first identify deficiency-related data based on feedback from the policy model, then perform a Monte Carlo Tree Search to collect fine-grained preference pairs to pinpoint deficiencies. Subsequently, we update the policy model using preference optimization to align with ground truth and misalign with deficiencies. This process can be iterated. Moreover, before the iteration, we propose an easy-to-hard warm-up SFT strategy to facilitate learning from challenging data. The experiments demonstrate our models go beyond the same parametric models, outperforming many larger open-source and closed-source models. Additionally, it has achieved notable training gains in complex tool use scenarios.
zh

[NLP-35] Enhancing the De-identification of Personally Identifiable Information in Educational Data

【速读】: 该论文旨在解决在教育技术中保护个人身份信息(PII, Personally Identifiable Information)的问题,特别是如何在匿名化敏感信息的同时保持教育数据的实用性。论文的核心解决方案是探索并评估GPT-4o-mini模型在PII检测任务中的表现。通过对比提示(prompting)和微调(fine-tuning)两种方法,研究发现微调后的GPT-4o-mini模型在CRAPII和TSCC两个公开数据集上表现出色,尤其是在召回率(recall)和精确度(precision)方面显著优于现有的框架(如Microsoft Presidio和Azure AI Language)。此外,微调后的模型在计算成本上大幅降低,且在不同文化背景和性别上的表现一致准确。这些结果表明,微调后的GPT-4o-mini模型是一种高效且成本低廉的PII检测工具,能够在保护隐私的同时保持数据的实用性。
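
PII 检测通常按实体区间统计精确率与召回率,下面给出一个最小化的评测草图(假设预测与标注均表示为字符偏移区间,并采用完全匹配口径;论文的具体评测口径未在摘要中给出):

```python
# 以区间完全匹配口径计算 PII 识别的 P/R/F1。
def span_prf(pred_spans, gold_spans):
    """pred_spans/gold_spans: {(start, end), ...} 的字符偏移区间集合。"""
    tp = len(set(pred_spans) & set(gold_spans))
    p = tp / len(pred_spans) if pred_spans else 0.0
    r = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```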

链接: https://arxiv.org/abs/2501.09765
作者: Y. Shen,Z. Ji,J. Lin,K. R. Koedinger
机构: Carnegie Mellon University(卡内基梅隆大学); The University of Hong Kong(香港大学); Monash University(莫纳什大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figure; This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Protecting Personally Identifiable Information (PII), such as names, is a critical requirement in learning technologies to safeguard student and teacher privacy and maintain trust. Accurate PII detection is an essential step toward anonymizing sensitive information while preserving the utility of educational data. Motivated by recent advancements in artificial intelligence, our study investigates the GPT-4o-mini model as a cost-effective and efficient solution for PII detection tasks. We explore both prompting and fine-tuning approaches and compare GPT-4o-mini’s performance against established frameworks, including Microsoft Presidio and Azure AI Language. Our evaluation on two public datasets, CRAPII and TSCC, demonstrates that the fine-tuned GPT-4o-mini model achieves superior performance, with a recall of 0.9589 on CRAPII. Additionally, fine-tuned GPT-4o-mini significantly improves precision scores (a threefold increase) while reducing computational costs to nearly one-tenth of those associated with Azure AI Language. Furthermore, our bias analysis reveals that the fine-tuned GPT-4o-mini model consistently delivers accurate results across diverse cultural backgrounds and genders. The generalizability analysis using the TSCC dataset further highlights its robustness, achieving a recall of 0.9895 with minimal additional training data from TSCC. These results emphasize the potential of fine-tuned GPT-4o-mini as an accurate and cost-effective tool for PII detection in educational data. It offers robust privacy protection while preserving the data’s utility for research and pedagogical analysis. Our code is available on GitHub: this https URL
zh

[NLP-36] OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

【速读】: 该论文试图解决基于大语言模型(Large Language Models)的机器写作中,检索增强生成(Retrieval-Augmented Generation)方法存在的局限性问题。具体而言,传统的检索信息往往缺乏深度、实用性,且存在冗余,导致生成的文章内容浅显、重复且缺乏原创性。为解决这些问题,论文提出了OmniThink框架,其核心思想是模拟人类学习者在逐步深入理解主题时的认知行为,通过迭代扩展和反思的过程来生成内容。实验结果表明,OmniThink在不损害文章连贯性和深度的前提下,显著提高了生成文章的知识密度。人类评估和专家反馈进一步证实了OmniThink在生成长篇内容时应对实际挑战的潜力。
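
"迭代扩展与反思"的控制流程可用如下伪实现示意(expand/reflect/enough 为假设的 LLM 调用封装,非论文官方代码):

```python
# OmniThink 式"扩展-反思"循环的概念示意。
def iterative_draft(topic, expand, reflect, enough, max_rounds=5):
    notes = [expand(topic)]             # 初始检索与扩展
    for _ in range(max_rounds):
        insight = reflect(notes)        # 反思:总结缺口、去除冗余
        if enough(insight):             # 知识密度足够则停止
            break
        notes.append(expand(insight))   # 沿缺口继续深入扩展
    return notes
```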

链接: https://arxiv.org/abs/2501.09751
作者: Zekun Xi,Wenbiao Yin,Jizhan Fang,Jialong Wu,Runnan Fang,Ningyu Zhang,Jiang Yong,Pengjun Xie,Fei Huang,Huajun Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model’s predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth, utility, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, repetitive, and unoriginal outputs. To address these issues, we propose OmniThink, a machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they progressively deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles.
zh

计算机视觉

[CV-0] FaceXBench: Evaluating Multimodal LLM s on Face Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂人脸理解任务中的能力尚未得到系统研究的问题。为了填补这一研究空白,作者提出了FaceXBench,一个全面的基准测试工具,用于评估MLLMs在复杂人脸理解任务中的表现。FaceXBench包含5000个多模态选择题,这些题目来源于25个公共数据集和一个新创建的数据集FaceXAPI,涵盖了6大类14项任务,包括偏见与公平性、人脸认证、识别、分析、定位和工具检索等。通过对26个开源MLLMs和2个专有模型的广泛评估,作者揭示了复杂人脸理解任务中的独特挑战,并分析了模型在零样本、上下文任务描述和思维链提示三种评估设置下的表现。研究结果表明,即使是先进的模型如GPT-4o和GeminiPro 1.5,在复杂人脸理解任务中仍有显著的改进空间。FaceXBench将成为开发具备复杂人脸理解能力的MLLMs的重要资源。

链接: https://arxiv.org/abs/2501.10360
作者: Kartik Narayan,Vibashan VS,Vishal M. Patel
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs’ face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o, and GeminiPro 1.5, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding. Code: this https URL
zh

[CV-1] Zero-Shot Monocular Scene Flow Estimation in the Wild

【速读】:该论文试图解决场景流(scene flow)预测中的泛化能力不足问题。尽管场景流在许多应用中具有广泛潜力,但由于现有预测模型在跨数据集上的泛化能力较差,导致其在实际应用中未被广泛采用。论文提出了三个关键挑战的解决方案:首先,开发了一种联合估计几何和运动的方法,以提高预测的准确性;其次,通过一种数据生成方法缓解了场景流数据稀缺的问题,生成了100万个跨多样合成场景的标注训练样本;最后,评估了不同的场景流参数化方法,并采用了一种自然且有效的参数化方式。最终,该模型在3D端点误差(3D end-point error)上优于现有方法和大规模模型基线,并在DAVIS的随意捕捉视频和RoboTAP的机器人操作场景中展示了零样本泛化能力。总体而言,该研究使场景流预测在真实场景中更具实用性。

链接: https://arxiv.org/abs/2501.10357
作者: Yiqing Liang,Abhishek Badki,Hang Su,James Tompkin,Orazio Gallo
机构: NVIDIA Research; Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Website: this https URL

点击查看摘要

Abstract:Large models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such general models exist for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We identify three key challenges and propose solutions for them. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and adopt a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on large-scale models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, our approach makes scene flow prediction more practical in-the-wild.
zh

[CV-2] 3rd Workshop on Maritime Computer Vision (MaCVi) 2025: Challenge Results

【速读】:该论文旨在探讨和解决无人水面艇(USV)和水下环境中的计算机视觉问题,重点关注第三届海上计算机视觉研讨会(MaCVi 2025)中的挑战。通过对超过700份提交的统计和定性分析,论文提供了对这些挑战的全面评估。解决方案的关键在于公开所有数据集、评估代码和排行榜,以便研究人员能够访问和利用这些资源进行进一步的研究和开发。

链接: https://arxiv.org/abs/2501.10343
作者: Benjamin Kiefer,Lojze Žust,Jon Muhovič,Matej Kristan,Janez Perš,Matija Teršek,Uma Mudenagudi Chaitra Desai,Arnold Wiliem,Marten Kreis,Nikhil Akalwadi,Yitong Quan,Zhiqiang Zhong,Zhe Zhang,Sujie Liu,Xuran Chen,Yang Yang,Matej Fabijanić,Fausto Ferreira,Seongju Lee,Junseok Lee,Kyoobin Lee,Shanliang Yao,Runwei Guan,Xiaoyu Huang,Yi Ni,Himanshu Kumar,Yuan Feng,Yi-Ching Cheng,Tzu-Yu Lin,Chia-Ming Lee,Chih-Chung Hsu,Jannik Sheikh,Andreas Michel,Wolfgang Gross,Martin Weinmann,Josip Šarić,Yipeng Lin,Xiang Yang,Nan Jiang,Yutang Lu,Fei Feng,Ali Awad,Evan Lucas,Ashraf Saleem,Ching-Heng Cheng,Yu-Fan Lin,Tzu-Yu Lin,Chih-Chung Hsu
机构: LOOKOUT; University of Tuebingen(蒂宾根大学); University of Ljubljana(卢布尔雅那大学); Luxonis; Shield AI; Queensland University of Technology(昆士兰科技大学); Center of Excellence in Visual Intelligence, KLE Technological University(KLE科技大学视觉智能卓越中心); Nanjing University of Science and Technology(南京理工大学); University of Zagreb Faculty of Electrical Engineering and Computing(萨格勒布大学电气工程与计算学院); Gwangju Institute of Science and Technology (GIST)(光州科学技术院); University of Liverpool(利物浦大学); Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Xi’an Jiaotong-Liverpool University(西交利物浦大学); Independent Researcher(独立研究员); Dalian Maritime University, School of Marine Engineering(大连海事大学海洋工程学院); National Cheng Kung University(国立成功大学); Fraunhofer IOSB(弗劳恩霍夫IOSB); Yancheng Institute of Technology(盐城工学院); Nanjing University(南京大学); Beijing University of Posts and Telecommunications(北京邮电大学); Michigan Technological University(密歇根理工大学); Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Part of the MaCVi 2025 workshop

点击查看摘要

Abstract:The 3rd Workshop on Maritime Computer Vision (MaCVi) 2025 addresses maritime computer vision for Unmanned Surface Vehicles (USV) and underwater. This report offers a comprehensive overview of the findings from the challenges. We provide both statistical and qualitative analyses, evaluating trends from over 700 submissions. All datasets, evaluation code, and the leaderboard are available to the public at this https URL.
zh

[CV-3] DiffStereo: High-Frequency Aware Diffusion Model for Stereo Image Restoration

【速读】:该论文试图解决扩散模型(Diffusion Models, DMs)在立体图像恢复(stereo image restoration)中的应用问题。具体来说,现有扩散模型在图像恢复中表现良好,但在立体图像恢复中面临两个主要挑战:一是需要重建两幅图像,增加了计算成本;二是现有的潜在扩散模型(latent DMs)通常关注语义信息,并在潜在压缩过程中去除高频细节,而这些高频细节恰恰是图像恢复的关键。为解决这些问题,论文提出了一种高频感知扩散模型(DiffStereo),首次将扩散模型应用于立体图像恢复领域。DiffStereo的关键在于首先学习高质量图像的高频潜在表示(Latent High-Frequency Representations, LHFR),然后在学习到的空间中对扩散模型进行训练,以估计立体图像的LHFR。这些LHFR随后被融合到一个基于Transformer的立体图像恢复网络中,提供对应高质量图像的有益高频信息。此外,LHFR的分辨率与输入图像保持一致,保留了固有的纹理信息,同时通过通道压缩减轻了扩散模型的计算负担。论文还设计了一种位置编码方案,将LHFR集成到恢复网络中,从而在不同深度的恢复网络中提供独特的指导。实验结果表明,DiffStereo在立体超分辨率、去模糊和低光增强任务中,相比现有方法在重建精度和感知质量上均有显著提升。

链接: https://arxiv.org/abs/2501.10325
作者: Huiyun Cao,Yuan Shi,Bin Xia,Xiaoyu Jin,Wenming Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Diffusion models (DMs) have achieved promising performance in image restoration but haven’t been explored for stereo images. The application of DM in stereo image restoration is confronted with a series of challenges. The need to reconstruct two images exacerbates DM’s computational cost. Additionally, existing latent DMs usually focus on semantic information and remove high-frequency details as redundancy during latent compression, which is precisely what matters for image restoration. To address the above problems, we propose a high-frequency aware diffusion model, DiffStereo for stereo image restoration as the first attempt at DM in this domain. Specifically, DiffStereo first learns latent high-frequency representations (LHFR) of HQ images. DM is then trained in the learned space to estimate LHFR for stereo images, which are fused into a transformer-based stereo image restoration network providing beneficial high-frequency information of corresponding HQ images. The resolution of LHFR is kept the same as input images, which preserves the inherent texture from distortion. And the compression in channels alleviates the computational burden of DM. Furthermore, we devise a position encoding scheme when integrating the LHFR into the restoration network, enabling distinctive guidance in different depths of the restoration network. Comprehensive experiments verify that by combining generative DM and transformer, DiffStereo achieves both higher reconstruction accuracy and better perceptual quality on stereo super-resolution, deblurring, and low-light enhancement compared with state-of-the-art methods.
zh

[CV-4] New Fashion Products Performance Forecasting: A Survey on Evolutions Models and Emerging Trends

【速读】:该论文试图解决快时尚行业(fast fashion industry)中由于过度生产、浪费和有害化学品使用导致的环境负担问题。具体而言,论文聚焦于新时尚产品性能预测(New Fashion Products Performance Forecasting, NFPPF)这一关键挑战,旨在通过基于学习的预测分析(learning-based predictive analytics)来优化生产流程、减少生态足迹,并推动可持续发展。解决方案的关键在于利用多模态信息(multimodal information)和先进的数据集,结合多种学习方法,以应对消费者偏好动态变化、文化变迁和突发事件对时尚趋势的复杂影响。论文还提出了首个涵盖NFPPF学习全景的分类法,并系统性地总结了现有方法和未来研究方向。

链接: https://arxiv.org/abs/2501.10324
作者: Andrea Avogaro,Luigi Capogrosso,Andrea Toaiari,Franco Fummi,Marco Cristani
机构: Dept. of Engineering for Innovation Medicine, University of Verona (创新医学工程系,维罗纳大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the Springer Nature Computer Science journal

点击查看摘要

Abstract:The fast fashion industry’s insatiable demand for new styles and rapid production cycles has led to a significant environmental burden. Overproduction, excessive waste, and harmful chemicals have contributed to the negative environmental impact of the industry. To mitigate these issues, a paradigm shift that prioritizes sustainability and efficiency is urgently needed. Integrating learning-based predictive analytics into the fashion industry represents a significant opportunity to address environmental challenges and drive sustainable practices. By forecasting fashion trends and optimizing production, brands can reduce their ecological footprint while remaining competitive in a rapidly changing market. However, one of the key challenges in forecasting fashion sales is the dynamic nature of consumer preferences. Fashion is acyclical, with trends constantly evolving and resurfacing. In addition, cultural changes and unexpected events can disrupt established patterns. This problem is also known as New Fashion Products Performance Forecasting (NFPPF), and it has recently gained more and more interest in the global research landscape. Given its multidisciplinary nature, the field of NFPPF has been approached from many different angles. This comprehensive survey wishes to provide an up-to-date overview that focuses on learning-based NFPPF strategies. The survey is based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodological flow, allowing for a systematic and complete literature review. In particular, we propose the first taxonomy that covers the learning panorama for NFPPF, examining in detail the different methodologies used to increase the amount of multimodal information, as well as the state-of-the-art available datasets. Finally, we discuss the challenges and future directions.
zh

[CV-5] HiMix: Reducing Computational Complexity in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在实际应用中因计算复杂度过高而受限的问题。具体而言,论文指出模型计算中涉及的冗余视觉序列是导致计算复杂度增加的主要瓶颈之一。为解决这一问题,论文提出了一种新颖的分层视觉-语言交互机制,称为“分层视觉注入混合注意力机制”(Hierarchical Vision injection for Mixture Attention, HiMix)。该机制的关键在于,仅对语言序列进行完整的正向传播,而视觉序列则在每个语言解码器层的特定阶段与语言序列进行交互。通过这种方式,HiMix显著降低了计算复杂度,同时保持了模型的性能。实验结果表明,HiMix在多个LVLM模型中实现了语言解码器计算成本10倍的降低,且性能损失极小。这一方法为视觉-语言理解领域提供了新的研究视角。

链接: https://arxiv.org/abs/2501.10318
作者: Xuange Zhang,Dengjie Li,Bo Liu,Zenghao Bao,Yao Zhou,Baisong Yang,Zhongying Liu,Yujie Zhong,Zheng Zhao,Tongtong Yuan
机构: Beijing University of Technology(北京工业大学); Meituan Inc.(美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models(LVLMs) have achieved prominent performance across a wide range of scenarios. However, the excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is caused by the involvement of redundant vision sequences in model computation. This is inspired by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. Then, we propose a novel hierarchical vision-language interaction mechanism called Hierarchical Vision injection for Mixture Attention (HiMix). In HiMix, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language at specific stages within each language decoder layer. It is striking that our approach significantly reduces computational complexity with minimal performance loss. Specifically, HiMix achieves a 10x reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. This highlights the advantages of our method, and we hope our research brings new perspectives to the field of vision-language understanding. Project Page: this https URL
zh

[CV-6] GSTAR: Gaussian Surface Tracking and Reconstruction

【速读】:该论文试图解决在动态场景中使用3D高斯分布(3D Gaussians)进行表面重建和跟踪时面临的挑战,特别是由于复杂拓扑变化(如表面出现、消失或分裂)导致的跟踪困难。为了解决这些问题,作者提出了GSTAR方法,该方法通过将高斯分布绑定到网格面(mesh faces)来表示动态物体,并在拓扑变化时自适应地解绑高斯分布,从而实现精确的注册和新表面的生成。GSTAR的关键创新在于其能够在拓扑一致的区域保持网格拓扑并使用高斯分布进行跟踪,而在拓扑变化的区域则通过解绑和优化高斯分布来生成新的表面。此外,作者还引入了一种基于表面的场景流(surface-based scene flow)方法,为帧间跟踪提供了鲁棒的初始化。实验结果表明,GSTAR能够有效地跟踪和重建动态表面,适用于多种应用场景。

链接: https://arxiv.org/abs/2501.10283
作者: Chengwei Zheng,Lixin Xue,Juan Zarate,Jie Song
机构: ETH Zürich(苏黎世联邦理工学院); HKUST(GZ)(香港科技大学广州校区); HKUST(香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting techniques have enabled efficient photo-realistic rendering of static scenes. Recent works have extended these approaches to support surface reconstruction and tracking. However, tracking dynamic surfaces with 3D Gaussians remains challenging due to complex topology changes, such as surfaces appearing, disappearing, or splitting. To address these challenges, we propose GSTAR, a novel method that achieves photo-realistic rendering, accurate surface reconstruction, and reliable 3D tracking for general dynamic scenes with changing topology. Given multi-view captures as input, GSTAR binds Gaussians to mesh faces to represent dynamic objects. For surfaces with consistent topology, GSTAR maintains the mesh topology and tracks the meshes using Gaussians. In regions where topology changes, GSTAR adaptively unbinds Gaussians from the mesh, enabling accurate registration and the generation of new surfaces based on these optimized Gaussians. Additionally, we introduce a surface-based scene flow method that provides robust initialization for tracking between frames. Experiments demonstrate that our method effectively tracks and reconstructs dynamic surfaces, enabling a range of applications. Our project page with the code release is available at this https URL.
zh

[CV-7] MutualForce: Mutual-Aware Enhancement for 4D Radar-LiDAR 3D Object Detection ICASSP2025

【速读】:该论文旨在解决自动驾驶领域中雷达(Radar)和激光雷达(LiDAR)点云融合时存在的模态不对齐(modality misalignment)和特征提取过程中的信息丢失问题。为了解决这些问题,作者提出了一种4D雷达-LiDAR框架,通过相互增强两者的表示来提升性能。解决方案的关键在于:首先,利用雷达的指示性特征(indicative features)来指导雷达和LiDAR的几何特征学习;其次,通过LiDAR的形状信息来丰富雷达的鸟瞰图(BEV)特征,从而弥补两者之间的稀疏性差异。通过在View-of-Delft(VoD)数据集上的大量实验,该方法在整体区域和驾驶走廊内的平均精度(mAP)分别达到了71.76%和86.36%,特别是在车辆检测方面,平均精度(AP)提升了4.17%和4.20%。

链接: https://arxiv.org/abs/2501.10266
作者: Xiangyuan Peng,Huawei Sun,Kay Bierzynski,Anton Fischbacher,Lorenzo Servadei,Robert Wille
机构: Technical University of Munich, Munich, Germany(慕尼黑工业大学); Infineon Technologies AG, Neubiberg, Germany(英飞凌科技公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Radar and LiDAR have been widely used in autonomous driving as LiDAR provides rich structure information, and radar demonstrates high robustness under adverse weather. Recent studies highlight the effectiveness of fusing radar and LiDAR point clouds. However, challenges remain due to the modality misalignment and information loss during feature extractions. To address these issues, we propose a 4D radar-LiDAR framework to mutually enhance their representations. Initially, the indicative features from radar are utilized to guide both radar and LiDAR geometric feature learning. Subsequently, to mitigate their sparsity gap, the shape information from LiDAR is used to enrich radar BEV features. Extensive experiments on the View-of-Delft (VoD) dataset demonstrate our approach’s superiority over existing methods, achieving the highest mAP of 71.76% across the entire area and 86.36% within the driving corridor. Especially for cars, we improve the AP by 4.17% and 4.20% due to the strong indicative features and symmetric shapes.
zh

[CV-8] Disharmony: Forensics using Reverse Lighting Harmonization

【速读】:该论文旨在解决图像编辑区域检测的问题,特别是针对通过深度学习(deep learning)方法生成或编辑的图像。现有的取证模型(forensic models)往往忽视了检测与背景和谐融合的对象(harmonized objects),而本文提出的Disharmony Network通过结合和谐化数据(harmonization data)和分割模型(segmentation model),能够有效识别这些编辑区域。该模型利用了一个包含多种和谐化技术的聚合数据集,显著提升了在背景中识别和谐化对象的能力,并展示了在检测多种编辑形式(如虚拟试穿任务)方面的潜力。

链接: https://arxiv.org/abs/2501.10212
作者: Philip Wootaek Shin,Jack Sampson,Vijaykrishnan Narayanan,Andres Marquez,Mahantesh Halappanavar
机构: The Pennsylvania State University (宾夕法尼亚州立大学); Pacific Northwest National Laboratory (太平洋西北国家实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Content generation and manipulation approaches based on deep learning methods have seen significant advancements, leading to an increased need for techniques to detect whether an image has been generated or edited. Another area of research focuses on the insertion and harmonization of objects within images. In this study, we explore the potential of using harmonization data in conjunction with a segmentation model to enhance the detection of edited image regions. These edits can be either manually crafted or generated using deep learning methods. Our findings demonstrate that this approach can effectively identify such edits. Existing forensic models often overlook the detection of harmonized objects in relation to the background, but our proposed Disharmony Network addresses this gap. By utilizing an aggregated dataset of harmonization techniques, our model outperforms existing forensic networks in identifying harmonized objects integrated into their backgrounds, and shows potential for detecting various forms of edits, including virtual try-on tasks.
zh

[CV-9] Hypercone Assisted Contour Generation for Out-of-Distribution Detection

【速读】:该论文旨在解决分布外检测(Out-of-Distribution, OOD)中的关键问题,即如何在不假设数据分布的情况下,自动适应数据的分布特性,从而提高检测性能。现有的方法大多依赖于距离度量,而较少利用分布感知来提升效果。论文提出的解决方案HAC _k -OOD通过构建一组超锥体(hypercones),最大化给定数据点邻域内的角距离,从而近似描述分布内数据点(In-Distribution, ID)的轮廓。这种方法不需要显式训练OOD检测性能,实验结果表明,在CIFAR-100基准测试中,HAC _k -OOD在近OOD和远OOD检测任务上均达到了最先进的FPR@95和AUROC性能。

链接: https://arxiv.org/abs/2501.10209
作者: Annita Vapsi,Andrés Muñoz,Nancy Thomas,Keshav Ramani,Daniel Borrajo
机构: AI Research, J.P. Morgan Chase & Co. (J.P. 摩根大通公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in the field of out-of-distribution (OOD) detection have placed great emphasis on learning better representations suited to this task. While there are distance-based approaches, distributional awareness has seldom been exploited for better performance. We present HAC_k-OOD, a novel OOD detection method that makes no distributional assumption about the data, but automatically adapts to its distribution. Specifically, HAC_k-OOD constructs a set of hypercones by maximizing the angular distance to neighbors in a given data-point’s vicinity to approximate the contour within which in-distribution (ID) data-points lie. Experimental results show state-of-the-art FPR@95 and AUROC performance on Near-OOD detection and on Far-OOD detection on the challenging CIFAR-100 benchmark without explicitly training for OOD performance.
zh

[CV-10] Adaptive Clustering for Efficient Phenotype Segmentation of UAV Hyperspectral Data WACV2025

【速读】:该论文试图解决无人机(UAV)结合高光谱成像(HSI)在环境和农业应用中面临的数据密集性和计算资源有限的问题。具体而言,高光谱成像数据量大,而远程设备的计算资源和存储能力有限,这限制了实时树表型分割的应用。论文提出的解决方案是引入一种在线高光谱简单线性迭代聚类算法(OHSLIC)框架。该框架通过自适应增量聚类和轻量级神经网络来减少固有噪声和计算需求,从而实现对叶片中叶绿素、类胡萝卜素和花青素等生化特性的实时表型分割。OHSLIC的关键在于其能够在计算效率和精度之间进行动态权衡,显著减少了推理时间,并实现了优于基于像素或窗口的方法的回归精度和分割性能。这一方法为高光谱成像应用在边缘设备上的可扩展部署铺平了道路。

链接: https://arxiv.org/abs/2501.10199
作者: Ciem Cornelissen,Sam Leroux,Pieter Simoens
机构: IDLab, Department of Information Technology at Ghent University - imec (IDLab, 根特大学信息技术系 - imec)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: accepted WACV 2025 GeoCV workshop

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) combined with Hyperspectral imaging (HSI) offer potential for environmental and agricultural applications by capturing detailed spectral information that enables the prediction of invisible features like biochemical leaf properties. However, the data-intensive nature of HSI poses challenges for remote devices, which have limited computational resources and storage. This paper introduces an Online Hyperspectral Simple Linear Iterative Clustering algorithm (OHSLIC) framework for real-time tree phenotype segmentation. OHSLIC reduces inherent noise and computational demands through adaptive incremental clustering and a lightweight neural network, which phenotypes trees using leaf contents such as chlorophyll, carotenoids, and anthocyanins. A hyperspectral dataset is created using a custom simulator that incorporates realistic leaf parameters, and light interactions. Results demonstrate that OHSLIC achieves superior regression accuracy and segmentation performance compared to pixel- or window-based methods while significantly reducing inference time. The method's adaptive clustering enables dynamic trade-offs between computational efficiency and accuracy, paving the way for scalable edge-device deployment in HSI applications.
zh

[CV-11] CSHNet: A Novel Information Asymmetric Image Translation Method

【速读】:该论文旨在解决跨域图像翻译(cross-domain image translation)中的不对称任务问题,特别是从细节较少的领域(如SAR图像或草图)向内容更丰富的领域(如光学图像或实例图像)转换时面临的挑战。传统基于卷积神经网络(CNN)的方法虽然在捕捉细节方面表现良好,但在全局结构处理上存在不足,容易导致图像区域的不当合并。为解决这一问题,论文提出了CNN-Swin混合网络(CSHNet),其核心在于结合了两个关键模块:Swin嵌入CNN(SEC)和CNN嵌入Swin(CES),形成了SEC-CES瓶颈(SCB)。SEC利用CNN的细节特征提取能力,同时集成了Swin Transformer的结构偏置;而CES则保留了Swin Transformer的全局完整性,弥补了CNN在结构关注上的不足。此外,CSHNet还引入了交互式引导连接(IGC)和自适应边缘感知损失(AEPL),分别用于增强跨域信息的动态交换和保持翻译过程中的结构边界。实验结果表明,CSHNet在视觉质量和性能指标上均优于现有方法。

链接: https://arxiv.org/abs/2501.10197
作者: Xi Yang,Haoyuan Shi,Zihan Wang,Nannan Wang,Xinbo Gao
机构: Xidian University (西安电子科技大学); Hangzhou Institute of Technology, Xidian University (杭州研究院, 西安电子科技大学); Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications (重庆邮电大学图像认知重庆市重点实验室); School of Electronic Engineering, Xidian University (西安电子科技大学电子工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite advancements in cross-domain image translation, challenges persist in asymmetric tasks such as SAR-to-Optical and Sketch-to-Instance conversions, which involve transforming data from a less detailed domain into one with richer content. Traditional CNN-based methods are effective at capturing fine details but struggle with global structure, leading to unwanted merging of image regions. To address this, we propose the CNN-Swin Hybrid Network (CSHNet), which combines two key modules: Swin Embedded CNN (SEC) and CNN Embedded Swin (CES), forming the SEC-CES-Bottleneck (SCB). SEC leverages CNN’s detailed feature extraction while integrating the Swin Transformer’s structural bias. CES, in turn, preserves the Swin Transformer’s global integrity, compensating for CNN’s lack of focus on structure. Additionally, CSHNet includes two components designed to enhance cross-domain information retention: the Interactive Guided Connection (IGC), which enables dynamic information exchange between SEC and CES, and Adaptive Edge Perception Loss (AEPL), which maintains structural boundaries during translation. Experimental results show that CSHNet outperforms existing methods in both visual quality and performance metrics across scene-level and instance-level datasets. Our code is available at: this https URL.
zh

[CV-12] Structure-guided Deep Multi-View Clustering

【速读】:该论文试图解决深度多视图聚类(Deep Multi-view Clustering)中现有方法未能充分挖掘多视图结构信息和数据分布的问题,从而限制了聚类性能。为了解决这些局限性,论文提出了一种结构引导的深度多视图聚类模型。其解决方案的关键在于两个方面:首先,引入了一种基于邻域关系的正样本选择策略,并结合相应的损失函数,通过构建多视图最近邻图来动态重新定义正样本对,从而挖掘多视图数据中的局部结构信息并增强正样本选择的可靠性;其次,引入高斯分布模型来揭示潜在的结构信息,并通过损失函数减少视图嵌入之间的差异。这两种策略从不同角度探索了多视图结构信息和数据分布,增强了视图间的一致性并提高了簇内紧凑性。实验结果表明,该方法在多个基准数据集上的聚类性能显著优于现有的多视图聚类方法。

链接: https://arxiv.org/abs/2501.10157
作者: Jinrong Cui,Xiaohuang Wu,Haitao Zhang,Chongjie Dong,Jie Wen
机构: College of Mathematics and Informatics, South China Agricultural University (华南农业大学数学与信息学院); Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China (电子科技大学深圳高等研究院); Dongguan Polytechnic (东莞职业技术学院); Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology (哈尔滨工业大学深圳视觉对象检测与识别重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep multi-view clustering seeks to utilize the abundant information from multiple views to improve clustering performance. However, most of the existing clustering methods often neglect to fully mine multi-view structural information and fail to explore the distribution of multi-view data, limiting clustering performance. To address these limitations, we propose a structure-guided deep multi-view clustering model. Specifically, we introduce a positive sample selection strategy based on neighborhood relationships, coupled with a corresponding loss function. This strategy constructs multi-view nearest neighbor graphs to dynamically redefine positive sample pairs, enabling the mining of local structural information within multi-view data and enhancing the reliability of positive sample selection. Additionally, we introduce a Gaussian distribution model to uncover latent structural information and introduce a loss function to reduce discrepancies between view embeddings. These two strategies explore multi-view structural information and data distribution from different perspectives, enhancing consistency across views and increasing intra-cluster compactness. Experimental evaluations demonstrate the efficacy of our method, showing significant improvements in clustering performance on multiple benchmark datasets compared to state-of-the-art multi-view clustering approaches.
zh

[CV-13] A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features

【速读】:该论文试图解决遥感场景理解中生成复杂环境(如多种土地利用区域或沿海地区,可能包括雪、云或雾霾)准确表示的挑战。解决方案的关键在于提出了一种名为Spectral LLaVA的视觉-语言框架,该框架通过将多光谱数据与视觉-语言对齐技术相结合,增强了场景表示和描述能力。具体而言,该框架在Sentinel-2的BigEarthNet v2数据集上建立了基于RGB的场景描述基线,并通过引入多光谱信息显著提升了性能。Spectral LLaVA优化了一个轻量级的线性投影层以实现对齐,同时保持SpectralGPT的视觉骨干网络冻结。实验结果表明,该框架能够生成详细且准确的场景描述,特别是在RGB数据不足的情况下,同时通过将SpectralGPT特征优化为语义上有意义的表示,提升了分类性能。

链接: https://arxiv.org/abs/2501.10144
作者: Enes Karanfil,Nevrez Imamoglu,Erkut Erdem,Aykut Erdem
机构: AIST, Tokyo, Japan (日本产业技术综合研究所); Hacettepe University, Ankara, Turkey (哈塞特佩大学); Koç University, Istanbul, Turkey (科奇大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA’s ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.
zh

[CV-14] ACE: Anatomically Consistent Embeddings in Composition and Decomposition WACV2025

【速读】:该论文试图解决现有自监督学习(SSL)方法在医学图像处理中未能充分利用医学图像固有的可组合/可分解结构属性的问题。医学图像通常包含一致的宏观或微观解剖结构,这些结构由可组合/可分解的器官和组织构成,而现有的SSL方法未能有效利用这些特性。为此,论文提出了一种名为ACE的新型SSL方法,通过组合和分解来学习解剖学上一致的嵌入表示。ACE的关键创新在于其两个核心分支:(1) 全局一致性分支,通过提取全局特征来捕捉具有区分性的宏观结构;(2) 局部一致性分支,通过相应的矩阵匹配从可组合/可分解的局部图像块特征中学习细粒度的解剖细节。实验结果表明,ACE在少样本学习、微调和属性分析中表现出卓越的鲁棒性、可迁移性和临床潜力。

链接: https://arxiv.org/abs/2501.10131
作者: Ziyu Zhou,Haozhe Luo,Mohammad Reza Hosseinzadeh Taher,Jiaxuan Pang,Xiaowei Ding,Michael Gotway,Jianming Liang
机构: Shanghai Jiao Tong University (上海交通大学); Arizona State University (亚利桑那州立大学); University of Bern (伯尔尼大学); Mayo Clinic (梅奥诊所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WACV 2025

点击查看摘要

Abstract:Medical images acquired from standardized protocols show consistent macroscopic or microscopic anatomical structures, and these structures consist of composable/decomposable organs and tissues, but existing self-supervised learning (SSL) methods do not appreciate such composable/decomposable structure attributes inherent to medical images. To overcome this limitation, this paper introduces a novel SSL approach called ACE to learn anatomically consistent embedding via composition and decomposition with two key branches: (1) global consistency, capturing discriminative macro-structures via extracting global features; (2) local consistency, learning fine-grained anatomical details from composable/decomposable patch features via corresponding matrix matching. Experimental results across 6 datasets and 2 backbones, evaluated in few-shot learning, fine-tuning, and property analysis, show ACE’s superior robustness, transferability, and clinical potential. The innovations of our ACE lie in grid-wise image cropping, leveraging the intrinsic properties of compositionality and decompositionality of medical images, bridging the semantic gap from high-level pathologies to low-level tissue anomalies, and providing a new SSL method for medical imaging.
zh

[CV-15] Spatio-temporal Graph Learning on Adaptive Mined Key Frames for High-performance Multi-Object Tracking

【速读】:该论文试图解决多目标跟踪(multi-object tracking)中由于物体间频繁遮挡(mutual occlusions)导致的跟踪误差和性能下降问题。解决方案的关键在于提出了两种创新模块:一是基于强化学习(reinforcement learning)的自适应关键帧提取(Key Frame Extraction, KFE)模块,用于自适应分割视频并捕捉视频内容的内在逻辑,从而更好地捕获物体间的空间和时间关系;二是帧内特征融合(Intra-Frame Feature Fusion, IFF)模块,利用图卷积网络(Graph Convolutional Network, GCN)在单帧内促进目标物体与周围物体之间的信息交换,显著提高了目标的可区分性,并减少了由于遮挡导致的跟踪丢失和外观相似性问题。通过结合长轨迹和短轨迹的优势,并考虑物体间的空间关系,该跟踪器在MOT17数据集上取得了显著的效果提升。

链接: https://arxiv.org/abs/2501.10129
作者: Futian Wang,Fengxiang Liu,Xiao Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the realm of multi-object tracking, the challenge of accurately capturing the spatial and temporal relationships between objects in video sequences remains a significant hurdle. This is further complicated by frequent occurrences of mutual occlusions among objects, which can lead to tracking errors and reduced performance in existing methods. Motivated by these challenges, we propose a novel adaptive key frame mining strategy that addresses the limitations of current tracking approaches. Specifically, we introduce a Key Frame Extraction (KFE) module that leverages reinforcement learning to adaptively segment videos, thereby guiding the tracker to exploit the intrinsic logic of the video content. This approach allows us to capture structured spatial relationships between different objects as well as the temporal relationships of objects across frames. To tackle the issue of object occlusions, we have developed an Intra-Frame Feature Fusion (IFF) module. Unlike traditional graph-based methods that primarily focus on inter-frame feature fusion, our IFF module uses a Graph Convolutional Network (GCN) to facilitate information exchange between the target and surrounding objects within a frame. This innovation significantly enhances target distinguishability and mitigates tracking loss and appearance similarity due to occlusions. By combining the strengths of both long and short trajectories and considering the spatial relationships between objects, our proposed tracker achieves impressive results on the MOT17 dataset, i.e., 68.6 HOTA, 81.0 IDF1, 66.6 AssA, and 893 IDS, proving its effectiveness and accuracy.
zh

[CV-16] DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency

【速读】:该论文旨在解决基于扩散模型(Diffusion Models)在视频超分辨率(video super-resolution)应用中面临的两个主要挑战:高保真度和时间一致性(temporal consistency)的平衡问题。为了解决这些问题,论文提出了DiffVSR框架,其关键创新包括:1)多尺度时间注意力模块(multi-scale temporal attention module)和时间增强的VAE解码器(temporal-enhanced VAE decoder),用于捕捉细粒度的运动细节,确保序列内的连贯性(intra-sequence coherence);2)噪声重调度机制(noise rescheduling mechanism)和交织潜在转换方法(interweaved latent transition approach),以增强序列间的稳定性(inter-sequence stability),而无需额外的训练开销;3)渐进式学习策略(progressive learning strategy),从简单到复杂的退化过程进行优化,从而在高质量视频数据有限的情况下实现鲁棒的优化。通过这些创新,DiffVSR在视觉质量和时间一致性方面均表现出色,为真实场景下的视频超分辨率设定了新的性能标准。

链接: https://arxiv.org/abs/2501.10110
作者: Xiaohui Li,Yihao Liu,Shuo Cao,Ziyan Chen,Shaobin Zhuang,Xiangyu Chen,Yinan He,Yi Wang,Yu Qiao
机构: Shanghai Jiao Tong University(上海交通大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); University of Science and Technology of China(中国科学技术大学); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: \url{ this https URL }

点击查看摘要

Abstract:Diffusion models have demonstrated exceptional capabilities in image generation and restoration, yet their application to video super-resolution faces significant challenges in maintaining both high fidelity and temporal consistency. We present DiffVSR, a diffusion-based framework for real-world video super-resolution that effectively addresses these challenges through key innovations. For intra-sequence coherence, we develop a multi-scale temporal attention module and temporal-enhanced VAE decoder that capture fine-grained motion details. To ensure inter-sequence stability, we introduce a noise rescheduling mechanism with an interweaved latent transition approach, which enhances temporal consistency without additional training overhead. We propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization despite limited high-quality video data. Extensive experiments demonstrate that DiffVSR delivers superior results in both visual quality and temporal consistency, setting a new performance standard in real-world video super-resolution.
zh

[CV-17] Universal Actions for Enhanced Embodied Foundation Models

【速读】:该论文试图解决在构建具身智能体(embodied agents)时,由于不同机器人的物理具身和控制接口的差异,导致动作空间(action space)存在显著异质性(heterogeneity),从而难以利用跨领域数据开发具身基础模型(embodied foundation models)的问题。解决方案的关键在于提出了UniAct框架,该框架通过在一个标记化的通用动作空间(tokenized Universal Action Space)中操作,捕捉不同机器人之间共享的结构特征,从而提取通用的原子行为(generic atomic behaviors)。这些通用动作消除了异质性,增强了跨领域数据的利用和跨具身泛化能力。通过简单地添加具身特定的细节,通用动作可以高效地转换为异构的可执行命令,使得快速适应新机器人变得简单直接。UniAct的0.5B实例在多个真实世界和仿真机器人上的广泛评估中,表现优于14倍大的现有最佳具身基础模型,展示了卓越的跨具身控制和适应能力,凸显了采用通用动作的关键优势。

链接: https://arxiv.org/abs/2501.10105
作者: Jinliang Zheng,Jianxiong Li,Dongxiu Liu,Yinan Zheng,Zhihao Wang,Zhonghong Ou,Yu Liu,Jingjing Liu,Ya-Qin Zhang,Xianyuan Zhan
机构: 1 AIR, Tsinghua University (清华大学); 2 Sensetime Research (商汤科技研究院); 3 Peking University (北京大学); 4 Beijing University of Posts and Telecommunications (北京邮电大学); 5 Shanghai AI Lab (上海人工智能实验室)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Training on diverse, internet-scale data is a key factor in the success of recent large foundation models. Yet, using the same recipe for building embodied agents has faced noticeable difficulties. Despite the availability of many crowd-sourced embodied datasets, their action spaces often exhibit significant heterogeneity due to distinct physical embodiment and control interfaces for different robots, causing substantial challenges in developing embodied foundation models using cross-domain data. In this paper, we introduce UniAct, a new embodied foundation modeling framework operating in a tokenized Universal Action Space. Our learned universal actions capture the generic atomic behaviors across diverse robots by exploiting their shared structural features, and enable enhanced cross-domain data utilization and cross-embodiment generalizations by eliminating the notorious heterogeneity. The universal actions can be efficiently translated back to heterogeneous actionable commands by simply adding embodiment-specific details, from which fast adaptation to new robots becomes simple and straightforward. Our 0.5B instantiation of UniAct outperforms 14X larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robots, showcasing exceptional cross-embodiment control and adaptation capability, highlighting the crucial benefit of adopting universal actions. Project page: this https URL
zh

[CV-18] andmarker: a Toolkit for Anatomical Landmark Localization in 2D/3D Images

【速读】:该论文旨在解决医学影像领域中二维/三维图像解剖标志点定位(anatomical landmark localization)的挑战。现有的通用计算机视觉工具(如姿态估计工具)缺乏针对医学领域解剖标志点定位所需的专业功能和模块化设计。为此,论文提出了一个基于PyTorch的Python工具包——landmarker。该工具包的核心解决方案在于提供了一个全面且灵活的工具集,支持多种方法(如静态和自适应热图回归),并能够提高标志点识别的准确性,简化研发流程,支持多种图像格式和预处理流程。其模块化设计允许用户根据特定数据集和应用进行定制和扩展,从而加速医学影像领域的创新。landmarker的关键在于其针对医学影像任务的精确性和定制化需求,弥补了现有通用工具在此领域的不足。

链接: https://arxiv.org/abs/2501.10098
作者: Jef Jonkers,Luc Duchateau,Glenn Van Wallendael,Sofie Van Hoecke
机构: Ghent University(根特大学); Ghent University - imec(根特大学 - imec)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Anatomical landmark localization in 2D/3D images is a critical task in medical imaging. Although many general-purpose tools exist for landmark localization in classical computer vision tasks, such as pose estimation, they lack the specialized features and modularity necessary for anatomical landmark localization applications in the medical domain. Therefore, we introduce landmarker, a Python package built on PyTorch. The package provides a comprehensive, flexible toolkit for developing and evaluating landmark localization algorithms, supporting a range of methodologies, including static and adaptive heatmap regression. landmarker enhances the accuracy of landmark identification, streamlines research and development processes, and supports various image formats and preprocessing pipelines. Its modular design allows users to customize and extend the toolkit for specific datasets and applications, accelerating innovation in medical imaging. landmarker addresses a critical need for precision and customization in landmark localization tasks not adequately met by existing general-purpose pose estimation tools.
zh
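
摘要提到 landmarker 支持静态与自适应热图回归;下面用 PyTorch 给出“静态高斯热图生成 + soft-argmax 亚像素解码”这一通用流程的示意。注意这并非 landmarker 包的真实 API,函数名与超参均为假设。

```python
import torch

def gaussian_heatmap(h, w, center, sigma=2.0):
    """为单个标志点生成静态高斯热图(经典做法的示意)。"""
    ys = torch.arange(h).float().view(-1, 1)
    xs = torch.arange(w).float().view(1, -1)
    cy, cx = center
    return torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def soft_argmax(heatmap, beta=20.0):
    """soft-argmax 解码:对热图做带温度的 softmax,再取坐标期望,得到亚像素位置。"""
    h, w = heatmap.shape
    probs = torch.softmax(heatmap.flatten() * beta, dim=0).view(h, w)
    ys = torch.arange(h).float()
    xs = torch.arange(w).float()
    cy = (probs.sum(dim=1) * ys).sum()
    cx = (probs.sum(dim=0) * xs).sum()
    return cy.item(), cx.item()

hm = gaussian_heatmap(64, 64, center=(20.0, 37.0))
print(soft_argmax(hm))   # 约为 (20, 37)
```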

[CV-19] Classifier Ensemble for Efficient Uncertainty Calibration of Deep Neural Networks for Image Classification

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在图像分类任务中存在的校准误差(calibration error)问题。尽管现有的深度神经网络在标准数据集上表现出较高的分类准确率,但它们往往存在显著的校准误差,即模型输出的置信度与真实概率之间存在偏差。论文提出了一种基于元模型(metamodel)的分类器集成(classifier ensemble)方法,以改善模型的校准性能。关键解决方案包括:1)通过多数投票(majority voting)和元模型集成技术构建简单而高效的分类器集成;2)元模型集成方法在减少期望校准误差(Expected Calibration Error, ECE)和最大校准误差(Maximum Calibration Error, MCE)方面表现优异,且对模型准确率影响最小;3)与传统模型集成相比,元模型集成在显著减少参数量的同时,提升了校准性能;4)该方法无需额外的校准数据集,优于传统的后处理校准方法。这些发现表明,基于元模型的分类器集成是一种高效且有效的模型校准改进方法,有助于提升深度学习系统的可靠性。

链接: https://arxiv.org/abs/2501.10089
作者: Michael Schulze,Nikolas Ebert,Laurenz Reichardt,Oliver Wasenmüller
机构: Mannheim University of Applied Sciences, Germany (曼海姆应用科学大学, 德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted at International Conference on Computer Vision Theory and Applications (VISAPP), 2025

点击查看摘要

Abstract:This paper investigates novel classifier ensemble techniques for uncertainty calibration applied to various deep neural networks for image classification. We evaluate both accuracy and calibration metrics, focusing on Expected Calibration Error (ECE) and Maximum Calibration Error (MCE). Our work compares different methods for building simple yet efficient classifier ensembles, including majority voting and several metamodel-based approaches. Our evaluation reveals that while state-of-the-art deep neural networks for image classification achieve high accuracy on standard datasets, they frequently suffer from significant calibration errors. Basic ensemble techniques like majority voting provide modest improvements, while metamodel-based ensembles consistently reduce ECE and MCE across all architectures. Notably, the largest of our compared metamodels demonstrates the most substantial calibration improvements, with minimal impact on accuracy. Moreover, classifier ensembles with metamodels outperform traditional model ensembles in calibration performance, while requiring significantly fewer parameters. In comparison to traditional post-hoc calibration methods, our approach removes the need for a separate calibration dataset. These findings underscore the potential of our proposed metamodel-based classifier ensembles as an efficient and effective approach to improving model calibration, thereby contributing to more reliable deep learning systems.
zh
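
摘要围绕 ECE/MCE 与投票式集成展开;以下给出 ECE/MCE 的标准分箱估计,以及对成员模型 softmax 概率取平均的 soft voting 示意(分箱数、成员数量与随机数据均为假设)。

```python
import numpy as np

def ece_mce(probs, labels, n_bins=15):
    """期望校准误差(ECE)与最大校准误差(MCE)的标准分箱估计。
    probs: (N, C) 的 softmax 概率;labels: (N,) 的真实类别。"""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    acc = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(acc[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap     # 按箱内样本占比加权
            mce = max(mce, gap)
    return ece, mce

# soft voting:对多个成员模型的概率取平均(多数投票的概率化版本,数据为随机假设)
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
members = [rng.dirichlet(np.ones(10), size=1000) for _ in range(3)]
ensemble = np.mean(members, axis=0)
print(ece_mce(ensemble, labels))
```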

[CV-20] Leveraging Confident Image Regions for Source-Free Domain-Adaptive Object Detection

【速读】:该论文试图解决源数据不可用情况下的领域自适应目标检测(source-free domain-adaptive object detection)问题。具体来说,研究的目标是在不依赖源数据的情况下,将预训练的检测器(source-pretrained detector)适应到一个不同的目标领域(target domain)。目前,针对这一问题的数据增强方案尚不完善。为此,论文提出了一种新颖的数据增强方法,该方法通过裁剪目标图像中检测器置信度较高的区域,并结合其伪标签(pseudo-labels)进行增强,然后将这些增强后的区域组合成一个具有挑战性的目标图像,以进一步适应检测器。由于在适应过程中无法访问源数据,研究采用了教师-学生(teacher-student)学习范式,以确保模型在适应过程中不会崩溃。该方法在三个交通场景的自适应基准测试中进行了评估,并在其中两个基准上取得了新的最先进水平。

链接: https://arxiv.org/abs/2501.10081
作者: Mohamed Lamine Mekhalfi,Davide Boscaini,Fabio Poiesi
机构: Fondazione Bruno Kessler (FBK) (布鲁诺·凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Source-free domain-adaptive object detection is an interesting but scarcely addressed topic. It aims at adapting a source-pretrained detector to a distinct target domain without resorting to source data during adaptation. So far, there is no data augmentation scheme tailored to source-free domain-adaptive object detection. To this end, this paper presents a novel data augmentation approach that cuts out target image regions where the detector is confident, augments them along with their respective pseudo-labels, and joins them into a challenging target image to adapt the detector. As the source data is out of reach during adaptation, we implement our approach within a teacher-student learning paradigm to ensure that the model does not collapse during the adaptation procedure. We evaluated our approach on three adaptation benchmarks of traffic scenes, scoring new state-of-the-art on two of them.
zh

[CV-21] Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks WACV

【速读】:该论文旨在解决少样本语义分割(few-shot semantic segmentation)问题,特别是在涉及具有空间和层次关系的多部件机械场景中。解决方案的关键在于整合多个基础模型(foundation models),包括CLIPSeg、Segment Anything Model (SAM)、兴趣点检测器SuperPoint以及图卷积网络(GCN)。通过提供1到25个标注样本,该模型能够在合成数据集上实现有效的分割,并在真实数据上展现出强大的泛化能力。具体而言,模型在仅使用10个合成支持样本的情况下,在真实数据上达到了92.2的J&F分数,展示了从合成数据到真实数据的定性泛化能力。此外,该方法在DAVIS 2017数据集上的半监督视频分割任务中,使用3个支持样本达到了71.5的J&F分数。该方法的快速训练时间和对真实数据的有效泛化能力,使其成为与机械和基础设施交互的自主系统的有力工具,同时也展示了组合和协调基础模型在少样本分割任务中的潜力。

链接: https://arxiv.org/abs/2501.10080
作者: Michael Schwingshackl,Fabio Francisco Oberweger,Markus Murschitz
机构: AIT Austrian Institute of Technology(奥地利技术研究所); Center for Vision, Automation & Control(视觉、自动化与控制中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Winter Conference on Applications of Computer Vision (WACV) 2025. Code available at this https URL

点击查看摘要

Abstract:This paper proposes a novel approach to few-shot semantic segmentation for machinery with multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the foundation models CLIPSeg and Segment Anything Model (SAM) with the interest point detector SuperPoint and a graph convolutional network (GCN) to accurately segment machinery parts. By providing 1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset depicting a truck-mounted loading crane, achieves effective segmentation across various levels of detail. Training times are kept under five minutes on consumer GPUs. The model demonstrates robust generalization to real data, achieving a qualitative synthetic-to-real generalization with a J&F score of 92.2 on real data using 10 synthetic support samples. When benchmarked on the DAVIS 2017 dataset, it achieves a J&F score of 71.5 in semi-supervised video segmentation with three support samples. This method’s fast training times and effective generalization to real data make it a valuable tool for autonomous systems interacting with machinery and infrastructure, and illustrate the potential of combined and orchestrated foundation models for few-shot segmentation tasks.
zh

[CV-22] Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework

【速读】:该论文旨在解决遥感变化描述(Remote Sensing Change Captioning, RSICC)任务中的挑战,包括光照差异、视角变化、模糊效应以及不同空间分辨率和配准误差导致的描述不准确问题,尤其是在无变化区域。为了解决这些问题,作者提出了SECOND-CC数据集,该数据集包含6,041对高分辨率RGB图像对、语义分割图以及多样化的真实场景,并提供了30,205句描述图像差异的句子。此外,作者提出了MModalCC,一种多模态框架,通过引入跨模态交叉注意力(Cross-Modal Cross Attention, CMCA)和多模态门控交叉注意力(Multimodal Gated Cross Attention, MGCA)等先进的注意力机制,整合语义和视觉数据。实验表明,MModalCC在BLEU4和CIDEr评分上分别比现有最优方法(如RSICCformer、Chg2Cap和PSNet)提升了4.6%和9.6%,有效应对了RSICC任务中的挑战。

链接: https://arxiv.org/abs/2501.10075
作者: Ali Can Karaca,M. Enes Ozelbas,Saadettin Berber,Orkhan Karimli,Turabi Yildirim,M. Fatih Amasyali
机构: Department of Computer Engineering, Yildiz Technical University (伊斯坦布尔耶尔德兹技术大学计算机工程系); Multimodal Systems for Artificial Intelligence and Computation (MOSAIC) Research Group, Yildiz Technical University (伊斯坦布尔耶尔德兹技术大学多模态人工智能与计算系统研究组)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: This work has been submitted to the IEEE Transactions on Geoscience and Remote Sensing journal for possible publication

点击查看摘要

Abstract:Remote sensing change captioning (RSICC) aims to describe changes between bitemporal images in natural language. Existing methods often fail under challenges like illumination differences, viewpoint changes, and blur effects, leading to inaccuracies, especially in no-change regions. Moreover, images acquired at different spatial resolutions, together with registration errors, tend to affect the captions. To address these issues, we introduce SECOND-CC, a novel RSICC dataset featuring high-resolution RGB image pairs, semantic segmentation maps, and diverse real-world scenarios. SECOND-CC contains 6,041 pairs of bitemporal RS images and 30,205 sentences describing the differences between images. Additionally, we propose MModalCC, a multimodal framework that integrates semantic and visual data using advanced attention mechanisms, including Cross-Modal Cross Attention (CMCA) and Multimodal Gated Cross Attention (MGCA). Detailed ablation studies and attention visualizations further demonstrate its effectiveness and ability to address RSICC challenges. Comprehensive experiments show that MModalCC outperforms state-of-the-art RSICC methods, including RSICCformer, Chg2Cap, and PSNet with +4.6% improvement on BLEU4 score and +9.6% improvement on CIDEr score. We will make our dataset and codebase publicly available to facilitate future research at this https URL
zh
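
CMCA/MGCA 的内部结构未在摘要中给出;下面是“以视觉特征为 query、语义特征为 key/value”的跨模态交叉注意力的通用 PyTorch 示意(非官方实现,特征维度与头数均为假设)。

```python
import torch
import torch.nn as nn

class CrossModalCrossAttention(nn.Module):
    """跨模态交叉注意力的通用示意(非官方 CMCA 实现):
    视觉 token 作为 query,语义分割特征作为 key/value。"""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, sem_tokens):
        out, _ = self.attn(query=vis_tokens, key=sem_tokens, value=sem_tokens)
        return self.norm(vis_tokens + out)   # 残差连接 + 归一化

cmca = CrossModalCrossAttention()
vis = torch.randn(2, 196, 256)   # 假设的双时相视觉特征 token
sem = torch.randn(2, 196, 256)   # 假设的语义分割特征 token
print(cmca(vis, sem).shape)      # torch.Size([2, 196, 256])
```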

[CV-23] SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning

【速读】:该论文旨在解决具身人工智能(Embodied AI)研究中的空间推理(Spatial Reasoning)问题。现有方法通过补充空间数据和微调来增强空间推理能力,但在处理复杂具身任务时效果有限,主要依赖于基于语言的输出。尽管一些方法引入了基于点的动作空间来缓解这一问题,但在复杂环境中处理更精细任务时仍显不足,原因是未能充分利用视觉-语言模型(Vision-Language Models, VLMs)固有的思维和推理能力。为解决这些局限性,论文提出了一种名为SpatialCoT的新方法,其核心包括两个阶段:空间坐标双向对齐(Spatial Coordinate Bi-directional Alignment),将视觉-语言输入与空间坐标对齐;以及思维链空间接地(Chain-of-Thought Spatial Grounding),利用语言模型的推理能力进行高级空间推理。实验结果表明,该方法在导航和操作任务中显著优于现有最先进方法。

链接: https://arxiv.org/abs/2501.10074
作者: Yuecheng Liu,Dafeng Chi,Shiguang Wu,Zhanguang Zhang,Yaochen Hu,Lingfeng Zhang,Yingxue Zhang,Shuang Wu,Tongtong Cao,Guowei Huang,Guangjian Tian,Xingyue Quan,Jianye Hao,Yuzheng Zhuang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks, largely due to their dependence on language-based outputs. While some approaches have introduced a point-based action space to mitigate this issue, they fall short in managing more intricate tasks within complex environments. This deficiency arises from their failure to fully exploit the inherent thinking and reasoning capabilities that are fundamental strengths of Vision-Language Models (VLMs). To address these limitations, we propose a novel approach named SpatialCoT, specifically designed to bolster the spatial reasoning capabilities of VLMs. Our approach comprises two stages: spatial coordinate bi-directional alignment, which aligns vision-language inputs with spatial coordinates, and chain-of-thought spatial grounding, which harnesses the reasoning capabilities of language models for advanced spatial reasoning. We evaluate SpatialCoT on challenging navigation and manipulation tasks, both in simulation and real-world settings. Experimental results demonstrate that our method significantly outperforms previous state-of-the-art approaches in both tasks.
zh

[CV-24] CLIP-PCQA: Exploring Subjective-Aligned Vision-Language Modeling for Point Cloud Quality Assessment

【速读】:该论文旨在解决现有无参考点云质量评估(No-Reference Point Cloud Quality Assessment, NR-PCQA)方法中存在的直接映射视觉数据到平均意见分数(Mean Opinion Score, MOS)的问题,这种方法与实际主观评估机制不符。为了解决这一问题,论文提出了一种名为CLIP-PCQA的新型语言驱动点云质量评估方法。其关键解决方案在于采用基于检索的映射策略,模拟人类主观评估过程。具体而言,该方法基于CLIP(Contrastive Language–Image Pretraining)的思想,通过计算视觉特征与多个对应不同质量描述的文本特征之间的余弦相似度,并引入有效的对比损失和可学习的提示词(learnable prompts)来增强特征提取。此外,考虑到主观实验中的个人局限性和偏差,该方法进一步将特征相似度转换为概率,并将意见分数分布(Opinion Score Distribution, OSD)而非单一MOS作为最终目标。实验结果表明,CLIP-PCQA在性能上优于其他现有的最先进方法。

链接: https://arxiv.org/abs/2501.10071
作者: Yating Liu,Yujie Zhang,Ziyu Shan,Yiling Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:In recent years, No-Reference Point Cloud Quality Assessment (NR-PCQA) research has achieved significant progress. However, existing methods mostly seek a direct mapping function from visual data to the Mean Opinion Score (MOS), which is contradictory to the mechanism of practical subjective evaluation. To address this, we propose a novel language-driven PCQA method named CLIP-PCQA. Considering that human beings prefer to describe visual quality using discrete quality descriptions (e.g., “excellent” and “poor”) rather than specific scores, we adopt a retrieval-based mapping strategy to simulate the process of subjective assessment. More specifically, based on the philosophy of CLIP, we calculate the cosine similarity between the visual features and multiple textual features corresponding to different quality descriptions, a process in which an effective contrastive loss and learnable prompts are introduced to enhance the feature extraction. Meanwhile, given the personal limitations and bias in subjective experiments, we further convert the feature similarities into probabilities and consider the Opinion Score Distribution (OSD) rather than a single MOS as the final target. Experimental results show that our CLIP-PCQA outperforms other State-Of-The-Art (SOTA) approaches.
zh
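
按摘要描述的检索式映射思路,下面的示意把(假设已归一化的)视觉特征与若干质量等级的文本特征做余弦相似度,经 softmax 得到意见分数分布(OSD),再取期望作为预测分数;质量描述、等级分数与温度系数均为示例假设,特征以随机数代替真实 CLIP 编码器输出。

```python
import torch
import torch.nn.functional as F

levels = ["excellent", "good", "fair", "poor", "bad"]    # 离散质量描述(示例)
scores = torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])         # 各等级对应的分数(假设)

# 假设视觉/文本特征已由编码器得到并归一化;此处用随机数代替真实 CLIP 特征
img_feat = F.normalize(torch.randn(1, 512), dim=-1)
txt_feat = F.normalize(torch.randn(len(levels), 512), dim=-1)

logit_scale = 100.0                                       # CLIP 风格的温度系数(假设)
sim = logit_scale * img_feat @ txt_feat.t()               # 余弦相似度
osd = sim.softmax(dim=-1)                                 # 意见分数分布(OSD)
mos_hat = (osd * scores).sum(dim=-1)                      # 以分布期望作为预测分数
print(osd, mos_hat)
```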

[CV-25] FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization

【速读】:该论文旨在解决零样本(zero-shot)和少样本(few-shot)异常检测中的两个主要问题:一是现有方法依赖手工设计的通用描述,难以捕捉不同对象中可能出现的多样化异常;二是简单的图像-文本匹配方法在定位形状和大小各异的异常区域时表现不佳。为解决这些问题,论文提出了FiLo++方法,其核心包括两个关键组件:Fused Fine-Grained Descriptions (FusDes) 和 Deformable Localization (DefLoc)。FusDes利用大语言模型生成针对每个对象类别的异常描述,结合固定和可学习的提示模板,并通过运行时提示过滤方法生成更准确且任务特定的文本描述。DefLoc则整合了视觉基础模型Grounding DINO与位置增强的文本描述,并引入多尺度可变形跨模态交互(MDCI)模块,以实现对不同形状和大小异常区域的精确定位。此外,论文还设计了位置增强的块匹配方法,以提升少样本异常检测的性能。实验结果表明,FiLo++在多个数据集上显著优于现有方法。

链接: https://arxiv.org/abs/2501.10067
作者: Zhaopeng Gu,Bingke Zhu,Guibo Zhu,Yingying Chen,Ming Tang,Jinqiao Wang
机构: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China(中国科学院自动化研究所基础模型研究中心); School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学人工智能学院); Objecteye Inc., Beijing 100190, China(Objecteye公司); Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China(上海人工智能实验室); Wuhan AI Research, Wuhan 430073, China(武汉人工智能研究院); Peng Cheng Laboratory, Shenzhen 518066, China(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes large language models to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at this https URL.
zh

[CV-26] One-D-Piece: Image Tokenizer Meets Quality-Controllable Compression

【速读】:该论文旨在解决当前图像标记化方法(image tokenization)中存在的固定长度标记化(fixed-length tokenization)问题,这种固定长度标记化导致信息分配效率低下,无法根据图像内容动态调整标记数量。为解决这一问题,论文提出了一种名为One-D-Piece的离散图像标记化方法,支持可变长度标记化(variable-length tokenization),并通过引入“尾部标记丢弃”(Tail Token Drop)的正则化机制,实现了质量可控的压缩率。该机制通过将关键信息集中在标记序列的头部,支持可变长度标记化,同时保持了最先进的重建质量。实验结果表明,该方法在较小的字节大小下,显著优于现有的质量可控压缩方法(如JPEG和WebP),并在多种计算机视觉任务(如图像分类、目标检测、语义分割和深度估计)中表现出广泛的适应性。

链接: https://arxiv.org/abs/2501.10064
作者: Keita Miwa,Kento Sasaki,Hidehisa Arai,Tsubasa Takahashi,Yu Yamaguchi
机构: Turing Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Our Project Page: this https URL

点击查看摘要

Abstract:Current image tokenization methods require a large number of tokens to capture the information contained within images. Although the amount of information varies across images, most image tokenizers only support fixed-length tokenization, leading to inefficiency in token allocation. In this study, we introduce One-D-Piece, a discrete image tokenizer designed for variable-length tokenization with a quality-controllable mechanism. To enable a variable compression rate, we introduce a simple but effective regularization mechanism named “Tail Token Drop” into discrete one-dimensional image tokenizers. This method encourages critical information to concentrate at the head of the token sequence, enabling support of variadic tokenization, while preserving state-of-the-art reconstruction quality. We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality-controllable compression methods, including JPEG and WebP, at smaller byte sizes. Furthermore, we assess our tokenizer on various downstream computer vision tasks, including image classification, object detection, semantic segmentation, and depth estimation, confirming its adaptability to numerous applications compared to other variable-rate methods. Our approach demonstrates the versatility of variable-length discrete image tokenization, establishing a new paradigm in both compression efficiency and reconstruction performance. Finally, we validate the effectiveness of tail token drop via detailed analysis of tokenizers.
zh
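
“Tail Token Drop”的官方细节未公开;以下按摘要思路给出最小示意:训练时随机截断 1D token 序列的尾部,迫使关键信息集中到序列头部,推理时即可按目标码率截取任意长度的头部 token。最小保留长度等超参为假设。

```python
import torch

def tail_token_drop(tokens, min_keep=32):
    """训练期正则(示意):随机保留序列前 k 个 token、丢弃尾部。
    tokens: (B, L, D) 的 1D token 序列;min_keep 为假设的最小保留长度。"""
    B, L, D = tokens.shape
    k = int(torch.randint(min_keep, L + 1, (1,)))   # 每个 batch 采样一个保留长度
    return tokens[:, :k, :]

tokens = torch.randn(4, 256, 16)
print(tail_token_drop(tokens).shape)   # (4, k, 16),k ∈ [32, 256]
# 推理时按目标质量/码率直接截取头部 token 送入解码器重建即可
```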

[CV-27] LWGANet: A Lightweight Group Attention Backbone for Remote Sensing Visual Tasks

【速读】:该论文旨在解决遥感(Remote Sensing, RS)视觉任务中多尺度对象检测和识别所面临的挑战,特别是在资源受限设备上部署时的高计算需求和参数数量增加的问题。现有的双分支或多分支架构虽然能有效处理对象尺度变化,但带来了显著的计算负担,限制了其在资源受限设备上的应用。此外,现有的轻量级骨干网络(lightweight backbone networks)主要针对自然图像设计,难以有效提取多尺度对象的特征,影响了其在遥感视觉任务中的性能。

论文提出的解决方案是引入一种专门为遥感视觉任务设计的轻量级骨干网络LWGANet,其核心创新在于一种新颖的轻量级组注意力(Lightweight Group Attention, LWGA)模块。该模块能够在不增加复杂性和计算开销的情况下,充分利用冗余特征,从局部到全局尺度提取广泛的空间信息,从而实现高效且精确的多尺度特征提取。通过这一设计,LWGANet在保持高性能的同时,显著降低了计算复杂度,适用于资源受限的场景。实验结果表明,LWGANet在多个遥感视觉任务(如场景分类、定向目标检测、语义分割和变化检测)中均达到了最优性能(SOTA),展示了其广泛的适用性和高效性。

链接: https://arxiv.org/abs/2501.10040
作者: Wei Lu,Si-Bao Chen,Chris H. Q. Ding,Jin Tang,Bin Luo
机构: MOE Key Laboratory of ICSP, IMIS Laboratory of Anhui, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Zenmorn-AHU AI Joint Laboratory, School of Computer Science and Technology, Anhui University (安徽大学); School of Data Science (SDS), Chinese University of Hong Kong, Shenzhen (香港中文大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures, Remote sensing

点击查看摘要

Abstract:Remote sensing (RS) visual tasks have gained significant academic and practical importance. However, they encounter numerous challenges that hinder effective feature extraction, including the detection and recognition of multiple objects exhibiting substantial variations in scale within a single image. While prior dual-branch or multi-branch architectural strategies have been effective in managing these object variances, they have concurrently resulted in considerable increases in computational demands and parameter counts. Consequently, these architectures are rendered less viable for deployment on resource-constrained devices. Contemporary lightweight backbone networks, designed primarily for natural images, frequently encounter difficulties in effectively extracting features from multi-scale objects, which compromises their efficacy in RS visual tasks. This article introduces LWGANet, a specialized lightweight backbone network tailored for RS visual tasks, incorporating a novel lightweight group attention (LWGA) module designed to address these specific challenges. The LWGA module, tailored for RS imagery, adeptly harnesses redundant features to extract a wide range of spatial information, from local to global scales, without introducing additional complexity or computational overhead. This facilitates precise feature extraction across multiple scales within an efficient framework. LWGANet was rigorously evaluated across twelve datasets, which span four crucial RS visual tasks: scene classification, oriented object detection, semantic segmentation, and change detection. The results confirm LWGANet’s widespread applicability and its ability to maintain an optimal balance between high performance and low complexity, achieving SOTA results across diverse datasets. LWGANet emerged as a novel solution for resource-limited scenarios requiring robust RS image processing capabilities.
zh

[CV-28] X-Dyna: Expressive Dynamic Human Image Animation

【速读】:该论文旨在解决基于单张人类图像生成动画时动态细节丢失的问题,特别是在面部表情和身体动作的逼真度方面。现有的方法主要依赖于人体姿态控制,但在生成过程中往往无法保留复杂的动态细节,导致动画缺乏真实感。为此,论文提出了X-Dyna,一种基于扩散模型(diffusion-based)的零样本(zero-shot)动画生成框架。其核心解决方案包括两个关键组件:一是Dynamics-Adapter,这是一个轻量级模块,能够将参考外观上下文有效地整合到扩散模型的空间注意力机制中,同时保留运动模块生成流畅且复杂动态细节的能力;二是局部控制模块,用于捕捉与身份解耦的面部表情,从而实现精确的表情迁移,进一步提升动画的真实感。通过结合这些组件,X-Dyna能够从多样化的人类和场景视频中学习物理人体运动和自然场景动态,生成高度逼真且富有表现力的动画。实验结果表明,X-Dyna在定性和定量评估中均优于现有方法。

链接: https://arxiv.org/abs/2501.10021
作者: Di Chang,Hongyi Xu,You Xie,Yipeng Gao,Zhengfei Kuang,Shengqu Cai,Chenxu Zhang,Guoxian Song,Chao Wang,Yichun Shi,Zeyuan Chen,Shijie Zhou,Linjie Luo,Gordon Wetzstein,Mohammad Soleymani
机构: University of Southern California(南加州大学); ByteDance(字节跳动); Stanford University(斯坦福大学); University of California Los Angeles(加州大学洛杉矶分校); University of California San Diego(加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Code: this https URL

点击查看摘要

Abstract:We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key shortcomings causing the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations. The code is available at this https URL.
zh

[CV-29] Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions

【速读】:该论文试图解决在数字角色创作中,2D卡通风格角色生成相对较少受到关注的问题。尽管数字人类技术在逼真数字人类和3D角色方面取得了显著进展,但交互式2D卡通角色的研究相对较少。与需要复杂构建和资源密集型渲染的3D角色不同,2D卡通角色通常使用Live2D格式,该格式通过模拟3D运动来动画化2D角色,而无需构建完整的3D模型,且采用轻量级的HTML5渲染,提高了可访问性和效率。

论文提出的解决方案是Textoon,这是一种基于文本描述生成多样化2D卡通角色的创新方法。Textoon利用先进的自然语言处理(NLP)和计算机视觉(CV)模型来理解文本意图并生成2D外观,能够在一分钟内创建多种令人惊叹且具有交互性的2D角色。该方法的关键在于结合语言和视觉模型,实现了从文本到2D卡通角色的高效生成。

链接: https://arxiv.org/abs/2501.10020
作者: Chao He,Jianqiang Ren,Liefeng Bo
机构: Tongyi Lab(通义实验室), Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The 2D cartoon style is a prominent art form in digital character creation, particularly popular among younger audiences. While advancements in digital human technology have spurred extensive research into photorealistic digital humans and 3D characters, interactive 2D cartoon characters have received comparatively less attention. Unlike 3D counterparts, which require sophisticated construction and resource-intensive rendering, Live2D, a widely-used format for 2D cartoon characters, offers a more efficient alternative, which allows 2D characters to be animated in a manner that simulates 3D movement without the necessity of building a complete 3D model. Furthermore, Live2D employs lightweight HTML5 (H5) rendering, improving both accessibility and efficiency. In this technical report, we introduce Textoon, an innovative method for generating diverse 2D cartoon characters in the Live2D format based on text descriptions. Textoon leverages cutting-edge language and vision models to comprehend textual intentions and generate 2D appearances, and is capable of creating a wide variety of stunning and interactive 2D characters within one minute. The project homepage is this https URL.
zh

[CV-30] DiffuEraser: A Diffusion Model for Video Inpainting

【速读】:该论文旨在解决视频修复(video inpainting)任务中,面对大面积掩码(large masks)时常见的模糊(blurring)和时间不一致性(temporal inconsistencies)问题。现有的方法通常结合基于光流(optical flow)的像素传播和基于视觉Transformer的生成技术,但在处理大面积掩码时效果有限。为此,论文提出了一种基于稳定扩散(stable diffusion)的视频修复模型DiffuEraser,通过引入先验信息(prior information)进行初始化和弱条件约束(weak conditioning),以减少噪声伪影(noisy artifacts)和抑制幻觉(hallucinations)。此外,为了提升长序列推理时的时间一致性,论文扩展了先验模型和DiffuEraser的时间感受野(temporal receptive fields),并利用视频扩散模型(Video Diffusion Models)的时间平滑特性进一步增强一致性。实验结果表明,该方法在内容完整性和时间一致性方面均优于现有技术,同时保持了可接受的效率。

链接: https://arxiv.org/abs/2501.10018
作者: Xiaowen Li,Haolan Xue,Peiran Ren,Liefeng Bo
机构: Tongyi Lab, Alibaba Group(通义实验室, 阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11pages, 13figures

点击查看摘要

Abstract:Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.
zh

[CV-31] Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions ICASSP2025

【速读】:该论文旨在解决当前流行的大型视觉-语言模型(Large Vision-Language Models, LVLMs)在对象属性幻觉(Hallucinations on Object Attributes, HoOA)方面的问题,即模型在输入图像中对细粒度属性的错误判断。为解决这一问题,论文提出了一种新颖的方法,利用从单张图像生成的3D表示中采样的多视角图像作为视觉提示,为LVLMs提供更多视角的视觉信息。关键解决方案包括引入多视角图像增强的视觉-语言模型(Multiview Image Augmented VLM, MIAVLM),并结合多视角属性感知器(Multiview Attributes Perceiver, MAP)子模块,以消除输入图像顺序的影响,并将多视角图像的视觉信息与大型语言模型(Large Language Models, LLMs)对齐。此外,论文还设计了负指令(negative instructions)以减少LVLMs对“是”响应的偏向性。实验结果表明,该方法在缓解HoOA方面具有显著效果。

链接: https://arxiv.org/abs/2501.10011
作者: Zhijie Tan,Yuzhi Li,Shengwei Meng,Xiang Yuan,Weiping Li,Tong Mo,Bingce Wang,Xu Chu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

点击查看摘要

Abstract:Current popular Large Vision-Language Models (LVLMs) are suffering from Hallucinations on Object Attributes (HoOA), leading to incorrect determination of fine-grained attributes in the input images. Leveraging significant advancements in 3D generation from a single image, this paper proposes a novel method to mitigate HoOA in LVLMs. This method utilizes multiview images sampled from generated 3D representations as visual prompts for LVLMs, thereby providing more visual information from other viewpoints. Furthermore, we observe that the input order of multiple multiview images significantly affects the performance of LVLMs. Consequently, we have devised Multiview Image Augmented VLM (MIAVLM), incorporating a Multiview Attributes Perceiver (MAP) submodule capable of simultaneously eliminating the influence of input image order and aligning visual information from multiview images with Large Language Models (LLMs). Besides, we designed and employed negative instructions to mitigate LVLMs’ bias towards “Yes” responses. Comprehensive experiments demonstrate the effectiveness of our method.
zh

[CV-32] Deep Learning for Early Alzheimer Disease Detection with MRI Scans

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)诊断中的准确性和效率问题,特别是在使用脑部MRI数据进行诊断时。研究通过比较现有的深度学习模型,包括卷积神经网络(Convolutional Neural Network, CNN)、贝叶斯卷积神经网络(Bayesian Convolutional Neural Network, BCNN)和U-net模型,来提升AD诊断的精度和效率。研究的关键在于利用Open Access Series of Imaging Studies(OASIS)脑部MRI数据集,并通过解决数据不平衡问题来确保模型的鲁棒性和可靠性。此外,研究通过敏感性、特异性和计算效率等指标对模型进行严格评估,以确定各模型的优缺点。这一比较分析不仅揭示了AI在AD诊断中的潜在革命性作用,还为未来医学影像和神经退行性疾病管理的创新提供了方向。

链接: https://arxiv.org/abs/2501.09999
作者: Mohammad Rafsan,Tamer Oraby,Upal Roy,Sanjeev Kumar,Hansapani Rodrigo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Alzheimer’s Disease is a neurodegenerative condition characterized by dementia and impairment in neurological function. It primarily affects individuals above age 40, impairing their memory, behavior, and cognitive processes. Alzheimer’s disease requires diagnosis through a detailed assessment of MRI scans and neuropsychological tests of the patients. This project compares existing deep learning models in the pursuit of enhancing the accuracy and efficiency of AD diagnosis, specifically focusing on the Convolutional Neural Network, Bayesian Convolutional Neural Network, and the U-net model with the Open Access Series of Imaging Studies brain MRI dataset. Besides, to ensure robustness and reliability in the model evaluations, we address the challenge of data imbalance. We then perform rigorous evaluation to determine the strengths and weaknesses of each model by considering sensitivity, specificity, and computational efficiency. This comparative analysis not only sheds light on the future role of AI in revolutionizing AD diagnostics but also paves the way for future innovation in medical imaging and the management of neurodegenerative diseases.
zh

[CV-33] Multi-Modal Attention Networks for Enhanced Segmentation and Depth Estimation of Subsurface Defects in Pulse Thermography

【速读】:该论文试图解决脉冲热成像(PT)技术在无损检测(NDT)中存在的性能限制问题。具体来说,现有的最先进技术通常独立处理主成分分析(PCA)和热成像信号重建(TSR)这两种模态,导致模型性能受限,因为这两种模态具有互补的语义特征。为解决这一问题,论文提出了PT-Fusion,一种基于多模态注意力机制的融合网络,通过融合PCA和TSR模态来增强对地下缺陷的分割和深度估计。PT-Fusion引入了两个新颖的特征融合模块:编码器注意力融合门(EAFG)和注意力增强解码块(AEDB),以有效融合PCA和TSR特征。此外,论文还提出了一种基于热成像序列随机采样的数据增强技术,以缓解PT数据集稀缺的问题。实验结果表明,PT-Fusion在缺陷分割和深度估计精度上优于现有的U-Net、注意力U-Net和3D-CNN等模型,性能提升幅度达10%。

链接: https://arxiv.org/abs/2501.09994
作者: Mohammed Salah,Naoufel Werghi,Davor Svetinovic,Yusra Abdulrahman
机构: Khalifa University of Science and Technology(哈利法科技大学); Department of Aerospace Engineering, Khalifa University, Abu Dhabi, UAE(哈利法科技大学航空航天工程系); Department of Computer Science, Khalifa University, Abu Dhabi, UAE(哈利法科技大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Pulse thermography, infrared thermography, defect segmentation, multi-modal networks, attention mechanism

点击查看摘要

Abstract:AI-driven pulse thermography (PT) has become a crucial tool in non-destructive testing (NDT), enabling automatic detection of hidden anomalies in various industrial components. Current state-of-the-art techniques feed segmentation and depth estimation networks with PT sequences compressed using either Principal Component Analysis (PCA) or Thermographic Signal Reconstruction (TSR). However, treating these two modalities independently constrains the performance of PT inspection models as these representations possess complementary semantic features. To address this limitation, this work proposes PT-Fusion, a multi-modal attention-based fusion network that fuses both PCA and TSR modalities for defect segmentation and depth estimation of subsurface defects in PT setups. PT-Fusion introduces novel feature fusion modules, Encoder Attention Fusion Gate (EAFG) and Attention Enhanced Decoding Block (AEDB), to fuse PCA and TSR features for enhanced segmentation and depth estimation of subsurface defects. In addition, a novel data augmentation technique is proposed based on random data sampling from thermographic sequences to alleviate the scarcity of PT datasets. The proposed method is benchmarked against state-of-the-art PT inspection models, including U-Net, attention U-Net, and 3D-CNN on the Université Laval IRT-PVC dataset. The results demonstrate that PT-Fusion outperforms the aforementioned models in defect segmentation and depth estimation accuracies with a margin of 10%.
zh
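
摘要提出“基于热成像序列随机采样”的数据增强以缓解 PT 数据稀缺;下面是按该思路的示意实现(采样帧数与增强次数为假设):从完整序列中随机抽取保持时间顺序的帧子集,生成多个增强样本。

```python
import numpy as np

def random_frame_sampling(sequence, n_frames=16, n_augments=4, seed=0):
    """从一条脉冲热成像序列 (T, H, W) 中随机采样若干时间子集,
    生成多个增强样本(按摘要思路的示意,具体采样策略为假设)。"""
    rng = np.random.default_rng(seed)
    T = sequence.shape[0]
    augments = []
    for _ in range(n_augments):
        idx = np.sort(rng.choice(T, size=n_frames, replace=False))  # 保持时间顺序
        augments.append(sequence[idx])
    return augments

seq = np.random.rand(100, 64, 64).astype(np.float32)   # 假设的 PT 序列
for a in random_frame_sampling(seq):
    print(a.shape)   # (16, 64, 64)
```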

[CV-34] Aneumo: A Large-Scale Comprehensive Synthetic Dataset of Aneurysm Hemodynamics

【速读】:该论文旨在解决颅内动脉瘤(Intracranial Aneurysm, IA)病理生理学和血流动力学机制研究中的局限性问题。颅内动脉瘤通常无症状,但一旦破裂可能导致严重的蛛网膜下腔出血(Subarachnoid Hemorrhage, SAH)。尽管临床实践通常基于个体因素和动脉瘤的形态特征,但其病理生理机制仍存在争议。为解决这一问题,研究构建了一个全面的血流动力学数据集,包含466个真实动脉瘤模型和10,000个通过切除和变形操作生成的合成模型,其中包括466个无动脉瘤模型和9,534个变形动脉瘤模型。该数据集还提供了类似医学图像的分割掩码文件,支持深入分析。此外,数据集包含在八个稳态流速(0.001至0.004 kg/s)下测量的血流动力学数据,如流速、压力和壁面剪切应力等关键参数,为研究动脉瘤的发病机制和临床预测提供了宝贵资源。这一数据集将有助于推进对颅内动脉瘤病理特征和血流动力学机制的理解,并支持相关领域的深入研究。

链接: https://arxiv.org/abs/2501.09980
作者: Xigui Li,Yuanye Zhou,Feiyang Xiao,Xin Guo,Yichi Zhang,Chen Jiang,Jianchao Ge,Xiansheng Wang,Qimeng Wang,Taiwei Zhang,Chensen Lin,Yuan Cheng,Yuan Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Intracranial aneurysm (IA) is a common cerebrovascular disease that is usually asymptomatic but may cause severe subarachnoid hemorrhage (SAH) if ruptured. Although clinical practice is usually based on individual factors and morphological features of the aneurysm, its pathophysiology and hemodynamic mechanisms remain controversial. To address the limitations of current research, this study constructed a comprehensive hemodynamic dataset of intracranial aneurysms. The dataset is based on 466 real aneurysm models, and 10,000 synthetic models were generated by resection and deformation operations, including 466 aneurysm-free models and 9,534 deformed aneurysm models. The dataset also provides medical image-like segmentation mask files to support insightful analysis. In addition, the dataset contains hemodynamic data measured at eight steady-state flow rates (0.001 to 0.004 kg/s), including critical parameters such as flow velocity, pressure, and wall shear stress, providing a valuable resource for investigating aneurysm pathogenesis and clinical prediction. This dataset will help advance the understanding of the pathologic features and hemodynamic mechanisms of intracranial aneurysms and support in-depth research in related fields. Dataset hosted at this https URL.
zh

[CV-35] GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor

【速读】:该论文旨在解决在可动画的4D高斯头像(animatable 4D Gaussian avatars)中进行文本驱动编辑时面临的运动遮挡(motion occlusion)和时空不一致性(spatial-temporal inconsistency)问题。与静态3D高斯编辑不同,4D高斯头像的编辑需要处理动态变化中的遮挡和一致性挑战。为此,论文提出了加权阿尔法混合方程(Weighted Alpha Blending Equation, WABE),该方程通过增强可见高斯的混合权重并抑制非可见高斯的影响,有效解决了运动遮挡问题。此外,为了提升编辑质量并确保4D一致性,论文引入了条件对抗学习(conditional adversarial learning)策略,进一步优化编辑结果并保持动画的连贯性。通过这些方法,GaussianAvatar-Editor在可动画的4D高斯编辑中实现了逼真且一致的结果。

链接: https://arxiv.org/abs/2501.09978
作者: Xiangyue Liu,Kunming Luo,Heng Li,Qi Zhang,Yuan Liu,Li Yi,Ping Tan
机构: Hong Kong University of Science and Technology (香港科技大学); Tencent AI Lab (腾讯人工智能实验室); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 3DV 2025. [Project Link](this https URL)

点击查看摘要

Abstract:We introduce GaussianAvatar-Editor, an innovative framework for text-driven editing of animatable Gaussian head avatars that can be fully controlled in expression, pose, and viewpoint. Unlike static 3D Gaussian editing, editing animatable 4D Gaussian avatars presents challenges related to motion occlusion and spatial-temporal inconsistency. To address these issues, we propose the Weighted Alpha Blending Equation (WABE). This function enhances the blending weight of visible Gaussians while suppressing the influence on non-visible Gaussians, effectively handling motion occlusion during editing. Furthermore, to improve editing quality and ensure 4D consistency, we incorporate conditional adversarial learning into the editing process. This strategy helps to refine the edited results and maintain consistency throughout the animation. By integrating these methods, our GaussianAvatar-Editor achieves photorealistic and consistent results in animatable 4D Gaussian editing. We conduct comprehensive experiments across various subjects to validate the effectiveness of our proposed techniques, which demonstrates the superiority of our approach over existing methods. More results and code are available at: [Project Link](this https URL).
zh
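
摘要未给出 WABE 的精确公式;下面以标准的 front-to-back alpha 合成为基础,给出“放大可见高斯的混合权重、抑制不可见高斯”这一思想的示意。可见性权重与放大系数均为假设,并非官方公式。

```python
import torch

def weighted_alpha_blend(colors, alphas, visibility, boost=2.0):
    """示意性的加权 alpha 混合(非官方 WABE 公式):
    colors: (N, 3) 按深度由近到远排序的高斯颜色;alphas: (N,) 不透明度;
    visibility: (N,) 取值 [0, 1],可见高斯的权重被放大、不可见的被抑制。"""
    w = (alphas * visibility * boost).clamp(0.0, 1.0)
    out = torch.zeros(3)
    trans = 1.0                        # 累积透射率
    for c, a in zip(colors, w):        # front-to-back 合成
        out = out + trans * a * c
        trans = trans * (1.0 - a)
    return out

colors = torch.rand(5, 3)
alphas = torch.rand(5) * 0.8
visibility = torch.tensor([1.0, 1.0, 0.2, 0.0, 1.0])   # 假设的可见性
print(weighted_alpha_blend(colors, alphas, visibility))
```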

[CV-36] Explainable artificial intelligence (XAI): from inherent explainability to large language models

【速读】:该论文试图解决人工智能(AI)系统在决策过程中缺乏透明性和可解释性的问题,特别是在医疗和自动驾驶等关键任务领域,这种不透明性阻碍了用户对机器学习系统的信任和实际应用。解决方案的关键在于可解释人工智能(Explainable AI, XAI)技术,这些技术通过增强机器学习模型的可解释性或可解释性,使用户能够理解模型的决策依据,从而避免不良行为。论文详细介绍了从固有可解释模型到现代黑箱模型(如大语言模型和视觉语言模型)的可解释性方法的进展,并探讨了如何利用这些模型来自动化或改进其他机器学习模型的可解释性。此外,论文还通过定性和定量比较,展示了不同方法的优劣,并指出了未来研究的关键挑战和方向。

链接: https://arxiv.org/abs/2501.09967
作者: Fuseini Mumuni,Alhassan Mumuni
机构: University of Mines and Technology (UMaT), Tarkwa, Ghana; Cape Coast Technical University, Cape Coast, Ghana
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) has continued to achieve tremendous success in recent times. However, the decision logic of these frameworks is often not transparent, making it difficult for stakeholders to understand, interpret or explain their behavior. This limitation hinders trust in machine learning systems and causes a general reluctance towards their adoption in practical applications, particularly in mission-critical domains like healthcare and autonomous driving. Explainable AI (XAI) techniques facilitate the explainability or interpretability of machine learning models, enabling users to discern the basis of the decision and possibly avert undesirable behavior. This comprehensive survey details the advancements of explainable AI methods, from inherently interpretable models to modern approaches for achieving interpretability of various black box models, including large language models (LLMs). Additionally, we review explainable AI techniques that leverage LLM and vision-language model (VLM) frameworks to automate or improve the explainability of other machine learning models. The use of LLM and VLM as interpretability methods particularly enables high-level, semantically meaningful explanations of model decisions and behavior. Throughout the paper, we highlight the scientific principles, strengths and weaknesses of state-of-the-art methods and outline different areas of improvement. Where appropriate, we also present qualitative and quantitative comparison results of various methods to show how they compare. Finally, we discuss the key challenges of XAI and directions for future research.
zh

[CV-37] Discrete Prior-based Temporal-coherent Content Prediction for Blind Face Video Restoration

【速读】:该论文致力于解决盲人脸视频恢复(Blind Face Video Restoration)中的关键问题,即从遭受复杂且未知退化的视频中恢复高保真细节。这一任务面临的主要挑战在于处理时间异质性(temporal heterogeneity)的同时保持稳定的面部属性。为解决这一问题,论文提出了一种基于离散先验的时间一致性内容预测变换器(Discrete Prior-based Temporal-Coherent Content Prediction Transformer),简称DP-TempCoh。其关键解决方案包括两个核心模块:1)空间-时间感知内容预测模块(Spatial-Temporal-Aware Content Prediction Module),通过离散视觉先验(discrete visual priors)从退化视频标记中合成高质量内容;2)运动统计调制模块(Motion Statistics Modulation Module),基于离散运动先验(discrete motion priors)调整内容,使其跨帧均值和方差与真实视频的统计特性保持一致,从而增强预测内容的时间一致性。实验结果表明,DP-TempCoh在合成和自然退化视频恢复中均表现出优越性能。

链接: https://arxiv.org/abs/2501.09960
作者: Lianxin Xie,Bingbing Zheng,Wen Xue,Yunfei Zhang,Le Jiang,Ruotao Xu,Si Wu,Hau-San Wong
机构: 1. 未知; 2. 未知; 3. 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Blind face video restoration aims to restore high-fidelity details from videos subjected to complex and unknown degradations. This task poses a significant challenge of managing temporal heterogeneity while at the same time maintaining stable face attributes. In this paper, we introduce a Discrete Prior-based Temporal-Coherent content prediction transformer to address the challenge, and our model is referred to as DP-TempCoh. Specifically, we incorporate a spatial-temporal-aware content prediction module to synthesize high-quality content from discrete visual priors, conditioned on degraded video tokens. To further enhance the temporal coherence of the predicted content, a motion statistics modulation module is designed to adjust the content, based on discrete motion priors in terms of cross-frame mean and variance. As a result, the statistics of the predicted content can match with that of real videos over time. By performing extensive experiments, we verify the effectiveness of the design elements and demonstrate the superior performance of our DP-TempCoh in both synthetically and naturally degraded video restoration.
zh
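
摘要称运动统计调制模块按跨帧均值/方差调整预测内容;下面是 AdaIN 风格的示意:先对帧特征做时间维标准化,再对齐到给定的先验统计量。接口与维度均为假设,并非官方实现。

```python
import torch

def modulate_temporal_stats(feat, target_mean, target_std, eps=1e-5):
    """示意:把特征的跨帧均值/方差对齐到给定的运动先验统计量(AdaIN 风格)。
    feat: (T, C) 每帧特征;target_mean/target_std: (C,)。"""
    mean = feat.mean(dim=0, keepdim=True)
    std = feat.std(dim=0, keepdim=True) + eps
    normalized = (feat - mean) / std            # 时间维标准化
    return normalized * target_std + target_mean

feat = torch.randn(16, 64) * 3 + 1              # 预测内容的帧特征(假设)
out = modulate_temporal_stats(feat, torch.zeros(64), torch.ones(64))
print(out.mean(dim=0).abs().max(), out.std(dim=0).mean())   # 约为 0 与 1
```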

[CV-38] Surface-SOS: Self-Supervised Object Segmentation via Neural Surface Representation

【速读】:该论文旨在解决无标注条件下的自监督物体分割(Self-supervised Object Segmentation, SOS)问题,特别是在多相机输入场景中,如何利用视图间的结构、纹理和几何一致性来实现细粒度的物体分割。论文提出的解决方案关键是一种基于表面表示的自监督物体分割框架(Surface-SOS),该框架通过多视角图像生成三维表面表示,从而为每个视图进行物体分割。具体而言,Surface-SOS采用了一种新颖的场景表示方案,将场景分解为两个互补的神经表示模块,分别使用有符号距离函数(Signed Distance Function, SDF)来建模复杂场景的高质量几何表面。此外,Surface-SOS通过引入粗分割掩码作为额外输入,能够利用多视角无标注图像来优化单视角分割结果。该方法首次利用神经表面表示,摆脱了对大量标注数据和强约束条件的依赖,这些约束通常包括在静态背景下观察目标物体或依赖视频中的时序监督。实验结果表明,Surface-SOS在多个标准基准数据集上均优于基于NeRF的方法,并显著超越了有监督的单视角基线方法。

链接: https://arxiv.org/abs/2501.09947
作者: Xiaoyun Zheng,Liwei Liao,Jianbo Jiao,Feng Gao,Ronggang Wang
机构: IEEE Publication Technology Group (IEEE出版技术组)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TIP

点击查看摘要

Abstract:Self-supervised Object Segmentation (SOS) aims to segment objects without any annotations. Under conditions of multi-camera inputs, the structural, textural and geometrical consistency among each view can be leveraged to achieve fine-grained object segmentation. To make better use of the above information, we propose Surface representation based Self-supervised Object Segmentation (Surface-SOS), a new framework to segment objects for each view by 3D surface representation from multi-view images of a scene. To model high-quality geometry surfaces for complex scenes, we design a novel scene representation scheme, which decomposes the scene into two complementary neural representation modules respectively with a Signed Distance Function (SDF). Moreover, Surface-SOS is able to refine single-view segmentation with multi-view unlabeled images, by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface-SOS is the first self-supervised approach that leverages neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that Surface-SOS always yields finer object masks than its NeRF-based counterparts and surpasses supervised single-view baselines remarkably. Code is available at: this https URL.
zh

[CV-39] A Multi-Scale Feature Extraction and Fusion Deep Learning Method for Classification of Wheat Diseases

【速读】:该论文旨在解决小麦病害的识别和分类问题,特别是针对小麦散黑穗病(wheat loose smut)、叶锈病(leaf rust)以及冠腐病和根腐病(crown and root rot)等病害。这些病害对小麦的生长和产量造成了显著的负面影响。论文提出了一种创新的解决方案,通过整合多尺度特征提取(multi-scale feature extraction)和先进的图像分割技术(image segmentation techniques),结合神经网络模型(如Xception、Inception V3和ResNet 50)以及集成机器学习分类器(如投票法和堆叠法),显著提高了小麦病害分类的准确性。研究结果表明,该方法在分类准确率上达到了99.75%,优于现有的先进方法,其中Xception模型表现最为突出。

链接: https://arxiv.org/abs/2501.09938
作者: Sajjad Saleem,Adil Hussain,Nabila Majeed,Zahid Akhtar,Kamran Siddique
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Wheat is an important source of dietary fiber and protein that is negatively impacted by a number of risks to its growth. The difficulty of identifying and classifying wheat diseases is discussed with an emphasis on wheat loose smut, leaf rust, and crown and root rot. Addressing conditions like crown and root rot, this study introduces an innovative approach that integrates multi-scale feature extraction with advanced image segmentation techniques to enhance classification accuracy. The proposed method trains the neural network models Xception, Inception V3, and ResNet 50 on a large wheat disease classification dataset (2020), in conjunction with an ensemble of machine vision classifiers based on voting and stacking. The study shows that the suggested methodology has a superior accuracy of 99.75% in the classification of wheat diseases when compared to current state-of-the-art approaches. Among the individual deep learning models, Xception showed the highest accuracy.
zh
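
下面给出摘要中“Xception / Inception V3 / ResNet 50 + 投票集成”的一个 soft voting 示意:对各成员模型的 softmax 概率取平均后再取 argmax。Xception 不在 torchvision 中,这里假设使用 timm 的实现;类别划分与输入尺寸亦为示例假设。

```python
import torch
import timm   # 假设使用 timm:Xception 不在 torchvision 中,timm 提供其实现

names = ["xception", "inception_v3", "resnet50"]
# 假设 4 个类别:散黑穗病 / 叶锈病 / 冠腐与根腐病 / 健康(类别划分为示例)
models = [timm.create_model(n, pretrained=False, num_classes=4) for n in names]
for m in models:
    m.eval()

def soft_vote(models, x):
    """soft voting(示意):对各成员模型的 softmax 概率取平均后取 argmax。"""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0).argmax(dim=-1)

x = torch.randn(2, 3, 299, 299)   # Inception/Xception 常用 299x299 输入
print(soft_vote(models, x))
```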

[CV-40] IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment

【速读】:该论文旨在解决文本驱动图像编辑(text-driven image editing)任务中准确评估编辑后图像的挑战。与文本驱动图像生成(text-driven image generation)不同,文本驱动图像编辑同时依赖于文本和源图像,编辑后的图像通常与原始图像保持内在联系,并随文本语义动态变化。然而,现有方法往往仅关注文本-图像对齐(text-image alignment)或未能与人类感知(human perception)保持一致。为此,论文提出了文本驱动图像编辑基准套件(IE-Bench),包括多样化的源图像、编辑提示、不同编辑方法的结果,以及由25名人类受试者提供的3,010个平均意见分数(Mean Opinion Scores, MOS)。此外,论文还引入了IE-QA,一种多模态源感知质量评估方法(multi-modality source-aware quality assessment method),专门用于文本驱动图像编辑。IE-Bench是首个针对文本驱动图像编辑的图像质量评估(IQA)数据集和模型,实验表明IE-QA在主观对齐方面优于现有评估指标。

链接: https://arxiv.org/abs/2501.09927
作者: Shangkun Sun,Bowen Qu,Xiaoyu Liang,Songlin Fan,Wei Gao
机构: Peking University(北京大学); PengCheng Laboratary(鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in text-driven image editing have been significant, yet the task of accurately evaluating these edited images continues to pose a considerable challenge. Different from the assessment of text-driven image generation, text-driven image editing is characterized by simultaneously conditioning on both text and a source image. The edited images often retain an intrinsic connection to the original image, which dynamically changes with the semantics of the text. However, previous methods tend to solely focus on text-image alignment or have not aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database that contains diverse source images, various editing prompts, the corresponding results of different editing methods, and a total of 3,010 Mean Opinion Scores (MOS) provided by 25 human subjects. Furthermore, we introduce IE-QA, a multi-modality source-aware quality assessment method for text-driven image editing. To the best of our knowledge, IE-Bench offers the first IQA dataset and model tailored for text-driven image editing. Extensive experiments demonstrate IE-QA’s superior subjective alignment on the text-driven image editing task compared with previous metrics. We will make all related data and code available to the public.
zh
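
IE-QA 的模型结构未在摘要中给出;下面给出 IQA 评测中衡量客观指标与 MOS“主观对齐”程度的标准做法(SROCC 与 PLCC,scipy 中的真实函数)的计算示意,数据为随机生成的假设数据。

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def subjective_alignment(pred_scores, mos):
    """衡量客观指标与人类 MOS 的一致性:SROCC(秩相关)与 PLCC(线性相关)。"""
    srocc = spearmanr(pred_scores, mos)[0]
    plcc = pearsonr(pred_scores, mos)[0]
    return srocc, plcc

rng = np.random.default_rng(0)
mos = rng.uniform(1, 5, size=100)                # 假设的人类平均意见分数
pred = mos + rng.normal(0, 0.3, size=100)        # 假设的某客观指标输出
print(subjective_alignment(pred, mos))           # 越接近 1 表示主观对齐越好
```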

[CV-41] ForestProtector: An IoT Architecture Integrating Machine Vision and Deep Reinforcement Learning for Efficient Wildfire Monitoring

【速读】:该论文旨在解决森林火灾早期检测的难题,特别是现有基于新技术(如遥感、PTZ摄像头、无人机等)的火灾检测系统成本高昂且需要人工干预,难以实现大面积区域的持续监控。为解决这一问题,论文提出了一种低成本的森林火灾检测系统,其关键在于利用具有计算机视觉能力的中央网关设备,实现对远距离360°视野内的烟雾进行监控。此外,系统通过深度强化学习代理动态控制摄像头方向,并结合分布式物联网设备提供的实时传感器数据(如烟雾浓度、环境温度和湿度),从而在大范围区域内实现自动化的火灾监控,同时减少误报率。

链接: https://arxiv.org/abs/2501.09926
作者: Kenneth Bonilla-Ormachea,Horacio Cuizaga,Edwin Salcedo,Sebastian Castro,Sergio Fernandez-Testa,Misael Mamani
机构: Centro de Investigación, Desarrollo e Innovación en Ingeniería Mecatrónica (机械电子工程研究、开发与创新中心); Universidad Católica Boliviana “San Pablo” (玻利维亚圣保罗天主教大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in the proceedings of the 11th International Conference on Automation, Robotics, and Applications (ICARA 2025)

点击查看摘要

Abstract:Early detection of forest fires is crucial to minimizing the environmental and socioeconomic damage they cause. Indeed, a fire’s duration directly correlates with the difficulty and cost of extinguishing it. For instance, a fire burning for 1 minute might require 1 liter of water to extinguish, while a 2-minute fire could demand 100 liters, and a 10-minute fire might necessitate 1,000 liters. On the other hand, existing fire detection systems based on novel technologies (e.g., remote sensing, PTZ cameras, UAVs) are often expensive and require human intervention, making continuous monitoring of large areas impractical. To address this challenge, this work proposes a low-cost forest fire detection system that utilizes a central gateway device with computer vision capabilities to monitor a 360° field of view for smoke at long distances. A deep reinforcement learning agent enhances surveillance by dynamically controlling the camera’s orientation, leveraging real-time sensor data (smoke levels, ambient temperature, and humidity) from distributed IoT devices. This approach enables automated wildfire monitoring across expansive areas while reducing false positives.
zh

[CV-42] TalkingEyes: Pluralistic Speech-Driven 3D Eye Gaze Animation

【速读】:该论文试图解决语音驱动的3D面部动画中一个被忽视的关键问题:如何从语音生成自然的3D眼动(eye gaze)动画。由于语音与眼动之间的相关性较弱,且缺乏相关的音频-眼动数据,仅从语音生成3D眼动动画具有较大挑战性。论文提出了一种新颖的数据驱动方法,通过构建一个包含约14小时高质量音频-网格序列的数据集,该数据集同时捕捉了眼动、头部运动和面部运动。在此基础上,论文设计了一个语音到运动转换框架,将头部运动和眼动分别建模在两个独立的潜在空间(latent spaces)中,以减弱语音与非语言运动之间弱相关性带来的建模难度。最终,该方法能够从语音中合成眼动、眨眼、头部运动和面部运动,生成多样且自然的3D眼动动画。

链接: https://arxiv.org/abs/2501.09921
作者: Yixiang Zhuang,Chunshan Ma,Yao Cheng,Xuan Cheng,Jing Liao,Juncong Lin
机构: School of Informatics, Xiamen University (厦门大学信息学院); China Mobile (Hangzhou) Information Technology Co., Ltd. (中国移动(杭州)信息技术有限公司); Department of Computer Science, City University of Hong Kong (香港城市大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although significant progress has been made in the field of speech-driven 3D facial animation recently, the speech-driven animation of an indispensable facial component, eye gaze, has been overlooked by recent research. This is primarily due to the weak correlation between speech and eye gaze, as well as the scarcity of audio-gaze data, making it very challenging to generate 3D eye gaze motion from speech alone. In this paper, we propose a novel data-driven method which can generate diverse 3D eye gaze motions in harmony with the speech. To achieve this, we firstly construct an audio-gaze dataset that contains about 14 hours of audio-mesh sequences featuring high-quality eye gaze motion, head motion and facial motion simultaneously. The motion data is acquired by performing lightweight eye gaze fitting and face reconstruction on videos from existing audio-visual datasets. We then tailor a novel speech-to-motion translation framework in which the head motions and eye gaze motions are jointly generated from speech but are modeled in two separate latent spaces. This design stems from the physiological knowledge that the rotation range of eyeballs is less than that of head. Through mapping the speech embedding into the two latent spaces, the difficulty in modeling the weak correlation between speech and non-verbal motion is thus attenuated. Finally, our TalkingEyes, integrated with a speech-driven 3D facial motion generator, can synthesize eye gaze motion, eye blinks, head motion and facial motion collectively from speech. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method in generating diverse and natural 3D eye gaze motions from speech. The project page of this paper is: this https URL
zh

[CV-43] SLIM: Sim-to-Real Legged Instructive Manipulation via Long-Horizon Visuomotor Learning

【速读】:该论文旨在解决低成本四足机器人(quadruped)在长时程(long-horizon)现实任务中的操作问题,特别是在视觉-移动操作(visual-mobile manipulation)任务中的挑战。解决方案的关键包括:1)采用分层设计,将高层策略(high-level policy)用于视觉-移动操作指令的跟随,低层策略(low-level policy)用于四足机器人的运动和肢体控制;2)引入渐进式策略扩展方法(progressive policy expansion approach)结合教师-学生框架(teacher-student framework),以高效训练高层视觉运动策略;3)采用一系列技术来最小化仿真到现实的差距(sim-to-real gaps)。通过这些方法,系统在仿真环境中完全训练后,能够在各种室内外场景和光照条件下实现流畅的仿真到现实迁移,并在长时程移动操作任务中表现出较高的任务成功率和执行效率。

链接: https://arxiv.org/abs/2501.09905
作者: Haichao Zhang,Haonan Yu,Le Zhao,Andrew Choi,Qinxun Bai,Yiqing Yang,Wei Xu
机构: Horizon Robotics
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present a low-cost quadruped manipulation system that solves long-horizon real-world tasks, trained by reinforcement learning purely in simulation. The system comprises 1) a hierarchical design of a high-level policy for visual-mobile manipulation following instructions, and a low-level policy for quadruped movement and limb control, 2) a progressive policy expansion approach for solving the long-horizon task, together with a teacher-student framework for efficient training of the high-level visuomotor policy, and 3) a suite of techniques for minimizing sim-to-real gaps. With budget-friendly hardware of limited reliability and performance, and just one wrist-mounted RGB camera, the entire system, fully trained in simulation, achieves high success rates for long-horizon tasks involving search, move, grasp, and drop-into, with fluid sim-to-real transfer in a wide variety of indoor and outdoor scenes and lighting conditions. Our real-world evaluations show that on the long-horizon mobile manipulation tasks, our system achieves good performance when transferred to the real world, in terms of both task success rate and execution efficiency. Finally, we discuss the necessity of our sim-to-real techniques for legged mobile manipulation, and show their ablation performance.
zh
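以下给出教师-学生策略蒸馏(teacher-student distillation)思想的一个极简示意，帮助理解摘要中高层视觉运动策略的高效训练方式。注意：这只是该类框架的通用草图，函数名、损失形式与超参数均为假设，并非SLIM论文的实际实现。

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) averaged over a batch of states.

    A generic sketch of teacher-student policy distillation; the
    actual SLIM training objective is not reproduced here.
    """
    p_t = softmax(teacher_logits)                    # teacher action distribution
    log_p_s = np.log(softmax(student_logits) + 1e-8)
    log_p_t = np.log(p_t + 1e-8)
    return float(np.mean(np.sum(p_t * (log_p_t - log_p_s), axis=-1)))

# Toy usage: a batch of 4 states, 6 discrete high-level actions.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 6))
student = rng.normal(size=(4, 6))
print("distillation loss:", distillation_loss(student, teacher))
```

实际系统中，教师策略通常可访问仿真中的特权信息(privileged information)，学生策略仅依赖机载传感器输入，蒸馏损失引导学生逼近教师的决策分布。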

[CV-44] FoundationStereo: Zero-Shot Stereo Matching

【速读】:该论文旨在解决深度立体匹配(stereo matching)在零样本泛化(zero-shot generalization)方面的挑战。尽管在基准数据集上通过领域微调(per-domain fine-tuning)取得了显著进展,但在零样本泛化方面仍然存在困难,而零样本泛化是其他计算机视觉任务中基础模型(foundation models)的标志性能力。为此,论文提出了FoundationStereo,一个用于立体深度估计的基础模型,旨在实现强大的零样本泛化能力。

解决方案的关键包括:1)构建了一个大规模(100万对立体图像)且具有高度多样性和真实感的合成训练数据集,并通过自动自筛选管道(automatic self-curation pipeline)去除模糊样本;2)设计了多个网络架构组件以增强可扩展性,包括一个侧调谐特征骨干网络(side-tuning feature backbone),该网络利用视觉基础模型中的单目先验知识(monocular priors)来缓解模拟到现实的差距(sim-to-real gap);3)引入了长程上下文推理(long-range context reasoning)以有效过滤成本体积(cost volume)。这些组件共同作用,使得模型在不同领域表现出强大的鲁棒性和准确性,从而在零样本立体深度估计中确立了新的标准。

链接: https://arxiv.org/abs/2501.09898
作者: Bowen Wen,Matthew Trepte,Joseph Aribido,Jan Kautz,Orazio Gallo,Stan Birchfield
机构: NVIDIA
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation.
zh
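为帮助理解摘要中的成本体积(cost volume)概念，下面给出立体匹配中构建相关性成本体积的通用教科书式示意（numpy实现，变量名与尺寸均为假设，并非FoundationStereo的实际网络组件）：

```python
import numpy as np

def correlation_cost_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) cost volume from (C, H, W) feature maps.

    The entry at disparity d is the channel-wise correlation between the
    left feature and the right feature shifted by d pixels. A generic
    textbook construction, not FoundationStereo's actual module.
    """
    C, H, W = feat_left.shape
    volume = np.zeros((max_disp, H, W), dtype=np.float32)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_left * feat_right).mean(axis=0)
        else:
            volume[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :-d]).mean(axis=0)
    return volume

# Toy usage: pick the disparity with the highest correlation per pixel.
rng = np.random.default_rng(1)
fl, fr = rng.normal(size=(8, 16, 32)), rng.normal(size=(8, 16, 32))
cv = correlation_cost_volume(fl, fr, max_disp=4)
disparity = cv.argmax(axis=0)   # (H, W) winner-take-all disparity map
print(disparity.shape)
```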

[CV-45] FLORA: Formal Language Model Enables Robust Training-free Zero-shot Object Referring Analysis

【速读】:该论文试图解决的是对象指代分析(Object Referring Analysis, ORA)中的零样本学习问题。传统的对象指代分析依赖于大量标注数据和耗时的训练过程,而本文提出了一种无需训练的零样本对象指代分析框架,称为FLORA(Formal Language for Object Referring and Analysis)。FLORA的关键在于利用大语言模型(LLMs)的推理能力,并结合形式语言模型(Formal Language Model, FLM),通过逻辑驱动的描述解析来实现零样本对象指代分析。FLM通过结构化的、基于规则的描述来规范语言,从而在不依赖训练过程的情况下实现对对象描述的有效解释。此外,FLORA还引入了贝叶斯推理框架和现成的解释模型,进一步提升了推理的鲁棒性,并将现有预训练检测器的零样本性能最高提升约45%。实验结果表明,FLORA在多个具有挑战性的数据集上均优于当前最先进的零样本方法,展示了其在检测和分割任务中的优越性。

链接: https://arxiv.org/abs/2501.09887
作者: Zhe Chen,Zijing Chen
机构: Cisco-La Trobe Centre for Artificial Intelligence and Internet of Things (思科-拉筹伯人工智能与物联网中心); Department of Computer Science and Information Technology, La Trobe University (拉筹伯大学计算机科学与信息技术系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object Referring Analysis (ORA), commonly known as referring expression comprehension, requires the identification and localization of specific objects in an image based on natural descriptions. Unlike generic object detection, ORA requires both accurate language understanding and precise visual localization, making it inherently more complex. Although recent pre-trained large visual grounding detectors have achieved significant progress, they heavily rely on extensively labeled data and time-consuming learning. To address these, we introduce a novel, training-free framework for zero-shot ORA, termed FLORA (Formal Language for Object Referring and Analysis). FLORA harnesses the inherent reasoning capabilities of large language models (LLMs) and integrates a formal language model - a logical framework that regulates language within structured, rule-based descriptions - to provide effective zero-shot ORA. More specifically, our formal language model (FLM) enables an effective, logic-driven interpretation of object descriptions without necessitating any training processes. Built upon FLM-regulated LLM outputs, we further devise a Bayesian inference framework and employ appropriate off-the-shelf interpretive models to finalize the reasoning, delivering favorable robustness against LLM hallucinations and compelling ORA performance in a training-free manner. In practice, our FLORA boosts the zero-shot performance of existing pretrained grounding detectors by up to around 45%. Our comprehensive evaluation across different challenging datasets also confirms that FLORA consistently surpasses current state-of-the-art zero-shot methods in both detection and segmentation tasks associated with zero-shot ORA. We believe our probabilistic parsing and reasoning of the LLM outputs elevate the reliability and interpretability of zero-shot ORA. We shall release codes upon publication.
zh
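下面用一个玩具级的规则文法示意"形式语言模型规范对象描述"的基本思想：把指代表达解析为 {attribute, object, relation, anchor} 的结构化形式。该文法、词表与函数名均为假设性示例，远比论文中的FLM简单：

```python
import re

# A hypothetical miniature grammar for referring expressions:
#   [attribute] object [relation anchor]
# Purely illustrative; FLORA's actual formal language model is far richer.
ATTRIBUTES = {"red", "blue", "small", "large"}
RELATIONS = ["right of", "left of", "next to", "under", "on"]  # longest first

def parse_referring_expression(text):
    sentence = text.lower().strip()
    parsed = {"attribute": None, "object": None, "relation": None, "anchor": None}
    for rel in RELATIONS:
        m = re.search(rf"\b{re.escape(rel)}\b", sentence)
        if m:
            parsed["relation"] = rel
            anchor_tokens = [t for t in sentence[m.end():].split() if t != "the"]
            parsed["anchor"] = " ".join(anchor_tokens) or None
            sentence = sentence[:m.start()]
            break
    tokens = [t for t in sentence.split() if t != "the"]
    if tokens and tokens[0] in ATTRIBUTES:
        parsed["attribute"] = tokens[0]
        tokens = tokens[1:]
    parsed["object"] = " ".join(tokens) or None
    return parsed

print(parse_referring_expression("red cup on the table"))
# -> {'attribute': 'red', 'object': 'cup', 'relation': 'on', 'anchor': 'table'}
```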

[CV-46] Semi-Supervised Image-Based Narrative Extraction: A Case Study with Historical Photographic Records ECIR2025

【速读】:该论文旨在解决从历史摄影记录中提取叙事(narrative)的问题,特别是如何从大规模图像集合中生成有意义的历史叙事。解决方案的关键在于将原本用于文本的无监督叙事地图算法(narrative maps algorithm)扩展到图像数据,利用深度学习技术进行视觉特征提取和相似性计算。具体来说,该方法通过动态时间规整算法(Dynamic Time Warping, DTW)将算法提取的视觉叙事与专家策划的时间线进行匹配,并通过专家评估验证其历史准确性和连贯性。研究结果表明,该方法在较长时间线(10张以上图像)上优于随机采样,为历史学家、档案管理员和数字人文研究者提供了新的工具,以探索和理解大规模图像集合中的历史事件。

链接: https://arxiv.org/abs/2501.09884
作者: Fausto German,Brian Keith,Mauricio Matus,Diego Urrutia,Claudio Meneses
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: This paper has been accepted for oral presentation in the findings track of the 47th European Conference on Information Retrieval (ECIR 2025). Source code and experiments are available at this https URL

点击查看摘要

Abstract:This paper presents a semi-supervised approach to extracting narratives from historical photographic records using an adaptation of the narrative maps algorithm. We extend the original unsupervised text-based method to work with image data, leveraging deep learning techniques for visual feature extraction and similarity computation. Our method is applied to the ROGER dataset, a collection of photographs from the 1928 Sacambaya Expedition in Bolivia captured by Robert Gerstmann. We compare our algorithmically extracted visual narratives with expert-curated timelines of varying lengths (5 to 30 images) to evaluate the effectiveness of our approach. In particular, we use the Dynamic Time Warping (DTW) algorithm to match the extracted narratives with the expert-curated baseline. In addition, we asked an expert on the topic to qualitatively evaluate a representative example of the resulting narratives. Our findings show that the narrative maps approach generally outperforms random sampling for longer timelines (10+ images, p < 0.05), with expert evaluation confirming the historical accuracy and coherence of the extracted narratives. This research contributes to the field of computational analysis of visual cultural heritage, offering new tools for historians, archivists, and digital humanities scholars to explore and understand large-scale image collections. The method’s ability to generate meaningful narratives from visual data opens up new possibilities for the study and interpretation of historical events through photographic evidence.
zh
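摘要中使用动态时间规整(DTW)将算法提取的视觉叙事与专家时间线对齐。DTW本身是经典算法，下面给出其numpy实现示意（特征维度与距离度量均为假设，并非论文源码）：

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping between two feature sequences.

    seq_a: (n, d) and seq_b: (m, d) arrays of per-image embeddings.
    Returns the accumulated cosine-distance cost of the best alignment.
    """
    a = seq_a / np.linalg.norm(seq_a, axis=1, keepdims=True)
    b = seq_b / np.linalg.norm(seq_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                       # (n, m) pairwise cosine distances
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[n, m]

rng = np.random.default_rng(2)
extracted = rng.normal(size=(10, 64))   # algorithm-extracted narrative (10 images)
curated = rng.normal(size=(8, 64))      # expert-curated timeline (8 images)
print("DTW cost:", dtw_distance(extracted, curated))
```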

[CV-47] ASTRA: A Scene-aware TRAnsformer-based model for trajectory prediction

【速读】:该论文旨在解决行人轨迹预测问题,特别是在复杂场景中如何准确预测行人未来轨迹。解决方案的关键在于提出了ASTRA模型,该模型通过整合场景上下文(scene context)、空间动态(spatial dynamics)、社交交互(social inter-agent interactions)和时间进程(temporal progressions)来实现精确预测。具体而言,ASTRA采用了基于U-Net的特征提取器来捕捉场景表示,并通过图感知的Transformer编码器(graph-aware transformer encoder)来捕捉社交交互。这些组件被集成以学习代理-场景感知嵌入(agent-scene aware embedding),从而使模型能够学习空间动态并预测行人未来轨迹。此外,ASTRA通过引入条件变分自编码器(CVAE)生成随机预测,并提出了一种简单但有效的加权惩罚损失函数(weighted penalty loss function),以提升预测性能。实验结果表明,ASTRA在ETH-UCY和PIE数据集上分别实现了27%/10%和26%的平均性能提升,且参数数量仅为现有最先进模型的七分之一。

链接: https://arxiv.org/abs/2501.09878
作者: Izzeddin Teeti,Aniket Thomas,Munish Monga,Sachin Kumar,Uddeshya Singh,Andrew Bradley,Biplab Banerjee,Fabio Cuzzolin
机构: Visual Artificial Intelligence Laboratory, Oxford Brookes University (牛津布鲁克斯大学); Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present ASTRA (A Scene-aware TRAnsformer-based model for trajectory prediction), a light-weight pedestrian trajectory forecasting model that integrates the scene context, spatial dynamics, social inter-agent interactions and temporal progressions for precise forecasting. We utilised a U-Net-based feature extractor, via its latent vector representation, to capture scene representations and a graph-aware transformer encoder for capturing social interactions. These components are integrated to learn an agent-scene aware embedding, enabling the model to learn spatial dynamics and forecast the future trajectory of pedestrians. The model is designed to produce both deterministic and stochastic outcomes, with the stochastic predictions being generated by incorporating a Conditional Variational Auto-Encoder (CVAE). ASTRA also proposes a simple yet effective weighted penalty loss function, which helps to yield predictions that outperform a wide array of state-of-the-art deterministic and generative models. ASTRA demonstrates an average improvement of 27%/10% in deterministic/stochastic settings on the ETH-UCY dataset, and 26% improvement on the PIE dataset, respectively, along with seven times fewer parameters than the existing state-of-the-art model (see Figure 1). Additionally, the model’s versatility allows it to generalize across different perspectives, such as Bird’s Eye View (BEV) and Ego-Vehicle View (EVV).
zh
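摘要提到的加权惩罚损失(weighted penalty loss)未给出具体形式。下面是对"按时间步加权的轨迹误差"这一常见思路的假设性草图，仅用于说明概念，权重方案与论文实际设计可能不同：

```python
import numpy as np

def weighted_penalty_loss(pred, target, growth=1.5):
    """A hedged sketch of a time-weighted trajectory loss.

    pred/target: (T, 2) future (x, y) positions. Later timesteps get
    larger weights so long-horizon errors are penalized more. The exact
    form of ASTRA's loss is not reproduced here; this only illustrates
    the general idea of weighting the per-step penalty.
    """
    T = pred.shape[0]
    weights = growth ** np.arange(T)          # hypothetical weighting scheme
    weights = weights / weights.sum()
    step_err = np.linalg.norm(pred - target, axis=1)  # per-step L2 error
    return float(np.sum(weights * step_err))

rng = np.random.default_rng(3)
gt = np.cumsum(rng.normal(size=(12, 2)), axis=0)     # ground-truth trajectory
pred = gt + rng.normal(scale=0.3, size=(12, 2))      # noisy prediction
print("loss:", weighted_penalty_loss(pred, gt))
```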

[CV-48] CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation WACV

【速读】:该论文试图解决多模态地理空间成像(Geospatial imaging)中场景理解的几何解释问题,特别是在缺乏精确地面真实数据的情况下。现有的多模态数据(如EO、SAR和LiDAR)虽然提供了丰富的场景信息,但由于不同模态之间的异质性,准确解释几何结构仍然具有挑战性。为此,论文提出了CrossModalityDiffusion框架,旨在无需先验场景几何知识的情况下,生成跨不同模态和视角的图像。该框架的关键在于使用模态特定的编码器(modality-specific encoders)将多张输入图像转换为几何感知的特征体积(geometry-aware feature volumes),这些特征体积在共享空间中统一不同模态的输入。通过体积渲染技术(volumetric rendering techniques),特征体积被重叠并渲染为从新视角生成的特征图像,这些特征图像随后作为模态特定的扩散模型(diffusion model)的条件输入,从而合成目标模态的新图像。通过联合训练不同模块,该框架确保了跨模态的几何一致性理解,并在ShapeNet cars数据集上验证了其有效性。

链接: https://arxiv.org/abs/2501.09838
作者: Alex Berian,Daniel Brignac,JhihYang Wu,Natnael Daba,Abhijit Mahalanobis
机构: University of Arizona: ECE Dept. (亚利桑那大学: 电气与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Accepted in the 2025 WACV workshop GeoCV

点击查看摘要

Abstract:Geospatial imaging leverages data from diverse sensing modalities-such as EO, SAR, and LiDAR, ranging from ground-level drones to satellite views. These heterogeneous inputs offer significant opportunities for scene understanding but present challenges in interpreting geometry accurately, particularly in the absence of precise ground truth data. To address this, we propose CrossModalityDiffusion, a modular framework designed to generate images across different modalities and viewpoints without prior knowledge of scene geometry. CrossModalityDiffusion employs modality-specific encoders that take multiple input images and produce geometry-aware feature volumes that encode scene structure relative to their input camera positions. The space where the feature volumes are placed acts as a common ground for unifying input modalities. These feature volumes are overlapped and rendered into feature images from novel perspectives using volumetric rendering techniques. The rendered feature images are used as conditioning inputs for a modality-specific diffusion model, enabling the synthesis of novel images for the desired output modality. In this paper, we show that jointly training different modules ensures consistent geometric understanding across all modalities within the framework. We validate CrossModalityDiffusion’s capabilities on the synthetic ShapeNet cars dataset, demonstrating its effectiveness in generating accurate and consistent novel views across multiple imaging modalities and perspectives.
zh

[CV-49] EraseBench: Understanding The Ripple Effects of Concept Erasure Techniques

【速读】:该论文试图解决当前概念擦除技术(concept erasure techniques)在实际应用中的鲁棒性和部署准备度问题。尽管这些技术在受控场景中表现出一定的成功,但其在真实世界应用中的表现仍存在不确定性,尤其是在处理视觉相似、二项式(binomial)和语义相关概念时,容易引发概念纠缠(concept entanglement)现象,导致图像质量下降。为了解决这一问题,论文提出了EraseBENCH,一个多维度的基准测试工具,旨在通过包含100多个多样化概念和1000多个定制提示的数据集,结合全面的评估指标,更深入地评估概念擦除方法的有效性。研究结果表明,即使是当前最先进的技术在擦除后也难以保持图像质量,表明这些方法尚未准备好用于实际部署。这一发现揭示了概念擦除技术在可靠性上的差距。

链接: https://arxiv.org/abs/2501.09833
作者: Ibtihel Amara,Ahmed Imtiaz Humayun,Ivana Kajic,Zarana Parekh,Natalie Harris,Sarah Young,Chirag Nagpal,Najoung Kim,Junfeng He,Cristina Nader Vasconcelos,Deepak Ramachandran,Goolnoosh Farnadi,Katherine Heller,Mohammad Havaei,Negar Rostamzadeh
机构: Google Research; McGill University (麦吉尔大学); Rice University (莱斯大学); Google Deepmind
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages main; 9 pages supplemental material

点击查看摘要

Abstract:Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate success in controlled scenarios, their robustness in real-world applications and readiness for deployment remain uncertain. In this work, we identify a critical gap in evaluating sanitized models, particularly in terms of their performance across various concept dimensions. We systematically investigate the failure modes of current concept erasure techniques, with a focus on visually similar, binomial, and semantically related concepts. We propose that these interconnected relationships give rise to a phenomenon of concept entanglement resulting in ripple effects and degradation in image quality. To facilitate more comprehensive evaluation, we introduce EraseBENCH, a multi-dimensional benchmark designed to assess concept erasure methods with greater depth. Our dataset includes over 100 diverse concepts and more than 1,000 tailored prompts, paired with a comprehensive suite of metrics that together offer a holistic view of erasure efficacy. Our findings reveal that even state-of-the-art techniques struggle with maintaining quality post-erasure, indicating that these approaches are not yet ready for real-world deployment. This highlights the gap in reliability of the concept erasure techniques.
zh

[CV-50] PIXELS: Progressive Image Xemplar-based Editing with Latent Surgery

【速读】:该论文试图解决当前基于语言引导的扩散模型(diffusion models)在图像编辑中面临的挑战,即用户需要通过繁琐的提示工程(prompt engineering)来精确描述所需的修改。现有基于示例(exemplar-based)的编辑方法通常依赖于定制的目标函数进行训练,这不仅需要大量计算资源,还限制了用户对编辑区域的细粒度控制,仅能实现全局的均匀修改。

论文提出的解决方案是PIXELS框架,该框架利用现成的扩散模型,在推理阶段实现渐进式示例驱动的图像编辑。PIXELS的关键在于提供了像素或区域级别的细粒度控制,允许用户从动态数量的参考图像或多模态提示中获取灵感,并逐步整合所有所需的修改,而无需重新训练或微调现有的文本到图像(TTI)模型。这种方法不仅提高了编辑效率和质量,还支持对单个对象的选择性修改和逐步的空间变化,显著提升了定量指标和人类评估结果。

链接: https://arxiv.org/abs/2501.09826
作者: Shristi Das Biswas,Matthew Shreve,Xuelu Li,Prateek Singhal,Kaushik Roy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in language-guided diffusion models for image editing are often bottle-necked by cumbersome prompt engineering to precisely articulate desired changes. An intuitive alternative calls on guidance from in-the-wild image exemplars to help users bring their imagined edits to life. Contemporary exemplar-based editing methods shy away from leveraging the rich latent space learnt by pre-existing large text-to-image (TTI) models and fall back on training with curated objective functions to achieve the task. Though somewhat effective, this demands significant computational resources and lacks compatibility with diverse base models and arbitrary exemplar count. On further investigation, we also find that these techniques restrict user control to only applying uniform global changes over the entire edited region. In this paper, we introduce a novel framework for progressive exemplar-driven editing with off-the-shelf diffusion models, dubbed PIXELS, to enable customization by providing granular control over edits, allowing adjustments at the pixel or region level. Our method operates solely during inference to facilitate imitative editing, enabling users to draw inspiration from a dynamic number of reference images, or multimodal prompts, and progressively incorporate all the desired changes without retraining or fine-tuning existing TTI models. This capability of fine-grained control opens up a range of new possibilities, including selective modification of individual objects and specifying gradual spatial changes. We demonstrate that PIXELS delivers high-quality edits efficiently, leading to a notable improvement in quantitative metrics as well as human evaluation. By making high-quality image editing more accessible, PIXELS has the potential to enable professional-grade edits to a wider audience with the ease of using any open-source image generation model.
zh
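为说明"像素/区域级渐进编辑"的基本机制，下面给出按掩码逐次混合多个编辑结果的通用示意（在图像空间而非潜空间操作，函数与参数均为假设，并非PIXELS的推理流程）：

```python
import numpy as np

def apply_regional_edits(image, edits):
    """Progressively blend a list of (edited_image, mask, strength) into image.

    mask is a float array in [0, 1] of shape (H, W); strength scales the
    blend per edit. Only the generic masked-blending idea behind
    region-level control, not PIXELS's latent-space procedure.
    """
    out = image.astype(np.float32).copy()
    for edited, mask, strength in edits:
        alpha = (strength * mask)[..., None]          # (H, W, 1), broadcast over RGB
        out = alpha * edited.astype(np.float32) + (1.0 - alpha) * out
    return out.clip(0, 255).astype(np.uint8)

rng = np.random.default_rng(4)
base = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
edit = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=np.float32)
mask[16:48, 16:48] = 1.0                              # edit only the central region
result = apply_regional_edits(base, [(edit, mask, 0.8)])
print(result.shape)
```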

[CV-51] Generalized Single-Image-Based Morphing Attack Detection Using Deep Representations from Vision Transformer

【速读】:该论文旨在解决面部融合攻击(Face Morphing Attacks)对应用于边境控制和护照签发等场景的面部识别系统(FRS)带来的严重威胁。为了解决这一问题,论文提出了一种基于单图像的面部融合攻击检测算法(S-MAD),该算法的关键在于利用视觉Transformer(ViT)架构进行编码学习。与传统的卷积神经网络(CNN)架构相比,ViT模型能够更好地整合局部和全局信息,从而有效检测分布在面部区域的融合痕迹。通过在公开的FRGC面部数据集上进行广泛实验,论文验证了所提出的S-MAD方法在跨数据集测试(训练和测试使用不同数据)中的检测性能优于现有方法,并在数据集内测试(训练和测试使用相同数据)中表现出与现有方法相当的性能。

链接: https://arxiv.org/abs/2501.09817
作者: Haoyu Zhang,Raghavendra Ramachandra,Kiran Raja,Christoph Busch
机构: Norwegian University of Science and Technology, Norway(挪威科技大学); Darmstadt University of Applied Sciences, Germany(达姆施塔特应用科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Face morphing attacks have posed severe threats to Face Recognition Systems (FRS), which are operated in border control and passport issuance use cases. Correspondingly, morphing attack detection algorithms (MAD) are needed to defend against such attacks. MAD approaches must be robust enough to handle unknown attacks in an open-set scenario where attacks can originate from various morphing generation algorithms, post-processing and the diversity of printers/scanners. The problem of generalization is further pronounced when the detection has to be made on a single suspected image. In this paper, we propose a generalized single-image-based MAD (S-MAD) algorithm by learning the encoding from Vision Transformer (ViT) architecture. Compared to CNN-based architectures, the ViT model has the advantage of integrating local and global information and is hence well suited to detecting the morphing traces widely distributed across the face region. Extensive experiments are carried out on face morphing datasets generated using publicly available FRGC face datasets. Several state-of-the-art (SOTA) MAD algorithms, including representative ones that have been publicly evaluated, have been selected and benchmarked with our ViT-based approach. Obtained results demonstrate the improved detection performance of the proposed S-MAD method on inter-dataset testing (when different data is used for training and testing) and comparable performance on intra-dataset testing (when the same data is used for training and testing) experimental protocols.
zh

[CV-52] Lossy Compression with Pretrained Diffusion Models

【速读】:该论文旨在解决利用预训练扩散模型(pretrained diffusion models)进行有损图像压缩(lossy image compression)时面临的挑战,特别是反向信道编码(reverse-channel coding)方面的困难。尽管自Ho等人2020年的研究以来,基于扩散模型的有损压缩算法在理论上已被理解,但由于技术障碍,这些算法一直未能完全实现。论文通过引入简单的变通方法,首次实现了DiffC算法的完整实现,能够在10秒内使用Stable Diffusion模型完成图像的压缩和解压缩。该解决方案的关键在于无需额外训练,即可在超低比特率(ultra-low bitrates)下与其他最先进的生成式压缩方法竞争。

链接: https://arxiv.org/abs/2501.09815
作者: Jeremy Vonderfecht,Feng Liu
机构: Portland State University(波特兰州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We apply the DiffC algorithm (Theis et al. 2022) to Stable Diffusion 1.5, 2.1, XL, and Flux-dev, and demonstrate that these pretrained models are remarkably capable lossy image compressors. A principled algorithm for lossy compression using pretrained diffusion models has been understood since at least Ho et al. 2020, but challenges in reverse-channel coding have prevented such algorithms from ever being fully implemented. We introduce simple workarounds that lead to the first complete implementation of DiffC, which is capable of compressing and decompressing images using Stable Diffusion in under 10 seconds. Despite requiring no additional training, our method is competitive with other state-of-the-art generative compression methods at ultra-low bitrates.
zh

[CV-53] SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation

【速读】:该论文试图解决的是在表达性人体姿态和形状估计(Expressive Human Pose and Shape Estimation, EHPS)领域中,现有方法局限于在有限数据集上训练创新架构设计的问题。为了解决这一问题,论文提出了通过数据扩展和模型扩展来构建通用基础模型(generalist foundation models)的方案。关键解决方案包括:1)在数据扩展方面,系统性地研究了40个EHPS数据集,优化了训练方案并选择了能够显著提升EHPS能力的数据集,最终在10M训练实例上实现了收益递减;2)在模型扩展方面,利用视觉Transformer(Vision Transformers, ViT)作为骨干网络,研究了模型规模对EHPS的影响,并基于两种极简架构(SMPLer-X和SMPLest-X)进行实验,排除了算法设计的影响。通过大数据和大模型的结合,基础模型在多个测试基准上表现出色,并展现出良好的迁移能力。此外,通过微调策略,通用模型可以进一步转化为专用模型,实现性能提升。最终,该基础模型在七个基准测试(如AGORA、UBody、EgoBody和SynHand数据集)上均取得了最先进的结果。

链接: https://arxiv.org/abs/2501.09782
作者: Wanqi Yin,Zhongang Cai,Ruisi Wang,Ailing Zeng,Chen Wei,Qingping Sun,Haiyi Mei,Yanjun Wang,Hui En Pang,Mingyuan Zhang,Lei Zhang,Chen Change Loy,Atsushi Yamashita,Lei Yang,Ziwei Liu
机构: Sensetime(商汤科技); Nanyang Technological University(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Robotics (cs.RO)
备注: An extension of SMPLer-X [ arXiv:2309.17448 ]. Homepage: this https URL

点击查看摘要

Abstract:Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: this https URL).
zh

[CV-54] VideoWorld: Exploring Knowledge Learning from Unlabeled Videos KR

【速读】:该论文探讨了深度生成模型是否能够仅通过视觉输入学习复杂知识,这与当前主要关注基于文本的模型(如大语言模型,LLMs)的研究方向形成对比。为了解决这一问题,作者开发了VideoWorld,一个基于自回归视频生成模型,该模型在未标注的视频数据上进行训练,并在视频围棋和机器人控制任务中测试其知识获取能力。研究的关键解决方案包括:(1)仅通过视频训练即可提供足够的信息来学习知识,包括规则、推理和规划能力;(2)视觉变化的表示对于知识获取至关重要。为了提高这一过程的效率和效果,作者引入了潜在动态模型(Latent Dynamics Model, LDM)作为VideoWorld的关键组件。实验结果表明,VideoWorld在仅使用3亿参数的模型下,无需依赖强化学习中典型的搜索算法或奖励机制,即可在Video-GoBench中达到5段专业水平,并在机器人任务中有效学习多种控制操作并在不同环境中泛化,接近CALVIN和RLBench中oracle模型的性能。

链接: https://arxiv.org/abs/2501.09781
作者: Zhongwei Ren,Yunchao Wei,Xun Guo,Yao Zhao,Bingyi Kang,Jiashi Feng,Xiaojie Jin
机构: Beijing Jiaotong University (北京交通大学); University of Science and Technology of China (中国科学技术大学); ByteDance Seed (字节跳动种子)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and models are released at: this https URL

点击查看摘要

Abstract:This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.
zh

[CV-55] Robust Egoistic Rigid Body Localization

【速读】:该论文旨在解决一种鲁棒且自依赖的刚体定位(Rigid Body Localization, RBL)问题,其中主要刚体需要在没有外部基础设施支持、没有目标形状先验知识的情况下,估计另一个刚体(即“目标”)相对于自身的位姿(即位置和方向),并考虑到观测数据可能不完整的情况。论文提出了三种互补的解决方案:首先,提出了一种估计两个刚体中心点之间平移向量的方法,该方法不要求两个物体具有相同的形状或相同数量的标志点,且在完全信息条件下显著优于现有技术,但对数据缺失较为敏感。其次,提出了一种在信息不完整情况下性能更优的鲁棒替代方案,尽管在完全信息条件下会有轻微的性能损失。最后,提出了一种估计目标刚体相对于主要刚体的旋转矩阵的方案。通过对比现有技术,所提出的方法在完全信息和不完整条件下的均方根误差(RMSE)性能上表现出显著优势。

链接: https://arxiv.org/abs/2501.10219
作者: Niclas Führling,Giuseppe Thadeu Freitas de Abreu,David González G.,Osvaldo Gonsa
机构: Constructor University (康斯特鲁克特大学); Continental AG (大陆集团)
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We consider a robust and self-reliant (or “egoistic”) variation of the rigid body localization (RBL) problem, in which a primary rigid body seeks to estimate the pose (i.e., location and orientation) of another rigid body (or “target”), relative to its own, without the assistance of external infrastructure, without prior knowledge of the shape of the target, and taking into account the possibility that the available observations are incomplete. Three complementary contributions are then offered for such a scenario. The first is a method to estimate the translation vector between the center point of both rigid bodies, which unlike existing techniques does not require that both objects have the same shape or even the same number of landmark points. This technique is shown to significantly outperform the state-of-the-art (SotA) under complete information, but to be sensitive to data erasures, even when enhanced by matrix completion methods. The second contribution, designed to offer improved performance in the presence of incomplete information, offers a robust alternative to the latter, at the expense of a slight relative loss under complete information. Finally, the third contribution is a scheme for the estimation of the rotation matrix describing the relative orientation of the target rigid body with respect to the primary. Comparisons of the proposed schemes and SotA techniques demonstrate the advantage of the contributed methods in terms of root mean square error (RMSE) performance under fully complete information and incomplete conditions.
zh
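摘要中质心间平移向量的估计不要求两刚体标志点数目相同；旋转估计在"点对应已知且数目相同"的教科书设定下常用Kabsch/SVD方法。以下为这两步的通用numpy示意（非论文提出的鲁棒方案，也未处理数据缺失）：

```python
import numpy as np

def estimate_translation(landmarks_primary, landmarks_target):
    # Centroid difference; works even if the two bodies have
    # different numbers of landmark points.
    return landmarks_target.mean(axis=0) - landmarks_primary.mean(axis=0)

def estimate_rotation(P, Q):
    """Classic Kabsch/SVD rotation between matched point sets (n, 3).

    Assumes known one-to-one correspondences and equal point counts,
    i.e., the textbook setting rather than the paper's robust scheme.
    """
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # correct for reflections
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T

# Toy check: rotate and translate a body by known amounts, then recover R.
rng = np.random.default_rng(5)
P = rng.normal(size=(6, 3))
angle = np.pi / 5
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])
print(np.allclose(estimate_rotation(P, Q), R_true))
```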

[CV-56] FECT: Classification of Breast Cancer Pathological Images Based on Fusion Features

【速读】:该论文旨在解决乳腺癌组织病理图像自动分类中的性能瓶颈问题。现有方法通常依赖于单一细胞或组织特征,缺乏对难以分类类别的形态学特征的设计考虑,导致分类性能不佳。为解决这些问题,论文提出了一种新颖的乳腺癌组织分类模型,即融合边缘、细胞和组织特征的FECT模型。该模型采用ResMTUNet和基于注意力的聚合器来提取和聚合这些特征。通过在BRACS数据集上的广泛测试,该模型在分类准确率和F1分数上均超越了当前先进方法。此外,由于其特征融合方式与病理学家的诊断方法一致,该模型具有较高的可解释性,并有望在未来临床应用中发挥重要作用。

链接: https://arxiv.org/abs/2501.10128
作者: Jiacheng Hao,Yiqing Liu,Siqi Zeng,Yonghong He
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Breast cancer is one of the most common cancers among women globally, with early diagnosis and precise classification being crucial. With the advancement of deep learning and computer vision, the automatic classification of breast tissue pathological images has emerged as a research focus. Existing methods typically rely on singular cell or tissue features and lack design considerations for morphological characteristics of challenging-to-classify categories, resulting in suboptimal classification performance. To address these problems, we propose a novel breast cancer tissue classification model that fuses features of Edges, Cells, and Tissues (FECT), employing the ResMTUNet and an attention-based aggregator to extract and aggregate these features. Extensive testing on the BRACS dataset demonstrates that our model surpasses current advanced methods in terms of classification accuracy and F1 scores. Moreover, due to its feature fusion that aligns with the diagnostic approach of pathologists, our model exhibits interpretability and holds promise for significant roles in future clinical applications.
zh

[CV-57] Physics-informed DeepCT: Sinogram Wavelet Decomposition Meets Masked Diffusion

【速读】:该论文试图解决稀疏视角计算机断层扫描(SVCT)重建中,由于训练样本空间有限导致的模型泛化能力受限问题。具体表现为模型在面对不熟悉数据时,生成图像细节模糊且区域间不一致。为解决这一问题,论文提出了一种基于正弦图的随机小波分解和随机掩码扩散模型(SWARM)。其关键解决方案包括:1)在正弦图中引入随机掩码策略,有效扩展了有限的训练样本空间,使模型能够学习更广泛的数据分布,增强对数据不确定性的理解和泛化能力;2)对正弦图小波的高频分量应用随机训练策略,增强了特征表示能力,提升了模型在不同频段细节捕捉上的表现。此外,采用两阶段迭代重建方法,确保重建图像的全局一致性并优化细节。实验结果表明,SWARM在多个数据集上的定量和定性性能均优于现有方法。

链接: https://arxiv.org/abs/2501.09935
作者: Zekun Zhou,Tan Liu,Bing Yu,Yanru Gong,Liu Shi,Qiegen Liu
机构: School of Mathematics and Computer Sciences, Nanchang University, Nanchang, China(南昌大学数学与计算机科学学院); School of Information Engineering, Nanchang University, Nanchang, China(南昌大学信息工程学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Diffusion models show remarkable potential on sparse-view computed tomography (SVCT) reconstruction. However, when a network is trained on a limited sample space, its generalization capability may be constrained, which degrades performance on unfamiliar data. For image generation tasks, this can lead to issues such as blurry details and inconsistencies between regions. To alleviate this problem, we propose a Sinogram-based Wavelet random decomposition And Random mask diffusion Model (SWARM) for SVCT reconstruction. Specifically, introducing a random mask strategy in the sinogram effectively expands the limited training sample space. This enables the model to learn a broader range of data distributions, enhancing its understanding and generalization of data uncertainty. In addition, applying a random training strategy to the high-frequency components of the sinogram wavelet enhances feature representation and improves the ability to capture details in different frequency bands, thereby improving performance and robustness. A two-stage iterative reconstruction method is adopted to ensure the global consistency of the reconstructed image while refining its details. Experimental results demonstrate that SWARM outperforms competing approaches in both quantitative and qualitative performance across various datasets.
zh
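下面示意"正弦图随机掩码 + 小波高频分量随机扰动"这一数据增广思路的通用做法（假设已安装 PyWavelets；掩码比例、小波基与噪声尺度均为示例值，并非SWARM的实际训练配置）：

```python
import numpy as np
import pywt   # PyWavelets

def random_mask_sinogram(sino, keep_ratio=0.8, rng=None):
    # Randomly zero out a fraction of sinogram entries to expand
    # the effective training distribution (generic sketch of the idea).
    rng = rng or np.random.default_rng()
    mask = rng.random(sino.shape) < keep_ratio
    return sino * mask

def perturb_highfreq(sino, wavelet="db4", level=2, noise_scale=0.1, rng=None):
    """Wavelet-decompose a sinogram and randomly perturb the high-frequency
    subbands, keeping the low-frequency approximation intact. Mirrors the
    general 'random training on high-frequency wavelet components' idea,
    not SWARM's exact procedure."""
    rng = rng or np.random.default_rng()
    coeffs = pywt.wavedec2(sino, wavelet, level=level)
    new_coeffs = [coeffs[0]]                      # low-frequency approximation
    for (cH, cV, cD) in coeffs[1:]:               # detail subbands per level
        new_coeffs.append(tuple(c + noise_scale * rng.normal(size=c.shape)
                                for c in (cH, cV, cD)))
    return pywt.waverec2(new_coeffs, wavelet)

rng = np.random.default_rng(6)
sinogram = rng.random((180, 128))                 # (projection angles, detector bins)
augmented = perturb_highfreq(random_mask_sinogram(sinogram, rng=rng), rng=rng)
print(augmented.shape)
```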

[CV-58] Detection of Vascular Leukoencephalopathy in CT Images

【速读】:该论文旨在解决利用人工智能(AI)技术诊断脑白质病(leukoencephalopathy)的问题,这是一种脑小血管疾病,是血管性痴呆和出血性中风的主要原因。研究的关键解决方案是通过卷积神经网络(CNNs)对脑部轴向CT扫描图像进行二分类疾病诊断。研究团队使用了约1200名患者的CT扫描数据集,并通过数据预处理方法(如统一图像尺寸)来提高模型准确性。研究比较了四种神经网络架构:ResNet50、ResNet50 3D、ConvNext和Densenet,其中ConvNext模型在没有预处理的情况下达到了98.5%的最高准确率,优于使用3D卷积的模型。此外,研究还通过Grad-CAM热图技术分析了模型的决策过程,揭示了模型在扫描图像中的关注区域。结果表明,AI,特别是ConvNext架构,能够显著提高脑白质病的诊断准确性,展示了CNNs在医学影像应用中的有效性。

链接: https://arxiv.org/abs/2501.09863
作者: Z. Cernekova,V. Sisik,F. Jafari
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has seen a significant surge in popularity, particularly in its application to medicine. This study explores AI’s role in diagnosing leukoencephalopathy, a small vessel disease of the brain, and a leading cause of vascular dementia and hemorrhagic strokes. We utilized a dataset of approximately 1200 patients with axial brain CT scans to train convolutional neural networks (CNNs) for binary disease classification. Addressing the challenge of varying scan dimensions due to different patient physiologies, we processed the data to a uniform size and applied three preprocessing methods to improve model accuracy. We compared four neural network architectures: ResNet50, ResNet50 3D, ConvNext, and Densenet. The ConvNext model achieved the highest accuracy of 98.5% without any preprocessing, outperforming models with 3D convolutions. To gain insights into model decision-making, we implemented Grad-CAM heatmaps, which highlighted the focus areas of the models on the scans. Our results demonstrate that AI, particularly the ConvNext architecture, can significantly enhance diagnostic accuracy for leukoencephalopathy. This study underscores AI’s potential in advancing diagnostic methodologies for brain diseases and highlights the effectiveness of CNNs in medical imaging applications.
zh

人工智能

[AI-0] Agent 4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems AAAI2025

链接: https://arxiv.org/abs/2501.10332
作者: Weibo Gao,Qi Liu,Linan Yue,Fangzhou Yao,Rui Lv,Zheng Zhang,Hao Wang,Zhenya Huang
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Accepted by AAAI2025

点击查看摘要

Abstract:Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners’ practice efficiency. However, the discrepancy between offline metrics and online performance significantly impedes their progress. To address this challenge, we introduce Agent4Edu, a novel personalized learning simulator leveraging recent advancements in human intelligence through large language models (LLMs). Agent4Edu features LLM-powered generative agents equipped with learner profile, memory, and action modules tailored to personalized learning algorithms. The learner profiles are initialized using real-world response data, capturing practice styles and cognitive factors. Inspired by human psychology theory, the memory module records practice facts and high-level summaries, integrating reflection mechanisms. The action module supports various behaviors, including exercise understanding, analysis, and response generation. Each agent can interact with personalized learning algorithms, such as computerized adaptive testing, enabling a multifaceted evaluation and enhancement of customized services. Through a comprehensive assessment, we explore the strengths and weaknesses of Agent4Edu, emphasizing the consistency and discrepancies in responses between agents and human learners. The code, data, and appendix are publicly available at this https URL.

[AI-1] An Ontology for Social Determinants of Education (SDoEd) based on Human-AI Collaborative Approach

链接: https://arxiv.org/abs/2501.10300
作者: Navya Martin Kollapally,James Geller,Patricia Morreale,Daehan Kwak
类目: Artificial Intelligence (cs.AI)
*备注: Accepted in CONSORTIUM FOR COMPUTING SCIENCES IN COLLEGES

点击查看摘要

Abstract:The use of computational ontologies is well-established in the field of Medical Informatics. The topic of Social Determinants of Health (SDoH) has also received extensive attention. Work at the intersection of ontologies and SDoH has been published. However, a standardized framework for Social Determinants of Education (SDoEd) is lacking. In this paper, we are closing the gap by introducing an SDoEd ontology for creating a precise conceptualization of the interplay between life circumstances of students and their possible educational achievements. The ontology was developed utilizing suggestions from ChatGPT-3.5-010422 and validated using peer-reviewed research articles. The first version of the developed ontology was evaluated by human experts in the field of education and validated using standard ontology evaluation software. This version of the SDoEd ontology contains 231 domain concepts, 10 object properties, and 24 data properties.

[AI-2] SEANN: A Domain-Informed Neural Network for Epidemiological Insights

链接: https://arxiv.org/abs/2501.10273
作者: Jean-Baptiste Guimbaud,Marc Plantevit,Léa Maître,Rémy Cazabet
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In epidemiology, traditional statistical methods such as logistic regression, linear regression, and other parametric models are commonly employed to investigate associations between predictors and health outcomes. However, non-parametric machine learning techniques, such as deep neural networks (DNNs), coupled with explainable AI (XAI) tools, offer new opportunities for this task. Despite their potential, these methods face challenges due to the limited availability of high-quality, high-quantity data in this field. To address these challenges, we introduce SEANN, a novel approach for informed DNNs that leverages a prevalent form of domain-specific knowledge: Pooled Effect Sizes (PES). PESs are commonly found in published Meta-Analysis studies, in different forms, and represent a quantitative form of a scientific consensus. By direct integration within the learning procedure using a custom loss, we experimentally demonstrate significant improvements in the generalizability of predictive performances and the scientific plausibility of extracted relationships compared to a domain-knowledge agnostic neural network in a scarce and noisy data setting.

[AI-3] Random-Key Algorithms for Optimizing Integrated Operating Room Scheduling

链接: https://arxiv.org/abs/2501.10243
作者: Bruno Salezze Vieira,Eduardo Machado Silva,Antonio Augusto Chaves
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Combinatorics (math.CO)
*备注: 38 pages, Preprint submitted to Applied Soft Computing

点击查看摘要

Abstract:Efficient surgery room scheduling is essential for hospital efficiency, patient satisfaction, and resource utilization. This study addresses this challenge by introducing a novel concept of Random-Key Optimizer (RKO), rigorously tested on literature and new, real-world inspired instances. Our combinatorial optimization problem incorporates multi-room scheduling, equipment scheduling, and complex availability constraints for rooms, patients, and surgeons, facilitating rescheduling and enhancing operational flexibility. The RKO approach represents solutions as points in a continuous space, which are then mapped to the problem solution space via a deterministic function known as a decoder. The core idea is to operate metaheuristics and heuristics in the random-key space, unaware of the original solution space. We design the Biased Random-Key Genetic Algorithm with Q-Learning, Simulated Annealing, and Iterated Local Search for use within an RKO framework, employing a single decoder function. The proposed metaheuristics are complemented by lower-bound formulations, providing optimal gaps for evaluating the effectiveness of the heuristic results. Our results demonstrate significant lower and upper bounds improvements for the literature instances, notably proving one optimal result. Furthermore, the best-proposed metaheuristic efficiently generates schedules for the newly introduced instances, even in highly constrained scenarios. This research offers valuable insights and practical solutions for improving surgery scheduling processes, offering tangible benefits to hospitals by optimising resource allocation, reducing patient wait times, and enhancing overall operational efficiency.
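随机键优化(RKO)的核心是解码器：元启发式只在 [0,1)^n 连续空间中搜索，由解码器把随机键向量映射为可行解。下面给出手术排程场景下的一个极简假设性解码器，其中 argsort 给出手术顺序，随后贪心分配到最早空闲的手术室（忽略论文中的设备与可用性约束）：

```python
import numpy as np

def decode(random_key, durations, n_rooms):
    """Decode a random-key vector in [0, 1)^n into a surgery schedule.

    Sorting the keys gives a surgery priority order; each surgery is
    then greedily placed in the earliest-available room. A minimal
    illustration of the random-key idea, far simpler than the paper's
    full problem with equipment and availability constraints.
    """
    order = np.argsort(random_key)                # key order = sequencing decision
    room_free_at = np.zeros(n_rooms)
    schedule = []
    for job in order:
        room = int(np.argmin(room_free_at))       # earliest-available room
        start = room_free_at[room]
        room_free_at[room] += durations[job]
        schedule.append((int(job), room, float(start)))
    return schedule, room_free_at.max()           # makespan as the objective

rng = np.random.default_rng(7)
durations = rng.integers(30, 180, size=8)         # surgery durations in minutes
best = min((decode(rng.random(8), durations, n_rooms=3) for _ in range(1000)),
           key=lambda s: s[1])                    # naive random search in key space
print("best makespan:", best[1])
```

把 rng.random(8) 换成任意元启发式（遗传算法、模拟退火、迭代局部搜索等）产生的随机键向量，即可在不改动解码器的前提下复用同一求解框架，这正是RKO"单一解码函数"设计的好处。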

[AI-4] Challenges and recommendations for Electronic Health Records data extraction and preparation for dynamic prediction modelling in hospitalized patients – a practical guide

链接: https://arxiv.org/abs/2501.10240
作者: Elena Albu,Shan Gao,Pieter Stijnen,Frank E. Rademakers,Bas C T van Bussel,Taya Collyer,Tina Hernandez-Boussard,Laure Wynants,Ben Van Calster
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Dynamic predictive modeling using electronic health record (EHR) data has gained significant attention in recent years. The reliability and trustworthiness of such models depend heavily on the quality of the underlying data, which is largely determined by the stages preceding the model development: data extraction from EHR systems and data preparation. We list over forty challenges encountered during these stages and provide actionable recommendations for addressing them. These challenges are organized into four categories: cohort definition, outcome definition, feature engineering, and data cleaning. This list is designed to serve as a practical guide for data extraction engineers and researchers, supporting better practices and improving the quality and real-world applicability of dynamic prediction models in clinical settings.

[AI-5] Temporal Causal Reasoning with (Non-Recursive) Structural Equation Models

链接: https://arxiv.org/abs/2501.10190
作者: Maksim Gladyshev,Natasha Alechina,Mehdi Dastani,Dragan Doder,Brian Logan
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Structural Equation Models (SEM) are the standard approach to representing causal dependencies between variables in causal models. In this paper we propose a new interpretation of SEMs when reasoning about Actual Causality, in which SEMs are viewed as mechanisms transforming the dynamics of exogenous variables into the dynamics of endogenous variables. This allows us to combine counterfactual causal reasoning with existing temporal logic formalisms, and to introduce a temporal logic, CPLTL, for causal reasoning about such structures. We show that the standard restriction to so-called recursive models (with no cycles in the dependency graph) is not necessary in our approach, allowing us to reason about mutually dependent processes and feedback loops. Finally, we introduce new notions of model equivalence for temporal causal models, and show that CPLTL has an efficient model-checking procedure.

[AI-6] Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure?

链接: https://arxiv.org/abs/2501.10187
作者: Burcu Canakci,Junyi Liu,Xingbo Wu,Nathanaël Cheriere,Paolo Costa,Sergey Legtchenko,Dushyanth Narayanan,Ant Rowstron
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 5+ pages, 4 figures

点击查看摘要

Abstract:To match the blooming demand of generative AI workloads, GPU designers have so far been trying to pack more and more compute and memory into single complex and expensive packages. However, there is growing uncertainty about the scalability of individual GPUs and thus AI clusters, as state-of-the-art GPUs are already displaying packaging, yield, and cooling limitations. We propose to rethink the design and scaling of AI clusters through efficiently-connected large clusters of Lite-GPUs, GPUs with single, small dies and a fraction of the capabilities of larger GPUs. We think recent advances in co-packaged optics can be key in overcoming the communication challenges of distributing AI workloads onto more Lite-GPUs. In this paper, we present the key benefits of Lite-GPUs on manufacturing cost, blast radius, yield, and power efficiency; and discuss systems opportunities and challenges around resource, workload, memory, and network management.

[AI-7] Generative Artificial Intelligence: Implications for Biomedical and Health Professions Education

链接: https://arxiv.org/abs/2501.10186
作者: William Hersh
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative AI has had a profound impact on biomedicine and health, both in professional work and in education. Based on large language models (LLMs), generative AI has been found to perform as well as humans in simulated situations taking medical board exams, answering clinical questions, solving clinical cases, applying clinical reasoning, and summarizing information. Generative AI is also being used widely in education, performing well in academic courses and their assessments. This review summarizes the successes of LLMs and highlights some of their challenges in the context of education, most notably aspects that may undermine the acquisition of knowledge and skills for professional work. It then provides recommendations for best practices to overcome shortcomings of LLM use in education. Although there are challenges for the use of generative AI in education, all students and faculty, in biomedicine and health and beyond, must have an understanding of it and be competent in its use.

[AI-8] CSSDM Ontology to Enable Continuity of Care Data Interoperability

链接: https://arxiv.org/abs/2501.10160
作者: Subhashis Das,Debashis Naskar,Sara Rodriguez Gonzalez,Pamela Hussey
类目: Artificial Intelligence (cs.AI)
*备注: 6 pages, 5 figures, Published in: 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

点击查看摘要

Abstract:The rapid advancement of digital technologies and recent global pandemic scenarios have led to a growing focus on how these technologies can enhance healthcare service delivery and workflow to address crises. Action plans that consolidate existing digital transformation programs are being reviewed to establish core infrastructure and foundations for sustainable healthcare solutions. Reforming health and social care to personalize home care, for example, can help avoid treatment in overcrowded acute hospital settings and improve the experiences and outcomes for both healthcare professionals and service users. In this information-intensive domain, addressing the interoperability challenge through standards-based roadmaps is crucial for enabling effective connections between health and social care services. This approach facilitates safe and trustworthy data workflows between different healthcare system providers. In this paper, we present a methodology for extracting, transforming, and loading data through a semi-automated process using a Common Semantic Standardized Data Model (CSSDM) to create a personalized healthcare knowledge graph (KG). The CSSDM is grounded in the formal ontology of ISO 13940 ContSys and incorporates FHIR-based specifications to support structural attributes for generating KGs. We propose that the CSSDM facilitates data harmonization and linking, offering an alternative approach to interoperability. This approach promotes a novel form of collaboration between companies developing health information systems and cloud-enabled health services. Consequently, it provides multiple stakeholders with access to high-quality data and information sharing.

[AI-9] Region-wise stacking ensembles for estimating brain-age using MRI

链接: https://arxiv.org/abs/2501.10153
作者: Georgios Antonopoulos,Shammi More,Simon B. Eickhoff,Federico Raimondo,Kaustubh R. Patil
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: version1

点击查看摘要

Abstract:Predictive modeling using structural magnetic resonance imaging (MRI) data is a prominent approach to study brain-aging. Machine learning algorithms and feature extraction methods have been employed to improve predictions and explore healthy and accelerated aging e.g. neurodegenerative and psychiatric disorders. The high-dimensional MRI data pose challenges to building generalizable and interpretable models as well as for data privacy. Common practices are resampling or averaging voxels within predefined parcels, which reduces anatomical specificity and biological interpretability as voxels within a region may differently relate to aging. Effectively, naive fusion by averaging can result in information loss and reduced accuracy. We present a conceptually novel two-level stacking ensemble (SE) approach. The first level comprises regional models for predicting individuals’ age based on voxel-wise information, fused by a second-level model yielding final predictions. Eight data fusion scenarios were explored using as input Gray matter volume (GMV) estimates from four datasets covering the adult lifespan. Performance, measured using mean absolute error (MAE), R2, correlation and prediction bias, showed that SE outperformed the region-wise averages. The best performance was obtained when first-level regional predictions were obtained as out-of-sample predictions on the application site with second-level models trained on independent and site-specific data (MAE=4.75 vs baseline regional mean GMV MAE=5.68). Performance improved as more datasets were used for training. First-level predictions showed improved and more robust aging signal providing new biological insights and enhanced data privacy. Overall, the SE improves accuracy compared to the baseline while preserving or enhancing data privacy.
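下面用模拟数据示意两级堆叠集成(stacking ensemble)的骨架：第一级对每个脑区分别训练回归模型，并用交叉验证产生袋外(out-of-sample)预测；第二级模型融合各区域的年龄预测。数据生成方式、模型选择与规模均为假设，仅演示结构，并非论文的完整流程：

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(8)
n_subjects, n_regions, voxels_per_region = 200, 5, 50
age = rng.uniform(20, 80, size=n_subjects)
# Simulated voxel-wise GMV per region, weakly related to age.
regions = [age[:, None] * rng.normal(0.02, 0.01, size=voxels_per_region)
           + rng.normal(size=(n_subjects, voxels_per_region))
           for _ in range(n_regions)]

# Level 1: one regional model per region; out-of-sample predictions
# via cross-validation so level 2 is not fit on leaked training labels.
level1_preds = np.column_stack([
    cross_val_predict(Ridge(alpha=1.0), X, age, cv=5) for X in regions
])

# Level 2: a meta-model fuses the regional age predictions.
meta = LinearRegression().fit(level1_preds, age)
fused = meta.predict(level1_preds)   # in-sample for brevity; use held-out data in practice
print("regional-mean MAE:", np.mean(np.abs(level1_preds.mean(axis=1) - age)).round(2))
print("stacked MAE      :", np.mean(np.abs(fused - age)).round(2))
```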

[AI-10] opology-Driven Attribute Recovery for Attribute Missing Graph Learning in Social Internet of Things

链接: https://arxiv.org/abs/2501.10151
作者: Mengran Li,Junzhou Chen,Chenyun Yu,Guanying Jiang,Ronghui Zhang,Yanming Shen,Houbing Herbert Song
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by IEEE Internet of Things Journal

点击查看摘要

Abstract:With the advancement of information technology, the Social Internet of Things (SIoT) has fostered the integration of physical devices and social networks, deepening the study of complex interaction patterns. Text Attribute Graphs (TAGs) capture both topological structures and semantic attributes, enhancing the analysis of complex interactions within the SIoT. However, existing graph learning methods are typically designed for complete attributed graphs, and the common issue of missing attributes in Attribute Missing Graphs (AMGs) increases the difficulty of analysis tasks. To address this, we propose the Topology-Driven Attribute Recovery (TDAR) framework, which leverages topological data for AMG learning. TDAR introduces an improved pre-filling method for initial attribute recovery using native graph topology. Additionally, it dynamically adjusts propagation weights and incorporates homogeneity strategies within the embedding space to suit AMGs’ unique topological structures, effectively reducing noise during information propagation. Extensive experiments on public datasets demonstrate that TDAR significantly outperforms state-of-the-art methods in attribute reconstruction and downstream tasks, offering a robust solution to the challenges posed by AMGs. The code is available at this https URL.
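摘要中"利用原生图拓扑进行属性预填充(pre-filling)"的最简单实现形式是特征传播：用邻居属性的均值迭代填充缺失项、保持观测项不变。以下是该通用思路的numpy草图（非TDAR的完整框架）：

```python
import numpy as np

def prefill_missing_attributes(X, adj, observed_mask, n_iters=20):
    """Topology-driven pre-filling of missing node attributes.

    Iteratively replaces missing entries with the mean of each node's
    neighbors' current attributes, while observed entries stay fixed.
    A generic feature-propagation sketch of the pre-filling idea only.
    """
    X = np.where(observed_mask, X, 0.0)
    deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
    filled = X.copy()
    for _ in range(n_iters):
        propagated = (adj @ filled) / deg          # neighbor mean
        filled = np.where(observed_mask, X, propagated)
    return filled

# Toy graph: 4 nodes on a path; node 2's attributes are fully missing.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 0.0], [4.0, 3.0]])
mask = np.ones_like(X, dtype=bool)
mask[2] = False                                    # the attribute-missing node
print(prefill_missing_attributes(X, adj, mask))    # node 2 -> mean of neighbors
```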

[AI-11] Enhancing UAV Path Planning Efficiency Through Accelerated Learning

链接: https://arxiv.org/abs/2501.10141
作者: Joseanne Viana,Boris Galkin,Lester Ho,Holger Claussen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper was accepted in this https URL conference but it is not available from the conference yet

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly essential in various fields such as surveillance, reconnaissance, and telecommunications. This study aims to develop a learning algorithm for the path planning of UAV wireless communication relays, which can reduce storage requirements and accelerate Deep Reinforcement Learning (DRL) convergence. Assuming the system possesses terrain maps of the area and can estimate user locations using localization algorithms or direct GPS reporting, it can input these parameters into the learning algorithms to achieve optimized path planning performance. However, higher resolution terrain maps are necessary to extract topological information such as terrain height, object distances, and signal blockages. This requirement increases memory and storage demands on UAVs while also lengthening convergence times in DRL algorithms. Similarly, defining the telecommunication coverage map in UAV wireless communication relays using these terrain maps and user position estimations demands higher memory and storage utilization for the learning path planning algorithms. Our approach reduces path planning training time by applying a dimensionality reduction technique based on Principal Component Analysis (PCA), sample combination, Prioritized Experience Replay (PER), and the combination of Mean Squared Error (MSE) and Mean Absolute Error (MAE) loss calculations in the coverage map estimates, thereby enhancing a Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. The proposed solution reduces the convergence episodes needed for basic training by approximately four times compared to the traditional TD3.
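下面示意"用PCA压缩覆盖图(coverage map)后再输入强化学习智能体"这一降维步骤（基于sklearn；地图尺寸、样本数与95%方差阈值均为示例值，并非论文配置）：

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulated stack of coverage-map observations: 500 samples of a
# 64x64 map, flattened to vectors before dimensionality reduction.
rng = np.random.default_rng(9)
maps = rng.random((500, 64 * 64)).astype(np.float32)

# Keep enough principal components to explain 95% of the variance,
# shrinking the state representation fed to the RL agent.
pca = PCA(n_components=0.95)
compressed = pca.fit_transform(maps)
print("original dim:", maps.shape[1], "-> reduced dim:", compressed.shape[1])

# A new observation is projected the same way before entering TD3.
new_map = rng.random((1, 64 * 64)).astype(np.float32)
state = pca.transform(new_map)
```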

[AI-12] Conformal Prediction Sets with Improved Conditional Coverag e using Trust Scores

链接: https://arxiv.org/abs/2501.10139
作者: Jivat Neet Kaur,Michael I. Jordan,Ahmed Alaa
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Standard conformal prediction offers a marginal guarantee on coverage, but for prediction sets to be truly useful, they should ideally ensure coverage conditional on each test point. Unfortunately, it is impossible to achieve exact, distribution-free conditional coverage in finite samples. In this work, we propose an alternative conformal prediction algorithm that targets coverage where it matters most: in instances where a classifier is overconfident in its incorrect predictions. We start by dissecting miscoverage events in marginally-valid conformal prediction, and show that miscoverage rates vary based on the classifier’s confidence and its deviation from the Bayes optimal classifier. Motivated by this insight, we develop a variant of conformal prediction that targets coverage conditional on a reduced set of two variables: the classifier’s confidence in a prediction and a nonparametric trust score that measures its deviation from the Bayes classifier. Empirical evaluation on multiple image datasets shows that our method generally improves conditional coverage properties compared to standard conformal prediction, including class-conditional coverage, coverage over arbitrary subgroups, and coverage over demographic groups.
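为说明"以分类器置信度为条件的共形预测"思路，下面给出一个粗糙的假设性草图：按校准集上的top-1置信度分桶，每个桶内单独计算分位数阈值（split conformal 的Mondrian式变体）。这只是对论文方法的极简近似，未涉及其信任分数(trust score)：

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1, n_bins=3):
    """Split conformal prediction with confidence-binned thresholds.

    Nonconformity score: 1 - p(true class). Calibration points are
    grouped by the classifier's top confidence and a separate quantile
    is computed per bin -- a crude stand-in for the paper's
    conditioning on confidence and trust scores.
    """
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    conf = cal_probs.max(axis=1)
    edges = np.quantile(conf, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0
    qhats = []
    for b in range(n_bins):
        s = scores[(conf >= edges[b]) & (conf <= edges[b + 1])]
        n = len(s)
        q = np.quantile(s, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)) if n else 1.0
        qhats.append(q)
    test_bin = np.clip(np.searchsorted(edges, test_probs.max(axis=1), side="right") - 1,
                       0, n_bins - 1)
    thresholds = np.asarray(qhats)[test_bin]
    return test_probs >= (1.0 - thresholds)[:, None]   # boolean prediction sets

rng = np.random.default_rng(10)
logits = rng.normal(size=(300, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 4, size=300)
sets = conformal_sets(probs[:200], labels[:200], probs[200:])
print("average set size:", sets.sum(axis=1).mean())
```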

[AI-13] Exploring the Impact of Generative Artificial Intelligence in Education: A Thematic Analysis

链接: https://arxiv.org/abs/2501.10134
作者: Abhishek Kaushik(1),Sargam Yadav(1),Andrew Browne(2),David Lillis(3),David Williams(2),Jack Mc Donnell(1),Peadar Grant(1),Siobhan Connolly Kernan(1),Shubham Sharma(4),Mansi Arora(5) ((1) School of Informatics and Creative Arts, Dundalk Institute of Technology, Dundalk, Co. Louth, Ireland, (2) Dublin Business School, Dublin, Co. Dublin, Ireland (3) University College Dublin, Belfield, Dublin, Co. Dublin, Ireland (4) Technological University Dublin, Dublin, Co. Dublin, Ireland (5) Jagan Institute of Management Studies, Rohini, Delhi, Delhi, India)
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The recent advancements in Generative Artificial intelligence (GenAI) technology have been transformative for the field of education. Large Language Models (LLMs) such as ChatGPT and Bard can be leveraged to automate boilerplate tasks, create content for personalised teaching, and handle repetitive tasks to allow more time for creative thinking. However, it is important to develop guidelines, policies, and assessment methods in the education sector to ensure the responsible integration of these tools. In this article, thematic analysis has been performed on seven essays obtained from professionals in the education sector to understand the advantages and pitfalls of using GenAI models such as ChatGPT and Bard in education. Exploratory Data Analysis (EDA) has been performed on the essays to extract further insights from the text. The study found several themes which highlight benefits and drawbacks of GenAI tools, as well as suggestions to overcome these limitations and ensure that students are using these tools in a responsible and ethical manner.

[AI-14] Infrastructure for AI Agents

链接: https://arxiv.org/abs/2501.10114
作者: Alan Chan,Kevin Wei,Sihao Huang,Nitarshan Rajkumar,Elija Perrier,Seth Lazar,Gillian K. Hadfield,Markus Anderljung
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Increasingly many AI systems can plan and execute interactions in open-ended environments, such as making phone calls or buying online goods. As developers grow the space of tasks that such AI agents can accomplish, we will need tools both to unlock their benefits and manage their risks. Current tools are largely insufficient because they are not designed to shape how agents interact with existing institutions (e.g., legal and economic systems) or actors (e.g., digital service providers, humans, other AI agents). For example, alignment techniques by nature do not assure counterparties that some human will be held accountable when a user instructs an agent to perform an illegal action. To fill this gap, we propose the concept of agent infrastructure: technical systems and shared protocols external to agents that are designed to mediate and influence their interactions with and impacts on their environments. Agent infrastructure comprises both new tools and reconfigurations or extensions of existing tools. For example, to facilitate accountability, protocols that tie users to agents could build upon existing systems for user authentication, such as OpenID. Just as the Internet relies on infrastructure like HTTPS, we argue that agent infrastructure will be similarly indispensable to ecosystems of agents. We identify three functions for agent infrastructure: 1) attributing actions, properties, and other information to specific agents, their users, or other actors; 2) shaping agents’ interactions; and 3) detecting and remedying harmful actions from agents. We propose infrastructure that could help achieve each function, explaining use cases, adoption, limitations, and open questions. Making progress on agent infrastructure can prepare society for the adoption of more advanced agents.

[AI-15] LLM Reasoner and Automated Planner: A new NPC approach

链接: https://arxiv.org/abs/2501.10106
作者: Israel Puerta-Merino,Jordi Sabater-Mir
类目: Artificial Intelligence (cs.AI)
*备注: 15 pages, 7 figures, extended version of the homonymous paper submitted to the Catalan Conference on Artificial Intelligent (CCIA) 2025

点击查看摘要

Abstract:In domains requiring intelligent agents to emulate plausible human-like behaviour, such as formative simulations, traditional techniques like behaviour trees encounter significant challenges. Large Language Models (LLMs), despite not always yielding optimal solutions, usually offer plausible and human-like responses to a given problem. In this paper, we exploit this capability and propose a novel architecture that integrates an LLM for decision-making with a classical automated planner that can generate sound plans for that decision. The combination aims to equip an agent with the ability to make decisions in various situations, even if they were not anticipated during the design phase.

[AI-16] Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics

链接: https://arxiv.org/abs/2501.10100
作者: Chenhao Li,Andreas Krause,Marco Hutter
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. Through extensive experiments, our approach consistently outperforms state-of-the-art methods, demonstrating superior autoregressive prediction accuracy, robustness to noise, and generalization across manipulation and locomotion tasks. Notably, policies trained with our method are successfully deployed on ANYmal D hardware in a zero-shot transfer, achieving robust performance with minimal sim-to-real performance loss. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.

[AI-17] How Do Programming Students Use Generative AI?

链接: https://arxiv.org/abs/2501.10091
作者: Christian Rahe,Walid Maalej
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: preprint; accepted to ACM International Conference on the Foundations of Software Engineering (FSE) 2025

点击查看摘要

Abstract:Programming students have a widespread access to powerful Generative AI tools like ChatGPT. While this can help understand the learning material and assist with exercises, educators are voicing more and more concerns about an over-reliance on generated outputs and lack of critical thinking skills. It is thus important to understand how students actually use generative AI and what impact this could have on their learning behavior. To this end, we conducted a study including an exploratory experiment with 37 programming students, giving them monitored access to ChatGPT while solving a code understanding and improving exercise. While only 23 of the students actually opted to use the chatbot, the majority of those eventually prompted it to simply generate a full solution. We observed two prevalent usage strategies: to seek knowledge about general concepts and to directly generate solutions. Instead of using the bot to comprehend the code and their own mistakes, students often got trapped in a vicious cycle of submitting wrong generated code and then asking the bot for a fix. Those who self-reported using generative AI regularly were more likely to prompt the bot to generate a solution. Our findings indicate that concerns about potential decrease in programmers’ agency and productivity with Generative AI are justified. We discuss how researchers and educators can respond to the potential risk of students uncritically over-relying on generative AI. We also discuss potential modifications to our study design for large-scale replications.

[AI-18] A Survey on LLM Test-Time Compute via Search: Tasks, LLM Profiling, Search Algorithms, and Relevant Frameworks

链接: https://arxiv.org/abs/2501.10069
作者: Xinzhe Li
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:LLM test-time compute (or LLM inference) via search has emerged as a promising research area with rapid developments. However, current frameworks often adopt distinct perspectives on three key aspects (task definition, LLM profiling, and search procedures), making direct comparisons challenging. Moreover, the search algorithms employed often diverge from standard implementations, and their specific characteristics are not thoroughly specified. In this survey, we provide a comprehensive technical review that unifies task definitions and provides modular definitions of LLM profiling and search procedures. The definitions enable precise comparisons of various LLM inference frameworks while highlighting their departures from conventional search algorithms. We also discuss the applicability, performance, and efficiency of these methods. For further details and ongoing updates, please refer to our GitHub repository: this https URL

[AI-19] Accelerating Large Language Models through Partially Linear Feed-Forward Network

链接: https://arxiv.org/abs/2501.10054
作者: Gansen Hu,Zhaoguo Wang,Jinglin Wei,Wei Huang,Haibo Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their massive parameter counts. While existing compression techniques like pruning can reduce model size, it leads to significant accuracy degradation under high compression ratios. We present a novel perspective inspired by constant folding in compiler optimization. Our approach enables parameter reduction by treating activation functions in LLMs as linear functions. However, recent LLMs use complex non-linear activations like GELU that prevent direct application of this technique. We propose TARDIS, which enables optimization of LLMs with non-linear activations by partially approximating them with linear functions in frequently occurring input ranges. For outlier inputs, TARDIS employs an online predictor to dynamically fall back to original computations. Our experiments demonstrate that TARDIS achieves 80% parameter reduction in feed-forward networks, while significantly outperforming state-of-the-art pruning methods Wanda and RIA with up to 65% higher accuracy. In practical deployments for a 7B model, TARDIS achieves 1.6x end-to-end inference speedup when integrated with the vLLM serving system, and 1.4x speedup with the widely adopted HuggingFace implementation, while incurring only a 10.9% accuracy trade-off.
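
The core trick is easy to prototype: fit a linear function to the activation over its frequently occurring input range and fall back to the exact nonlinearity for outliers. The sketch below applies this to GELU; the fit range and fallback threshold are assumptions, not TARDIS's measured values.

```python
import numpy as np

# Sketch of the core TARDIS idea (fit range and thresholds are assumptions,
# not the paper's values): fit a linear approximation of GELU over a "hot"
# input range and fall back to the exact activation for outlier inputs.

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Least-squares linear fit a*x + b on the assumed hot range [-2, 2].
xs = np.linspace(-2.0, 2.0, 1000)
a, b = np.polyfit(xs, gelu(xs), deg=1)

def approx_gelu(x, lo=-2.0, hi=2.0):
    """Linear approximation inside [lo, hi]; exact GELU outside."""
    inside = (x >= lo) & (x <= hi)
    return np.where(inside, a * x + b, gelu(x))

x = np.random.default_rng(2).normal(size=8) * 3
print(np.c_[gelu(x), approx_gelu(x)])   # compare exact vs. approximated
```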

[AI-20] AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation via Tree-based Search

链接: https://arxiv.org/abs/2501.10053
作者: Wenfeng Feng,Chuzhan Hao,Yuewei Zhang,Jingyi Song,Hao Wang
类目: Artificial Intelligence (cs.AI)
*备注: 17 pages, 14 figures

点击查看摘要

Abstract:Leveraging the autonomous decision-making capabilities of large language models (LLMs) demonstrates superior performance in reasoning tasks. Despite the successes of iterative or recursive retrieval-augmented generation (RAG), such methods are often trapped in a single solution space when confronted with complex tasks. In this paper, we propose a novel thinking pattern in RAG which integrates system analysis with efficient reasoning actions, significantly activating intrinsic reasoning capabilities and expanding the solution space of specific tasks via Monte Carlo Tree Search (MCTS), dubbed AirRAG. Specifically, our approach designs five fundamental reasoning actions that are expanded into a wide tree-based reasoning space using MCTS. The extension also uses self-consistency verification to explore potential reasoning paths and implement inference scaling. In addition, computationally optimal strategies are used to apply more inference computation to key actions to achieve further performance improvements. Experimental results demonstrate the effectiveness of AirRAG through considerable performance gains over complex QA datasets. Furthermore, AirRAG is flexible and lightweight, making it easy to integrate with other advanced technologies.

[AI-21] Virtual Nodes Improve Long-term Traffic Prediction

链接: https://arxiv.org/abs/2501.10048
作者: Xiaoyang Cao,Dingyi Zhuang,Jinhua Zhao,Shenhao Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Effective traffic prediction is a cornerstone of intelligent transportation systems, enabling precise forecasts of traffic flow, speed, and congestion. While traditional spatio-temporal graph neural networks (ST-GNNs) have achieved notable success in short-term traffic forecasting, their performance in long-term predictions remains limited. This challenge arises from over-squashing problem, where bottlenecks and limited receptive fields restrict information flow and hinder the modeling of global dependencies. To address these challenges, this study introduces a novel framework that incorporates virtual nodes, which are additional nodes added to the graph and connected to existing nodes, in order to aggregate information across the entire graph within a single GNN layer. Our proposed model incorporates virtual nodes by constructing a semi-adaptive adjacency matrix. This matrix integrates distance-based and adaptive adjacency matrices, allowing the model to leverage geographical information while also learning task-specific features from data. Experimental results demonstrate that the inclusion of virtual nodes significantly enhances long-term prediction accuracy while also improving layer-wise sensitivity to mitigate the over-squashing problem. Virtual nodes also offer enhanced explainability by focusing on key intersections and high-traffic areas, as shown by the visualization of their adjacency matrix weights on road network heat maps. Our advanced approach enhances the understanding and management of urban traffic systems, making it particularly well-suited for real-world applications.
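
A rough sketch of the two structural ideas, an appended virtual node and a semi-adaptive adjacency blending a fixed distance-based matrix with a learned one, is shown below; the shapes and the equal-weight blending rule are assumptions.

```python
import torch

# Sketch (shapes and the blending rule are assumptions): append one virtual
# node connected to every real node, and build a semi-adaptive adjacency
# that mixes a fixed distance-based matrix with a learned adaptive part.

n = 5                                      # real nodes
dist_adj = torch.rand(n, n)                # distance-based adjacency (given)

# Learnable node embeddings induce the adaptive adjacency.
emb = torch.nn.Parameter(torch.randn(n + 1, 8))   # +1 for the virtual node
adaptive = torch.softmax(torch.relu(emb @ emb.T), dim=-1)

# Embed the fixed matrix in the enlarged graph; the virtual-node row and
# column are fully connected so one GNN layer can aggregate globally.
full = torch.zeros(n + 1, n + 1)
full[:n, :n] = dist_adj
full[n, :n] = 1.0
full[:n, n] = 1.0

semi_adaptive = 0.5 * full + 0.5 * adaptive       # simple equal-weight blend
print(semi_adaptive.shape)                        # torch.Size([6, 6])
```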

[AI-22] Spatiotemporal Prediction of Secondary Crashes by Rebalancing Dynamic and Static Data with Generative Adversarial Networks

链接: https://arxiv.org/abs/2501.10041
作者: Junlan Chen,Yiqun Li,Chenyu Ling,Ziyuan Pu,Xiucheng Guo
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data imbalance is a common issue in analyzing and predicting sudden traffic events. Secondary crashes constitute only a small proportion of all crashes. These secondary crashes, triggered by primary crashes, significantly exacerbate traffic congestion and increase the severity of incidents. However, the severe imbalance of secondary crash data poses significant challenges for prediction models, affecting their generalization ability and prediction accuracy. Existing methods fail to fully address the complexity of traffic crash data, particularly the coexistence of dynamic and static features, and often struggle to effectively handle data samples of varying lengths. Furthermore, most current studies predict the occurrence probability and spatiotemporal distribution of secondary crashes separately, lacking an integrated solution. To address these challenges, this study proposes a hybrid model named VarFusiGAN-Transformer, aimed at improving the fidelity of secondary crash data generation and jointly predicting the occurrence and spatiotemporal distribution of secondary crashes. The VarFusiGAN-Transformer model employs Long Short-Term Memory (LSTM) networks to enhance the generation of multivariate long-time series data, incorporating a static data generator and an auxiliary discriminator to model the joint distribution of dynamic and static features. In addition, the model’s prediction module achieves simultaneous prediction of both the occurrence and spatiotemporal distribution of secondary crashes. Compared to existing methods, the proposed model demonstrates superior performance in generating high-fidelity data and improving prediction accuracy.

[AI-23] Enhancing Crash Frequency Modeling Based on Augmented Multi-Type Data by Hybrid VAE-Diffusion-Based Generative Neural Networks

链接: https://arxiv.org/abs/2501.10017
作者: Junlan Chen,Qijie He,Pei Liu,Wei Ma,Ziyuan Pu
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Crash frequency modelling analyzes the impact of factors like traffic volume, road geometry, and environmental conditions on crash occurrences. Inaccurate predictions can distort our understanding of these factors, leading to misguided policies and wasted resources, which jeopardize traffic safety. A key challenge in crash frequency modelling is the prevalence of excessive zero observations, caused by underreporting, the low probability of crashes, and high data collection costs. These zero observations often reduce model accuracy and introduce bias, complicating safety decision making. While existing approaches, such as statistical methods, data aggregation, and resampling, attempt to address this issue, they either rely on restrictive assumptions or result in significant information loss, distorting crash data. To overcome these limitations, we propose a hybrid VAE-Diffusion neural network, designed to reduce zero observations and handle the complexities of multi-type tabular crash data (count, ordinal, nominal, and real-valued variables). We assess the synthetic data quality generated by this model through metrics like similarity, accuracy, diversity, and structural consistency, and compare its predictive performance against traditional statistical models. Our findings demonstrate that the hybrid VAE-Diffusion model outperforms baseline models across all metrics, offering a more effective approach to augmenting crash data and improving the accuracy of crash frequency predictions. This study highlights the potential of synthetic data to enhance traffic safety by improving crash frequency modelling and informing better policy decisions.

[AI-24] Adaptive Spatiotemporal Augmentation for Improving Dynamic Graph Learning ICASSP2025

链接: https://arxiv.org/abs/2501.10010
作者: Xu Chu,Hanlin Xue,Bingce Wang,Xiaoyang Liu,Weiping Li,Tong Mo,Tuoyu Feng,Zhijie Tan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

点击查看摘要

Abstract:Dynamic graph augmentation is used to improve the performance of dynamic GNNs. Most methods assume temporal locality, meaning that recent edges are more influential than earlier edges. However, for temporal changes in edges caused by random noise, overemphasizing recent edges while neglecting earlier ones may lead to the model capturing noise. To address this issue, we propose STAA (SpatioTemporal Activity-Aware Random Walk Diffusion). STAA identifies nodes likely to have noisy edges in spatiotemporal dimensions. Spatially, it analyzes critical topological positions through graph wavelet coefficients. Temporally, it analyzes edge evolution through graph wavelet coefficient change rates. Then, random walks are used to reduce the weights of noisy edges, deriving a diffusion matrix containing spatiotemporal information as an augmented adjacency matrix for dynamic GNN learning. Experiments on multiple datasets show that STAA outperforms other dynamic graph augmentation methods in node classification and link prediction tasks.
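
The diffusion step can be sketched independently of the wavelet analysis: down-weight suspected noisy edges, then sum weighted powers of the random-walk transition matrix to obtain an augmented adjacency. The noise scores and theta coefficients below are placeholders for the wavelet-derived quantities.

```python
import numpy as np

# Random-walk diffusion sketch (the wavelet-based noise scoring is not
# reproduced; the down-weighting factor and theta_k weights are assumptions).

rng = np.random.default_rng(3)
A = (rng.random((6, 6)) < 0.4).astype(float)
np.fill_diagonal(A, 0)
A = np.maximum(A, A.T)                      # undirected graph

noise_score = rng.random((6, 6))            # stand-in for wavelet-based scores
A = A * (1.0 - 0.5 * noise_score)           # down-weight suspected noisy edges

deg = A.sum(axis=1, keepdims=True).clip(min=1e-9)
T = A / deg                                 # row-stochastic transition matrix

theta = np.array([0.5, 0.3, 0.2])           # decaying walk-length weights
diffusion = sum(t * np.linalg.matrix_power(T, k + 1)
                for k, t in enumerate(theta))
print(diffusion.round(2))                   # augmented adjacency for the GNN
```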

[AI-25] Fast energy-aware OLSR routing in VANETs by means of a parallel evolutionary algorithm

链接: https://arxiv.org/abs/2501.09996
作者: Jamal Toutouh,Sergio Nesmachnow,Enrique Alba
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:This work tackles the problem of reducing the power consumption of the OLSR routing protocol in vehicular networks. Nowadays, energy-aware and green communication protocols are important research topics, specially when deploying wireless mobile networks. This article introduces a fast automatic methodology to search for energy-efficient OLSR configurations by using a parallel evolutionary algorithm. The experimental analysis demonstrates that significant improvements over the standard configuration can be attained in terms of power consumption, with no noteworthy loss in the QoS.

[AI-26] GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions AAAI-25

链接: https://arxiv.org/abs/2501.09972
作者: Heda Zuo,Weitao You,Junxian Wu,Shihong Ren,Pei Chen,Mingxu Zhou,Yujia Lu,Lingyun Sun
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Accepted by the 39th AAAI Conference on Artificial Intelligence (AAAI-25)

点击查看摘要

Abstract:Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present General Video-to-Music Generation model (GVMGen), designed for generating high-related music to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality.

[AI-27] AIRCHITECT v2: Learning the Hardware Accelerator Design Space through Unified Representations DATE2025

链接: https://arxiv.org/abs/2501.09954
作者: Jamin Seo,Akshat Ramachandran,Yu-Chuan Chuang,Anirudh Itagi,Tushar Krishna
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: Accepted to DATE 2025

点击查看摘要

Abstract:Design space exploration (DSE) plays a crucial role in enabling custom hardware architectures, particularly for emerging applications like AI, where optimized and specialized designs are essential. With the growing complexity of deep neural networks (DNNs) and the introduction of advanced foundational models (FMs), the design space for DNN accelerators is expanding at an exponential rate. Additionally, this space is highly non-uniform and non-convex, making it increasingly difficult to navigate and optimize. Traditional DSE techniques rely on search-based methods, which involve iterative sampling of the design space to find the optimal solution. However, this process is both time-consuming and often fails to converge to the global optima for such design spaces. Recently, AIrchitect v1, the first attempt to address the limitations of search-based techniques, transformed DSE into a constant-time classification problem using recommendation networks. In this work, we propose AIrchitect v2, a more accurate and generalizable learning-based DSE technique applicable to large-scale design spaces that overcomes the shortcomings of earlier approaches. Specifically, we devise an encoder-decoder transformer model that (a) encodes the complex design space into a uniform intermediate representation using contrastive learning and (b) leverages a novel unified representation blending the advantages of classification and regression to effectively explore the large DSE space without sacrificing accuracy. Experimental results evaluated on 10^5 real DNN workloads demonstrate that, on average, AIrchitect v2 outperforms existing techniques by 15% in identifying optimal design points. Furthermore, to demonstrate the generalizability of our method, we evaluate performance on unseen model workloads (LLMs) and attain a 1.7x improvement in inference latency on the identified hardware architecture.

[AI-28] MultiPruner: Balanced Structure Removal in Foundation Models

链接: https://arxiv.org/abs/2501.09949
作者: J. Pablo Muñoz,Jinjie Yuan,Nilesh Jain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, state-of-the-art approaches for pruning large pre-trained models (LPMs) have demonstrated that the training-free removal of non-critical residual blocks in Transformers is viable for reducing model size, achieving results that outperform previous training-free pruning approaches. Motivated by these findings, we extend BlockPruner (Zhong et al., 2024) and propose MultiPruner, a pruning approach that surpasses recent training-free pruning methods by adopting a multidimensional, iterative, fine-grained pruning strategy. In MultiPruner, multidimensional pruning reinstates the structural balance in block-pruned models by sequentially compressing along three dimensions: i) residual blocks, ii) channels of multilayer perceptrons (MLP), and iii) attention heads. This solution enhances zero-shot accuracy on downstream tasks compared to other techniques while improving model compression ratios, producing compressed models with fewer computing and memory requirements. Extensive experiments demonstrate the advantages of the proposed method across various large pre-trained models. The code and pruning configurations are available at this https URL.

[AI-29] AI Explainability for Power Electronics: From a Lipschitz Continuity Perspective

链接: https://arxiv.org/abs/2501.09948
作者: Xinze Li,Fanfan Lin,Homer Alan Mantooth,Juan José Rodríguez-Andina
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Lifecycle management of power converters continues to thrive with emerging artificial intelligence (AI) solutions, yet AI mathematical explainability remains unexplored in power electronics (PE) community. The lack of theoretical rigor challenges adoption in mission-critical applications. Therefore, this letter proposes a generic framework to evaluate mathematical explainability, highlighting inference stability and training convergence from a Lipschitz continuity perspective. Inference stability governs consistent outputs under input perturbations, essential for robust real-time control and fault diagnosis. Training convergence guarantees stable learning dynamics, facilitating accurate modeling in PE contexts. Additionally, a Lipschitz-aware learning rate selection strategy is introduced to accelerate convergence while mitigating overshoots and oscillations. The feasibility of the proposed Lipschitz-oriented framework is demonstrated by validating the mathematical explainability of a state-of-the-art physics-in-architecture neural network, and substantiated through empirical case studies on dual-active-bridge converters. This letter serves as a clarion call for the PE community to embrace mathematical explainability, heralding a transformative era of trustworthy and explainable AI solutions that potentially redefine the future of power electronics.
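
One concrete way to act on the Lipschitz view is to estimate a layer's Lipschitz constant as its spectral norm and cap the learning rate accordingly. The sketch below uses power iteration and the classical 1/L step-size heuristic for smooth losses; it illustrates the idea rather than the letter's exact selection strategy.

```python
import numpy as np

# Heuristic sketch (not necessarily the letter's exact strategy): estimate
# a linear layer's Lipschitz constant as its spectral norm via power
# iteration, then scale the learning rate inversely to damp overshoots.

def spectral_norm(W, iters=50):
    """Largest singular value of W by power iteration."""
    v = np.random.default_rng(4).normal(size=W.shape[1])
    for _ in range(iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)       # u^T W v ~= top singular value

W = np.random.default_rng(5).normal(size=(64, 32)) / np.sqrt(32)
L = spectral_norm(W)
base_lr = 1e-2
lr = min(base_lr, 1.0 / L)        # classical 1/L step-size cap for smooth losses
print(f"estimated Lipschitz constant {L:.3f}, chosen lr {lr:.4f}")
```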

[AI-30] Client-Centric Federated Adaptive Optimization

链接: https://arxiv.org/abs/2501.09946
作者: Jianhui Sun,Xidong Wu,Heng Huang,Aidong Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a distributed learning paradigm where clients collaboratively train a model while keeping their own data private. With an increasing scale of clients and models, FL encounters two key challenges, client drift due to a high degree of statistical/system heterogeneity, and lack of adaptivity. However, most existing FL research is based on unrealistic assumptions that virtually ignore system heterogeneity. In this paper, we propose Client-Centric Federated Adaptive Optimization, which is a class of novel federated adaptive optimization approaches. We enable several features in this framework such as arbitrary client participation, asynchronous server aggregation, and heterogeneous local computing, which are ubiquitous in real-world FL systems but are missed in most existing works. We provide a rigorous convergence analysis of our proposed framework for general nonconvex objectives, which is shown to converge with the best-known rate. Extensive experiments show that our approaches consistently outperform the baseline by a large margin across benchmarks.

[AI-31] HEART: Achieving Timely Multi-Model Training for Vehicle-Edge-Cloud-Integrated Hierarchical Federated Learning

链接: https://arxiv.org/abs/2501.09934
作者: Xiaohong Yang,Minghui Liwang,Xianbin Wang,Zhipeng Cheng,Seyyedali Hosseinalipour,Huaiyu Dai,Zhenzhen Jiao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 6 figures,

点击查看摘要

Abstract:The rapid growth of AI-enabled Internet of Vehicles (IoV) calls for efficient machine learning (ML) solutions that can handle high vehicular mobility and decentralized data. This has motivated the emergence of Hierarchical Federated Learning over vehicle-edge-cloud architectures (VEC-HFL). Nevertheless, one aspect which is underexplored in the literature on VEC-HFL is that vehicles often need to execute multiple ML tasks simultaneously, where this multi-model training environment introduces crucial challenges. First, improper aggregation rules can lead to model obsolescence and prolonged training times. Second, vehicular mobility may result in inefficient data utilization by preventing the vehicles from returning their models to the network edge. Third, achieving a balanced resource allocation across diverse tasks becomes of paramount importance as it majorly affects the effectiveness of collaborative training. We take one of the first steps towards addressing these challenges via proposing a framework for multi-model training in dynamic VEC-HFL with the goal of minimizing global training latency while ensuring balanced training across various tasks, a problem that turns out to be NP-hard. To facilitate timely model training, we introduce a hybrid synchronous-asynchronous aggregation rule. Building on this, we present a novel method called Hybrid Evolutionary And gReedy allocaTion (HEART). The framework operates in two stages: first, it achieves balanced task scheduling through a hybrid heuristic approach that combines improved Particle Swarm Optimization (PSO) and Genetic Algorithms (GA); second, it employs a low-complexity greedy algorithm to determine the training priority of assigned tasks on vehicles. Experiments on real-world datasets demonstrate the superiority of HEART over existing methods.

[AI-32] Study on a Fast Solver for Combined Field Integral Equations of 3D Conducting Bodies Based on Graph Neural Networks

链接: https://arxiv.org/abs/2501.09923
作者: Tao Shan,Xin Zhang,Di Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注: 10 pages,11 figures

点击查看摘要

Abstract:In this paper, we present a graph neural networks (GNNs)-based fast solver (GraphSolver) for solving combined field integral equations (CFIEs) of 3D conducting bodies. Rao-Wilton-Glisson (RWG) basis functions are employed to discretely and accurately represent the geometry of 3D conducting bodies. A concise and informative graph representation is then constructed by treating each RWG function as a node in the graph, enabling the flow of current between nodes. With the transformed graphs, GraphSolver is developed to directly predict real and imaginary parts of the x, y and z components of the surface current densities at each node (RWG function). Numerical results demonstrate the efficacy of GraphSolver in solving CFIEs for 3D conducting bodies with varying levels of geometric complexity, including basic 3D targets, missile-shaped targets, and airplane-shaped targets.

[AI-33] GenSC-6G: A Prototype Testbed for Integrated Generative AI Quantum and Semantic Communication

链接: https://arxiv.org/abs/2501.09918
作者: Brian E. Arfeto,Shehbaz Tariq,Uman Khalid,Trung Q. Duong,Hyundong Shin
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Quantum Physics (quant-ph)
*备注: SUBMITTED FOR PUBLICATION IN IEEE COMMUNICATIONS MAGAZINE

点击查看摘要

Abstract:We introduce a prototyping testbed, GenSC-6G, developed to generate a comprehensive dataset that supports the integration of generative artificial intelligence (AI), quantum computing, and semantic communication for emerging sixth-generation (6G) applications. The GenSC-6G dataset is designed with noise-augmented synthetic data optimized for semantic decoding, classification, and localization tasks, significantly enhancing flexibility for diverse AI-driven communication applications. This adaptable prototype supports seamless modifications across baseline models, communication modules, and goal-oriented decoders. Case studies demonstrate its application in lightweight classification, semantic upsampling, and edge-based language inference under noise conditions. The GenSC-6G dataset serves as a scalable and robust resource for developing goal-oriented communication systems tailored to the growing demands of 6G networks.

[AI-34] Towards A Litmus Test for Common Sense

链接: https://arxiv.org/abs/2501.09913
作者: Hugo Latapie
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper is the second in a planned series aimed at envisioning a path to safe and beneficial artificial intelligence. Building on the conceptual insights of “Common Sense Is All You Need,” we propose a more formal litmus test for common sense, adopting an axiomatic approach that combines minimal prior knowledge (MPK) constraints with diagonal or Gödel-style arguments to create tasks beyond the agent’s known concept set. We discuss how this approach applies to the Abstraction and Reasoning Corpus (ARC), acknowledging training/test data constraints, physical or virtual embodiment, and large language models (LLMs). We also integrate observations regarding emergent deceptive hallucinations, in which more capable AI systems may intentionally fabricate plausible yet misleading outputs to disguise knowledge gaps. The overarching theme is that scaling AI without ensuring common sense risks intensifying such deceptive tendencies, thereby undermining safety and trust. Aligning with the broader goal of developing beneficial AI without causing harm, our axiomatic litmus test not only diagnoses whether an AI can handle truly novel concepts but also provides a stepping stone toward an ethical, reliable foundation for future safe, beneficial, and aligned artificial intelligence.

[AI-35] Evolving Deeper LLM Thinking

链接: https://arxiv.org/abs/2501.09891
作者: Kuang-Huei Lee,Ian Fischer,Yueh-Hua Wu,Dave Marwood,Shumeet Baluja,Dale Schuurmans,Xinyun Chen
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
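
The generate/recombine/refine loop is easy to picture in miniature. In the sketch below, `propose`, `recombine`, `refine`, and `evaluate` are hypothetical stand-ins for what would be LLM calls and a task-specific solution evaluator in the paper.

```python
import random

# Generic generate/recombine/refine loop in the spirit of Mind Evolution.
# All four functions are hypothetical stand-ins: in the paper these would
# be LLM calls plus a task-specific solution evaluator.

def propose():
    return [random.random() for _ in range(5)]          # candidate "plan"

def recombine(a, b):
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def refine(c):
    c = list(c)
    c[random.randrange(len(c))] += random.gauss(0, 0.1) # small local edit
    return c

def evaluate(c):
    return -sum((x - 0.7) ** 2 for x in c)              # toy fitness

pop = [propose() for _ in range(20)]
for _ in range(50):                                     # evolutionary rounds
    pop.sort(key=evaluate, reverse=True)
    parents = pop[:10]                                  # keep the best half
    children = [refine(recombine(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    pop = parents + children

print(max(evaluate(c) for c in pop))
```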

[AI-36] Exploring the Implementation of AI in Early Onset Interviews to Help Mitigate Bias

链接: https://arxiv.org/abs/2501.09890
作者: Nishka Lal,Omar Benkraouda
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper investigates the application of artificial intelligence (AI) in early-stage recruitment interviews in order to reduce inherent bias, specifically sentiment bias. Traditional interviewers are often subject to several biases, including interviewer bias, social desirability effects, and even confirmation bias. In turn, this leads to non-inclusive hiring practices, and a less diverse workforce. This study further analyzes various AI interventions that are present in the marketplace today such as multimodal platforms and interactive candidate assessment tools in order to gauge the current market usage of AI in early-stage recruitment. However, this paper aims to use a unique AI system that was developed to transcribe and analyze interview dynamics, which emphasize skill and knowledge over emotional sentiments. Results indicate that AI effectively minimizes sentiment-driven biases by 41.2%, suggesting its revolutionizing power in companies’ recruitment processes for improved equity and efficiency.

[AI-37] From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation AAAI-25

链接: https://arxiv.org/abs/2501.09858
作者: Peilang Li,Umer Siddique,Yongcan Cao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Accepted to Deployable AI (DAI) Workshop at the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

点击查看摘要

Abstract:Deep reinforcement learning (RL) has shown remarkable success in complex domains, however, the inherent black box nature of deep neural network policies raises significant challenges in understanding and trusting the decision-making processes. While existing explainable RL methods provide local insights, they fail to deliver a global understanding of the model, particularly in high-stakes applications. To overcome this limitation, we propose a novel model-agnostic approach that bridges the gap between explainability and interpretability by leveraging Shapley values to transform complex deep RL policies into transparent representations. The proposed approach offers two key contributions: a novel approach employing Shapley values to policy interpretation beyond local explanations and a general framework applicable to off-policy and on-policy algorithms. We evaluate our approach with three existing deep RL algorithms and validate its performance in two classic control environments. The results demonstrate that our approach not only preserves the original models’ performance but also generates more stable interpretable policies.
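
As background, the Shapley machinery itself is compact: each feature's attribution is its average marginal contribution over all coalitions. The sketch below computes exact Shapley values for a toy three-feature policy with baseline imputation; it illustrates the mechanism, not the paper's full interpretable-policy pipeline.

```python
import itertools, math
import numpy as np

# Exact Shapley attribution for a small policy. The toy "policy" and the
# zero baseline are assumptions; missing features are baseline-imputed.

def policy(x):                            # toy deterministic policy score
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[2]

def value(S, x, baseline):
    """Policy score with features outside coalition S set to baseline."""
    z = baseline.copy()
    for i in S:
        z[i] = x[i]
    return policy(z)

def shapley(x, baseline):
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                # Standard Shapley coalition weight |S|!(n-|S|-1)!/n!
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi[i] += w * (value(S + (i,), x, baseline) - value(S, x, baseline))
    return phi

x, base = np.array([1.0, 2.0, 3.0]), np.zeros(3)
phi = shapley(x, base)
print(phi, phi.sum(), policy(x) - policy(base))   # efficiency: sums match
```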

[AI-38] EVAL: EigenVector-based Average-reward Learning AAAI-25

链接: https://arxiv.org/abs/2501.09770
作者: Jacob Adamczyk,Volodymyr Makarenko,Stas Tiomkin,Rahul V. Kulkarni
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at the AAAI-25 8th Workshop on Generalization in Planning. arXiv admin note: text overlap with arXiv:2501.09080

点击查看摘要

Abstract:In reinforcement learning, two objective functions have been developed extensively in the literature: discounted and averaged rewards. The generalization to an entropy-regularized setting has led to improved robustness and exploration for both of these objectives. Recently, the entropy-regularized average-reward problem was addressed using tools from large deviation theory in the tabular setting. This method has the advantage of linearity, providing access to both the optimal policy and average reward-rate through properties of a single matrix. In this paper, we extend that framework to more general settings by developing approaches based on function approximation by neural networks. This formulation reveals new theoretical insights into the relationship between different objectives used in RL. Additionally, we combine our algorithm with a posterior policy iteration scheme, showing how our approach can also solve the average-reward RL problem without entropy-regularization. Using classic control benchmarks, we experimentally find that our method compares favorably with other algorithms in terms of stability and rate of convergence.
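
In the tabular, entropy-regularized setting the abstract refers to, the optimal reward rate can be read off the dominant eigenpair of a single tilted matrix, roughly the transition matrix reweighted by exponentiated rewards. The sketch below runs plain power iteration on a synthetic example; the matrix construction is an assumption for illustration.

```python
import numpy as np

# Power iteration on an assumed "tilted" matrix exp(beta*r) * P; the
# synthetic MDP and the exact tilting are illustrative assumptions, the
# point is the single-matrix eigenvector computation the abstract mentions.

rng = np.random.default_rng(6)
P = rng.random((4, 4)); P /= P.sum(axis=1, keepdims=True)   # transitions
r = rng.random(4)                                           # state rewards
beta = 1.0
M = np.exp(beta * r)[:, None] * P                           # tilted matrix

v = np.ones(4)
for _ in range(200):                                        # power iteration
    v = M @ v
    v /= np.linalg.norm(v)
lam = v @ M @ v / (v @ v)                                   # Rayleigh quotient
print("dominant eigenvalue:", lam, "-> reward rate ~ log(lam)/beta")
```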

[AI-39] Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR ICASSP2025

链接: https://arxiv.org/abs/2501.10256
作者: Karl El Hajal,Enno Hermann,Ajinkya Kulkarni,Mathew Magimai.-Doss
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted at ICASSP 2025 Satellite Workshop: Workshop on Speech Pathology Analysis and DEtection (SPADE)

点击查看摘要

Abstract:Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous works have addressed this by speaking rate modification to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which might not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech without further fine-tuning and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria. Code and audio samples are available at this https URL .

[AI-40] VERITAS: Verifying the Performance of AI-native Transceiver Actions in Base-Stations

链接: https://arxiv.org/abs/2501.09761
作者: Nasim Soltani,Michael Loehning,Kaushik Chowdhury
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Artificial Intelligence (AI)-native receivers prove significant performance improvement in high noise regimes and can potentially reduce communication overhead compared to the traditional receiver. However, their performance highly depends on the representativeness of the training dataset. A major issue is the uncertainty of whether the training dataset covers all test environments and waveform configurations, and thus, whether the trained model is robust in practical deployment conditions. To this end, we propose a joint measurement-recovery framework for AI-native transceivers post deployment, called VERITAS, that continuously looks for distribution shifts in the received signals and triggers finite re-training spurts. VERITAS monitors the wireless channel using 5G pilots fed to an auxiliary neural network that detects out-of-distribution channel profile, transmitter speed, and delay spread. As soon as such a change is detected, a traditional (reference) receiver is activated, which runs for a period of time in parallel to the AI-native receiver. Finally, VERITAS compares the bit probabilities of the AI-native and the reference receivers for the same received data inputs, and decides whether or not a retraining process needs to be initiated. Our evaluations reveal that VERITAS can detect changes in the channel profile, transmitter speed, and delay spread with 99%, 97%, and 69% accuracies, respectively, followed by timely initiation of retraining for 86%, 93.3%, and 94.8% of inputs in channel profile, transmitter speed, and delay spread test sets, respectively.

机器学习

[LG-0] Credit Risk Identification in Supply Chains Using Generative Adversarial Networks

链接: https://arxiv.org/abs/2501.10348
作者: Zizhou Zhang,Xinshi Li,Yu Cheng,Zhenrui Chen,Qianying Liu
类目: Machine Learning (cs.LG)
*备注: The paper will be published and indexed by IEEE at 2025 8th International Conference on Advanced Algorithms and Control Engineering (ICAACE 2025)

点击查看摘要

Abstract:Credit risk management within supply chains has emerged as a critical research area due to its significant implications for operational stability and financial sustainability. The intricate interdependencies among supply chain participants mean that credit risks can propagate across networks, with impacts varying by industry. This study explores the application of Generative Adversarial Networks (GANs) to enhance credit risk identification in supply chains. GANs enable the generation of synthetic credit risk scenarios, addressing challenges related to data scarcity and imbalanced datasets. By leveraging GAN-generated data, the model improves predictive accuracy while effectively capturing dynamic and temporal dependencies in supply chain data. The research focuses on three representative industries-manufacturing (steel), distribution (pharmaceuticals), and services (e-commerce) to assess industry-specific credit risk contagion. Experimental results demonstrate that the GAN-based model outperforms traditional methods, including logistic regression, decision trees, and neural networks, achieving superior accuracy, recall, and F1 scores. The findings underscore the potential of GANs in proactive risk management, offering robust tools for mitigating financial disruptions in supply chains. Future research could expand the model by incorporating external market factors and supplier relationships to further enhance predictive capabilities. Keywords- Generative Adversarial Networks (GANs); Supply Chain Risk; Credit Risk Identification; Machine Learning; Data Augmentation

[LG-1] ColNet: Collaborative Optimization in Decentralized Federated Multi-task Learning Systems

链接: https://arxiv.org/abs/2501.10347
作者: Chao Feng,Nicolas Fazli Kohler,Alberto Huertas Celdran,Gerome Bovet,Burkhard Stiller
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The integration of Federated Learning (FL) and Multi-Task Learning (MTL) has been explored to address client heterogeneity, with Federated Multi-Task Learning (FMTL) treating each client as a distinct task. However, most existing research focuses on data heterogeneity (e.g., addressing non-IID data) rather than task heterogeneity, where clients solve fundamentally different tasks. Additionally, much of the work relies on centralized settings with a server managing the federation, leaving the more challenging domain of decentralized FMTL largely unexplored. Thus, this work bridges this gap by proposing ColNet, a framework designed for heterogeneous tasks in decentralized federated environments. ColNet divides models into the backbone and task-specific layers, forming groups of similar clients, with group leaders performing conflict-averse cross-group aggregation. A pool of experiments with different federations demonstrated ColNet outperforms the compared aggregation schemes in decentralized settings with label and task heterogeneity scenarios.

[LG-2] Hybrid Deep Learning Model for epileptic seizure classification by using 1D-CNN with multi-head attention mechanism

链接: https://arxiv.org/abs/2501.10342
作者: Mohammed Guhdar,Ramadhan J. Mstafa,Abdulhakeem O. Mohammed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Epilepsy is a prevalent neurological disorder globally, impacting around 50 million people [WHO_epilepsy_50million]. Epileptic seizures result from sudden abnormal electrical activity in the brain, which can be read as sudden and significant changes in the EEG signal of the brain. The signal can vary in severity and frequency, which results in loss of consciousness and muscle contractions for a short period of time [epilepsyfoundation_myoclonic]. Individuals with epilepsy often face significant employment challenges due to safety concerns in certain work environments. Many jobs that involve working at heights, operating heavy machinery, or in other potentially hazardous settings may be restricted for people with seizure disorders. This certainly limits job options and economic opportunities for those living with epilepsy.
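
The architecture named in the title can be outlined in a few lines of PyTorch: a 1D-CNN front end, multi-head self-attention over the resulting time steps, and a classifier head. All layer sizes, kernel widths, and head counts below are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Skeleton matching the architecture named in the title -- a 1D-CNN front
# end followed by multi-head attention and a classifier head. Every size
# below is an illustrative assumption, not the authors' configuration.

class CNNAttentionSeizureNet(nn.Module):
    def __init__(self, n_channels=1, n_classes=2, d_model=64, n_heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                # x: (batch, channels, time)
        h = self.conv(x)                 # (batch, d_model, time/4)
        h = h.transpose(1, 2)            # attention expects (batch, seq, dim)
        h, _ = self.attn(h, h, h)        # self-attention over time steps
        return self.head(h.mean(dim=1))  # pool over time, then classify

eeg = torch.randn(8, 1, 512)             # a batch of single-channel EEG windows
print(CNNAttentionSeizureNet()(eeg).shape)   # torch.Size([8, 2])
```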

[LG-3] Towards Human-Guided Data-Centric LLM Co-Pilots

链接: https://arxiv.org/abs/2501.10321
作者: Evgeny Saveliev,Jiashuo Liu,Nabeel Seedat,Anders Boyd,Mihaela van der Schaar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Saveliev, Liu Seedat contributed equally

点击查看摘要

Abstract:Machine learning (ML) has the potential to revolutionize healthcare, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC’s ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains – healthcare, finance, social sciences and more – to actively participate in driving real-world impact using ML.

[LG-4] Pairwise Elimination with Instance-Dependent Guarantees for Bandits with Cost Subsidy

链接: https://arxiv.org/abs/2501.10290
作者: Ishank Juneja,Carlee Joe-Wong,Osman Yağan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-armed bandits (MAB) are commonly used in sequential online decision-making when the reward of each decision is an unknown random variable. In practice however, the typical goal of maximizing total reward may be less important than minimizing the total cost of the decisions taken, subject to a reward constraint. For example, we may seek to make decisions that have at least the reward of a reference "default" decision, with as low a cost as possible. This problem was recently introduced in the Multi-Armed Bandits with Cost Subsidy (MAB-CS) framework. MAB-CS is broadly applicable to problem domains where a primary metric (cost) is constrained by a secondary metric (reward), and the rewards are unknown. In our work, we address variants of MAB-CS including ones with reward constrained by the reward of a known reference arm or by the subsidized best reward. We introduce the Pairwise-Elimination (PE) algorithm for the known reference arm variant and generalize PE to PE-CS for the subsidized best reward variant. Our instance-dependent analysis of PE and PE-CS reveals that both algorithms have an order-wise logarithmic upper bound on Cost and Quality Regret, making our policies the first with such a guarantee. Moreover, by comparing our upper and lower bound results we establish that PE is order-optimal for all known reference arm problem instances. Finally, experiments are conducted using the MovieLens 25M and Goodreads datasets for both PE and PE-CS revealing the effectiveness of PE and the superior balance between performance and reliability offered by PE-CS compared to baselines from the literature.

[LG-5] Logarithmic Regret for Nonlinear Control

链接: https://arxiv.org/abs/2501.10261
作者: James Wang,Bruce D. Lee,Ingvar Ziemann,Nikolai Matni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We address the problem of learning to control an unknown nonlinear dynamical system through sequential interactions. Motivated by high-stakes applications in which mistakes can be catastrophic, such as robotics and healthcare, we study situations where it is possible for fast sequential learning to occur. Fast sequential learning is characterized by the ability of the learning agent to incur logarithmic regret relative to a fully-informed baseline. We demonstrate that fast sequential learning is achievable in a diverse class of continuous control problems where the system dynamics depend smoothly on unknown parameters, provided the optimal control policy is persistently exciting. Additionally, we derive a regret bound which grows with the square root of the number of interactions for cases where the optimal policy is not persistently exciting. Our results provide the first regret bounds for controlling nonlinear dynamical systems depending nonlinearly on unknown parameters. We validate the trends our theory predicts in simulation on a simple dynamical system.

[LG-6] Over-the-Air Multi-Sensor Inference with Neural Networks Using Memristor-Based Analog Computing

链接: https://arxiv.org/abs/2501.10245
作者: Busra Tegin,Muhammad Atif Ali,Tolga M Duman
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
*备注: 34 pages

点击查看摘要

Abstract:Deep neural networks provide reliable solutions for many classification and regression tasks; however, their application in real-time wireless systems with simple sensor networks is limited due to high energy consumption and significant bandwidth needs. This study proposes a multi-sensor wireless inference system with memristor-based analog computing. Given the sensors’ limited computational capabilities, the features from the network’s front end are transmitted to a central device where an L_p -norm inspired approximation of the maximum operation is employed to achieve transformation-invariant features, enabling efficient over-the-air transmission. We also introduce a trainable over-the-air sensor fusion method based on L_p -norm inspired combining function that customizes sensor fusion to match the network and sensor distribution characteristics, enhancing adaptability. To address the energy constraints of sensors, we utilize memristors, known for their energy-efficient in-memory computing, enabling analog-domain computations that reduce energy use and computational overhead in edge computing. This dual approach of memristors and L_p -norm inspired sensor fusion fosters energy-efficient computational and transmission paradigms and serves as a practical energy-efficient solution with minimal performance loss.
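
The L_p-norm approximation of the maximum is simple to verify numerically: for nonnegative x, (sum_i x_i^p)^(1/p) approaches max(x) as p grows, while remaining smooth, which is what makes it attractive for analog over-the-air combining. The p values below are illustrative.

```python
import numpy as np

# Numerical check of the L_p-norm approximation of max from the abstract:
# for nonnegative features, ||x||_p -> max(x) as p grows.

x = np.array([0.2, 0.9, 0.5, 0.7])

for p in (2, 4, 8, 16, 64):
    approx = np.sum(x ** p) ** (1.0 / p)
    print(f"p={p:>2}: Lp-norm = {approx:.4f}")

print("exact max:", x.max())   # the Lp values converge toward this
```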

[LG-7] SpaceTime: Causal Discovery from Non-Stationary Time Series

链接: https://arxiv.org/abs/2501.10235
作者: Sarah Mameche,Lénaïg Cornanguer,Urmi Ninad,Jilles Vreeken
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding causality is challenging and often complicated by changing causal relationships over time and across environments. Climate patterns, for example, shift over time with recurring seasonal trends, while also depending on geographical characteristics such as ecosystem variability. Existing methods for discovering causal graphs from time series either assume stationarity, do not permit both temporal and spatial distribution changes, or are unaware of locations with the same causal relationships. In this work, we therefore unify the three tasks of causal graph discovery in the non-stationary multi-context setting, of reconstructing temporal regimes, and of partitioning datasets and time intervals into those where invariant causal relationships hold. To construct a consistent score that forms the basis of our method, we employ the Minimum Description Length principle. Our resulting algorithm SPACETIME simultaneously accounts for heterogeneity across space and non-stationarity over time. Given multiple time series, it discovers regime changepoints and a temporal causal graph using non-parametric functional modeling and kernelized discrepancy testing. We also show that our method provides insights into real-world phenomena such as river-runoff measured at different catchments and biosphere-atmosphere interactions across ecosystems.

[LG-8] Counterfactual Explanations for k-means and Gaussian Clustering

链接: https://arxiv.org/abs/2501.10234
作者: Georgios Vardakas,Antonia Karra,Evaggelia Pitoura,Aristidis Likas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Counterfactuals have been recognized as an effective approach to explain classifier decisions. Nevertheless, they have not yet been considered in the context of clustering. In this work, we propose the use of counterfactuals to explain clustering solutions. First, we present a general definition for counterfactuals for model-based clustering that includes plausibility and feasibility constraints. Then we consider the counterfactual generation problem for k-means and Gaussian clustering assuming Euclidean distance. Our approach takes as input the factual, the target cluster, a binary mask indicating actionable or immutable features and a plausibility factor specifying how far from the cluster boundary the counterfactual should be placed. In the k-means clustering case, analytical mathematical formulas are presented for computing the optimal solution, while in the Gaussian clustering case (assuming full, diagonal, or spherical covariances) our method requires the numerical solution of a nonlinear equation with a single parameter only. We demonstrate the advantages of our approach through illustrative examples and quantitative experimental comparisons.
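
For intuition on the analytical case, the sketch below computes a two-cluster Euclidean k-means counterfactual geometrically: project the factual onto the bisecting hyperplane between its centroid and the target centroid, then push a plausibility margin past the boundary. This is a simplified illustration (ignoring actionability masks and multi-cluster checks), not necessarily the paper's exact closed form.

```python
import numpy as np

# Simplified two-cluster k-means counterfactual under Euclidean distance.
# The margin plays the role of a plausibility factor; masks and multi-
# cluster boundary checks from the paper are omitted.

def kmeans_counterfactual(x, c_src, c_tgt, margin=0.1):
    w = c_tgt - c_src                        # normal of the bisecting plane
    b = (c_tgt @ c_tgt - c_src @ c_src) / 2  # w @ z > b  <=>  closer to c_tgt
    step = (b - w @ x) / (w @ w)             # step along w to the boundary
    return x + (step + margin / np.linalg.norm(w)) * w

x = np.array([0.0, 0.0])
c_src, c_tgt = np.array([-1.0, 0.0]), np.array([2.0, 1.0])
cf = kmeans_counterfactual(x, c_src, c_tgt)
print(cf, np.linalg.norm(cf - c_src) > np.linalg.norm(cf - c_tgt))  # True
```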

[LG-9] Modelling Activity Scheduling Behaviour with Deep Generative Machine Learning

链接: https://arxiv.org/abs/2501.10221
作者: Fred Shone,Tim Hillel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We model human activity scheduling behaviour using a deep generative machine learning approach. Activity schedules, which represent the activities and associated travel behaviours of individuals, are a core component of many applied models in the transport, energy and epidemiology domains. Our data-driven approach learns human preferences and scheduling logic without the need for complex interacting combinations of sub-models and custom rules, which makes our approach significantly faster and simpler to operate than existing approaches. We find that activity schedule data combines aspects of both continuous image data and discrete text data, requiring novel approaches. We additionally contribute a novel schedule representation and a comprehensive evaluation framework for generated schedules. Evaluation shows our approach is able to rapidly generate large, diverse and realistic synthetic samples of activity schedules.

[LG-10] The Relevance of AWS Chronos: An Evaluation of Standard Methods for Time Series Forecasting with Limited Tuning

链接: https://arxiv.org/abs/2501.10216
作者: Matthew Baron,Alex Karpinski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a systematic comparison of Chronos, a transformer-based time series forecasting framework, against traditional approaches including ARIMA and Prophet. We evaluate these models across multiple time horizons and user categories, with a focus on the impact of historical context length. Our analysis reveals that while Chronos demonstrates superior performance for longer-term predictions and maintains accuracy with increased context, traditional models show significant degradation as context length increases. We find that prediction quality varies systematically between user classes, suggesting that underlying behavior patterns strongly influence model performance. This study makes a case for deploying Chronos in real-world applications where only limited model tuning is feasible, especially in scenarios requiring longer prediction horizons.
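
As a rough illustration of the evaluation protocol, the sketch below runs a classical baseline (statsmodels' ARIMA; Chronos itself is omitted) over increasing context lengths on a synthetic series. The series, model order, and lengths are placeholders, not the paper's setup:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=600))   # synthetic random-walk series
horizon = 24

for context in (50, 100, 200, 400):        # increasing history length
    train = series[-(context + horizon):-horizon]
    test = series[-horizon:]
    fit = ARIMA(train, order=(1, 1, 1)).fit()
    pred = fit.forecast(steps=horizon)
    mae = np.mean(np.abs(pred - test))
    print(f"context={context:4d}  MAE={mae:.3f}")
```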

[LG-11] Temporal Graph MLP Mixer for Spatio-Temporal Forecasting

链接: https://arxiv.org/abs/2501.10214
作者: Muhammad Bilal,Luis Carretero Lopez
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spatiotemporal forecasting is critical in applications such as traffic prediction, climate modeling, and environmental monitoring. However, the prevalence of missing data in real-world sensor networks significantly complicates this task. In this paper, we introduce the Temporal Graph MLP-Mixer (T-GMM), a novel architecture designed to address these challenges. The model combines node-level processing with patch-level subgraph encoding to capture localized spatial dependencies while leveraging a three-dimensional MLP-Mixer to handle temporal, spatial, and feature-based dependencies. Experiments on the AQI, ENGRAD, PV-US and METR-LA datasets demonstrate the model’s ability to effectively forecast even in the presence of significant missing data. While not surpassing state-of-the-art models in all scenarios, the T-GMM exhibits strong learning capabilities, particularly in capturing long-range dependencies. These results highlight its potential for robust, scalable spatiotemporal forecasting.
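
For readers unfamiliar with the Mixer family, a minimal two-axis MLP-Mixer block in PyTorch is sketched below; the paper's three-dimensional variant adds a third, spatial mixing axis. Dimensions and names are ours, not the authors':

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Standard MLP-Mixer block: one MLP mixes across tokens (here: time
    steps or graph patches), one across channels. T-GMM extends this idea
    with a third mixing axis."""
    def __init__(self, num_tokens: int, dim: int, hidden: int = 64):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                   # x: (batch, tokens, dim)
        y = self.norm1(x).transpose(1, 2)   # mix along the token axis
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))  # mix along the channel axis
        return x

x = torch.randn(8, 12, 32)                  # e.g. 12 time steps, 32 features
print(MixerBlock(num_tokens=12, dim=32)(x).shape)
```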

[LG-12] Surrogate-based multiscale analysis of experiments on thermoplastic composites under off-axis loading

链接: https://arxiv.org/abs/2501.10193
作者: M. A. Maia,I. B. C. M. Rocha,D. Kovačević,F. P. van der Meer
类目: Numerical Analysis (math.NA); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 21 pages. 31 figures

点击查看摘要

Abstract:In this paper, we present a surrogate-based multiscale approach to model constant strain-rate and creep experiments on unidirectional thermoplastic composites under off-axis loading. In previous contributions, these experiments were modeled through a single-scale micromechanical simulation under the assumption of macroscopic homogeneity. Although efficient and accurate in many scenarios, simulations with low off-axis angles showed significant discrepancies with the experiments. It was hypothesized that the mismatch was caused by macroscopic inhomogeneity, which would require a multiscale approach to capture it. However, full-field multiscale simulations remain computationally prohibitive. To address this issue, we replace the micromodel with a Physically Recurrent Neural Network (PRNN), a surrogate model that combines data-driven components with embedded constitutive models to capture history-dependent behavior naturally. The explainability of the latent space of this network is also explored in a transfer learning strategy that requires no re-training. With the surrogate-based simulations, we confirm the hypothesis raised on the inhomogeneity of the macroscopic strain field and gain insights into the influence of adjusting the experimental setup with oblique end-tabs. Results from the surrogate-based multiscale approach show better agreement with experiments than the single-scale micromechanical approach over a wide range of settings, although with limited accuracy on the creep experiments, where macroscopic test effects were implicitly taken into account in the material properties calibration.

[LG-13] Improved learning rates in multi-unit uniform price auctions NEURIPS2024

链接: https://arxiv.org/abs/2501.10181
作者: Marius Potfer,Dorian Baudry,Hugo Richard,Vianney Perchet,Cheng Wan
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Motivated by the strategic participation of electricity producers in the electricity day-ahead market, we study the problem of online learning in repeated multi-unit uniform price auctions, focusing on the adversarial opposing bid setting. The main contribution of this paper is the introduction of a new modeling of the bid space. Indeed, we prove that a learning algorithm leveraging the structure of this problem achieves a regret of \tilde{O}(K^{4/3}T^{2/3}) under bandit feedback, improving over the bound of \tilde{O}(K^{7/4}T^{3/4}) previously obtained in the literature. This improved regret rate is tight up to logarithmic terms. Inspired by electricity reserve markets, we further introduce a different feedback model under which all winning bids are revealed. This feedback interpolates between the full-information and bandit scenarios depending on the auctions’ results. We prove that, under this feedback, the algorithm that we propose achieves regret \tilde{O}(K^{5/2}\sqrt{T}).

[LG-14] Mean and Variance Estimation Complexity in Arbitrary Distributions via Wasserstein Minimization

链接: https://arxiv.org/abs/2501.10172
作者: Valentio Iverson,Stephen Vavasis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter estimation is a fundamental challenge in machine learning, crucial for tasks such as neural network weight fitting and Bayesian inference. This paper focuses on the complexity of estimating translation \boldsymbol{\mu} \in \mathbb{R}^l and shrinkage \sigma \in \mathbb{R}_{++} parameters for a distribution of the form \frac{1}{\sigma^l} f_0\left( \frac{\boldsymbol{x} - \boldsymbol{\mu}}{\sigma} \right), where f_0 is a known density in \mathbb{R}^l, given n samples. We highlight that while the problem is NP-hard for Maximum Likelihood Estimation (MLE), it is possible to obtain \varepsilon-approximations for arbitrary \varepsilon > 0 within \text{poly}\left( \frac{1}{\varepsilon} \right) time using the Wasserstein distance.

[LG-15] Convex Physics Informed Neural Networks for the Monge-Ampère Optimal Transport Problem

链接: https://arxiv.org/abs/2501.10162
作者: Alexandre Caboussat,Anna Peruso
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 17 pages, 14 figures. Submitted to Engineering Computations on 26 September 2024

点击查看摘要

Abstract:Optimal transportation of raw material from suppliers to customers is an issue arising in logistics that is addressed here with a continuous model relying on optimal transport theory. A physics-informed neural network method is advocated for the solution of the corresponding generalized Monge-Ampère equation. Convex neural networks are used to enforce the convexity of the solution to the Monge-Ampère equation and obtain a suitable approximation of the optimal transport map. A particular focus is set on the enforcement of transport boundary conditions in the loss function. Numerical experiments illustrate the solution to the optimal transport problem in several configurations, and sensitivity analyses are performed.

[LG-16] Visual Exploration of Stopword Probabilities in Topic Models

链接: https://arxiv.org/abs/2501.10137
作者: Shuangjiang Xue,Pierre Le Bras,David A. Robb,Mike J. Chantler,Stefano Padilla
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stopword removal is a critical stage in many Machine Learning methods but often receives little consideration, even though poorly handled stopwords interfere with model visualizations and disrupt user confidence. Inappropriately chosen or hastily omitted stopwords not only lead to suboptimal performance but also significantly affect the quality of models, thus reducing the willingness of practitioners and stakeholders to rely on the output visualizations. This paper proposes a novel extraction method that provides a corpus-specific probabilistic estimation of stopword likelihood and an interactive visualization system to support their analysis. We evaluated our approach and interface using real-world data, a commonly used Machine Learning method (Topic Modelling), and a comprehensive qualitative experiment probing user confidence. The results of our work show that our system increases user confidence in the credibility of topic models by (1) returning reasonable probabilities, (2) generating an appropriate and representative extension of common stopword lists, and (3) providing an adjustable threshold for estimating and analyzing stopwords visually. Finally, we discuss insights, recommendations, and best practices to support practitioners while improving the output of Machine Learning methods and topic model visualizations with robust stopword analysis and removal.
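
One simple, corpus-specific proxy for stopword likelihood, in the spirit of (but much cruder than) the paper's probabilistic estimator, scores words by how frequently and how evenly they spread across documents:

```python
import numpy as np
from collections import Counter

def stopword_scores(docs):
    """Corpus-specific stopword-likelihood proxy: words that are both
    frequent and spread evenly across documents score near 1. A simplified
    stand-in for the paper's estimator, not the authors' method."""
    vocab = sorted({w for d in docs for w in d})
    counts = np.array([[Counter(d)[w] for w in vocab] for d in docs], float)
    p = counts / counts.sum(axis=0).clip(min=1)     # word's spread over docs
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.nansum(np.where(p > 0, p * np.log(p), 0.0), axis=0)
    spread = ent / np.log(len(docs))                # normalized entropy in [0, 1]
    freq = counts.sum(axis=0) / counts.sum()
    return dict(zip(vocab, spread * (freq / freq.max())))

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(stopword_scores(docs))   # "the" scores highest
```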

[LG-17] Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders

链接: https://arxiv.org/abs/2501.10124
作者: Gongxu Luo,Haoyue Dai,Boyang Sun,Loka Li,Biwei Huang,Petar Stojanov,Kun Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gene Regulatory Network Inference (GRNI) aims to identify causal relationships among genes using gene expression data, providing insights into regulatory mechanisms. A significant yet often overlooked challenge is selection bias, a process where only cells meeting specific criteria, such as gene expression thresholds, survive or are observed, distorting the true joint distribution of genes and thus biasing GRNI results. Furthermore, gene expression is influenced by latent confounders, such as non-coding RNAs, which add complexity to GRNI. To address these challenges, we propose GISL (Gene Regulatory Network Inference in the presence of Selection bias and Latent confounders), a novel algorithm to infer true regulatory relationships in the presence of selection and confounding issues. Leveraging data obtained via multiple gene perturbation experiments, we show that the true regulatory relationships, as well as selection processes and latent confounders can be partially identified without strong parametric models and under mild graphical assumptions. Experimental results on both synthetic and real-world single-cell gene expression datasets demonstrate the superiority of GISL over existing methods.

[LG-18] PaSa: An LLM Agent for Comprehensive Academic Paper Search

链接: https://arxiv.org/abs/2501.10120
作者: Yichen He,Guanhua Huang,Peiyuan Feng,Yuan Lin,Yuchen Zhang,Hang Li,Weinan E
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4 for paraphrased queries, chatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50. It also exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at this https URL.

[LG-19] A recursive Bayesian neural network for constitutive modeling of sands under monotonic loading

链接: https://arxiv.org/abs/2501.10088
作者: Toiba Noor,Soban Nasir Lone,G. V. Ramana,Rajdip Nayek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In geotechnical engineering, constitutive models play a crucial role in describing soil behavior under varying loading conditions. Data-driven deep learning (DL) models offer a promising alternative for developing predictive constitutive models. When prediction is the primary focus, quantifying the predictive uncertainty of a trained DL model and communicating this uncertainty to end users is crucial for informed decision-making. This study proposes a recursive Bayesian neural network (rBNN) framework, which builds upon recursive feedforward neural networks (rFFNNs) by introducing generalized Bayesian inference for uncertainty quantification. A significant contribution of this work is the incorporation of a sliding window approach in rFFNNs, allowing the models to effectively capture temporal dependencies across load steps. The rBNN extends this framework by treating model parameters as random variables, with their posterior distributions inferred using generalized variational inference. The proposed framework is validated on two datasets: (i) a numerically simulated consolidated drained (CD) triaxial dataset employing a hardening soil model and (ii) an experimental dataset comprising 28 CD triaxial tests on Baskarp sand. Comparative analyses with LSTM, Bi-LSTM, and GRU models demonstrate that the deterministic rFFNN achieves superior predictive accuracy, attributed to its transparent structure and sliding window design. While the rBNN marginally trails in accuracy for the experimental case, it provides robust confidence intervals, addressing data sparsity and measurement noise in experimental conditions. The study underscores the trade-offs between deterministic and probabilistic approaches and the potential of rBNNs for uncertainty-aware constitutive modeling.
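
The sliding-window ingredient is straightforward to reproduce; a toy sketch with illustrative shapes:

```python
import numpy as np

# Sliding-window view over a load-step sequence, as used to feed the
# recursive network: each sample sees the previous `w` steps. Shapes and
# the strain series here are illustrative, not the paper's data.
strain = np.linspace(0.0, 0.1, 11)   # 11 load steps, one feature
w = 3
windows = np.lib.stride_tricks.sliding_window_view(strain, w)
print(windows.shape)                 # (9, 3): nine overlapping 3-step histories
print(windows[:2])
```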

[LG-20] Two-level Solar Irradiance Clustering with Season Identification: A Comparative Analysis

链接: https://arxiv.org/abs/2501.10084
作者: Roshni Agrawal,Sivakumar Subramanian,Venkataramana Runkana
类目: Machine Learning (cs.LG)
*备注: 30 pages, 9 figures, 6 tables

点击查看摘要

Abstract:Solar irradiance clustering can enhance solar power capacity planning and help improve forecasting models by identifying similar irradiance patterns influenced by seasonal and weather changes. In this study, we adopt an efficient two-level clustering approach: the first level automatically identifies seasons using the clear-sky irradiance, and the second level classifies the daily cloud level within each season as clear, cloudy, or partly cloudy. In the second level of clustering, three methods are compared, namely, Daily Irradiance Index (DII or \beta), Euclidean Distance (ED), and Dynamic Time Warping (DTW) distance. The DII is computed as the ratio of the time integral of measured irradiance to the time integral of the clear-sky irradiance. The identified clusters were compared quantitatively using established clustering metrics and qualitatively by comparing the mean irradiance profiles. The results clearly establish the superiority of the \beta-based clustering approach, setting a new benchmark for solar irradiance clustering studies. Moreover, \beta-based clustering remains effective even for annual data, unlike the time-series methods, which suffer significant performance degradation. Interestingly, contrary to expectations, ED-based clustering outperforms the more compute-intensive DTW distance-based clustering. The method has been rigorously validated using data from two distinct US locations, demonstrating robust scalability for larger datasets and potential applicability to other locations.
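
The DII itself is essentially a one-liner; a sketch with a toy clear-sky model and illustrative thresholds (the paper derives cluster boundaries from data rather than fixing them by hand):

```python
import numpy as np

def daily_irradiance_index(measured, clear_sky, t):
    """DII (beta): ratio of the time integral of measured irradiance to
    that of the clear-sky irradiance over one day."""
    return np.trapz(measured, t) / np.trapz(clear_sky, t)

t = np.linspace(6, 18, 49)                                      # daylight hours
clear = np.clip(np.sin(np.pi * (t - 6) / 12), 0, None) * 1000   # W/m^2, toy model
measured = 0.55 * clear                                         # attenuated day
beta = daily_irradiance_index(measured, clear, t)
# Illustrative thresholds only; the paper clusters beta values instead:
label = "clear" if beta > 0.8 else "partly cloudy" if beta > 0.4 else "cloudy"
print(round(beta, 2), label)
```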

[LG-21] PandaSkill – Player Performance and Skill Rating in Esports: Application to League of Legends

链接: https://arxiv.org/abs/2501.10049
作者: Maxime De Bois,Flora Parmentier,Raphaël Puget,Matthew Tanti,Jordan Peltier
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To take the esports scene to the next level, we introduce PandaSkill, a framework for assessing player performance and skill rating. Traditional rating systems like Elo and TrueSkill often overlook individual contributions and face challenges in professional esports due to limited game data and fragmented competitive scenes. PandaSkill leverages machine learning to estimate in-game player performance from individual player statistics. Each in-game role is modeled independently, ensuring a fair comparison between them. Then, using these performance scores, PandaSkill updates the player skill ratings using the Bayesian framework OpenSkill in a free-for-all setting. In this setting, skill ratings are updated solely based on performance scores rather than game outcomes, highlighting individual contributions. To address the challenge of isolated rating pools that hinder cross-regional comparisons, PandaSkill introduces a dual-rating system that combines players’ regional ratings with a meta-rating representing each region’s overall skill level. Applying PandaSkill to five years of professional League of Legends matches worldwide, we show that our method produces skill ratings that better predict game outcomes and align more closely with expert opinions compared to existing methods.

[LG-22] Sparse Binary Representation Learning for Knowledge Tracing

链接: https://arxiv.org/abs/2501.09893
作者: Yahya Badran,Christine Preisach
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge tracing (KT) models aim to predict students’ future performance based on their historical interactions. Most existing KT models rely exclusively on human-defined knowledge concepts (KCs) associated with exercises. As a result, the effectiveness of these models is highly dependent on the quality and completeness of the predefined KCs. Human errors in labeling and the cost of covering all potential underlying KCs can limit model performance. In this paper, we propose a KT model, Sparse Binary Representation KT (SBRKT), that generates new KC labels, referred to as auxiliary KCs, which can augment the predefined KCs to address the limitations of relying solely on human-defined KCs. These are learned through a binary vector representation, where each bit indicates the presence (one) or absence (zero) of an auxiliary KC. The resulting discrete representation allows these auxiliary KCs to be utilized in training any KT model that incorporates KCs. Unlike pre-trained dense embeddings, which are limited to models designed to accept such vectors, our discrete representations are compatible with both classical models, such as Bayesian Knowledge Tracing (BKT), and modern deep learning approaches. To generate this discrete representation, SBRKT employs a binarization method that learns a sparse representation, fully trainable via stochastic gradient descent. Additionally, SBRKT incorporates a recurrent neural network (RNN) to capture temporal dynamics and predict future student responses by effectively combining the auxiliary and predefined KCs. Experimental results demonstrate that SBRKT outperforms the tested baselines on several datasets and achieves competitive performance on others. Furthermore, incorporating the learned auxiliary KCs consistently enhances the performance of BKT across all tested datasets.
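
A generic sketch of the kind of trainable binarization such a sparse binary representation relies on, using a straight-through gradient (our stand-in, not the authors' code):

```python
import torch

def binarize_ste(logits: torch.Tensor) -> torch.Tensor:
    """Hard 0/1 auxiliary-KC vector in the forward pass, identity
    (straight-through) gradient in the backward pass."""
    soft = torch.sigmoid(logits)
    hard = (soft > 0.5).float()
    return soft + (hard - soft).detach()   # forward: hard, backward: d(soft)

logits = torch.randn(4, 16, requires_grad=True)   # 16 candidate auxiliary KCs
kc = binarize_ste(logits)                          # exact 0/1 values
kc.sum().backward()                                # gradients still reach logits
print(kc[0], logits.grad is not None)
```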

[LG-23] Geometry-Preserving Encoder/Decoder in Latent Generative Models

链接: https://arxiv.org/abs/2501.09876
作者: Wonjun Lee,Riley C.W. O’Neill,Dongmian Zou,Jeff Calder,Gilad Lerman
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 41 pages

点击查看摘要

Abstract:Generative modeling aims to generate new data samples that resemble a given dataset, with diffusion models recently becoming the most popular generative model. One of the main challenges of diffusion models is solving the problem in the input space, which tends to be very high-dimensional. Recently, solving diffusion models in the latent space through an encoder that maps from the data space to a lower-dimensional latent space has been considered to make the training process more efficient and has shown state-of-the-art results. The variational autoencoder (VAE) is the most commonly used encoder/decoder framework in this domain, known for its ability to learn latent representations and generate data samples. In this paper, we introduce a novel encoder/decoder framework with theoretical properties distinct from those of the VAE, specifically designed to preserve the geometric structure of the data distribution. We demonstrate the significant advantages of this geometry-preserving encoder in the training process of both the encoder and decoder. Additionally, we provide theoretical results proving convergence of the training process, including convergence guarantees for encoder training, and results showing faster convergence of decoder training when using the geometry-preserving encoder.

[LG-24] An LLM-Guided Tutoring System for Social Skills Training

链接: https://arxiv.org/abs/2501.09870
作者: Michael Guevarra(1),Indronil Bhattacharjee(2),Srijita Das(3),Christabel Wayllace(2),Carrie Demmans Epp(4),Matthew E. Taylor(4 and 5),Alan Tay(1) ((1) Illumia Labs, (2) New Mexico State University, (3) University of Michigan-Dearborn, (4) University of Alberta, (5) Alberta Machine Intelligence Institute)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Social skills training targets behaviors necessary for success in social interactions. However, traditional classroom training for such skills is often insufficient to teach effective communication – one-to-one interaction in real-world scenarios is preferred to lecture-style information delivery. This paper introduces a framework that allows instructors to collaborate with large language models to dynamically design realistic scenarios for students to communicate. Our framework uses these scenarios to enable student rehearsal, provide immediate feedback, and visualize performance for both students and instructors. Unlike traditional intelligent tutoring systems, instructors can easily co-create scenarios with a large language model without technical skills. Additionally, the system generates new scenario branches in real time when existing options do not fit the student’s response.

[LG-25] Learning Noisy Halfspaces with a Margin: Massart is No Harder than Random NEURIPS2024

链接: https://arxiv.org/abs/2501.09851
作者: Gautam Chandrasekaran,Vasilis Kontonis,Konstantinos Stavropoulos,Kevin Tian
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: Appeared in NeurIPS 2024

点击查看摘要

Abstract:We study the problem of PAC learning \gamma-margin halfspaces with Massart noise. We propose a simple proper learning algorithm, the Perspectron, that has sample complexity \widetilde{O}((\epsilon\gamma)^{-2}) and achieves classification error at most \eta + \epsilon, where \eta is the Massart noise rate. Prior works [DGT19, CKMY20] came with worse sample complexity guarantees (in both \epsilon and \gamma) or could only handle random classification noise [DDK+23, KIT+23] – a much milder noise assumption. We also show that our results extend to the more challenging setting of learning generalized linear models with a known link function under Massart noise, achieving a similar sample complexity to the halfspace case. This significantly improves upon the prior state-of-the-art in this setting due to [CKMY20], who introduced this model.

[LG-26] Coded Deep Learning: Framework and Algorithm

链接: https://arxiv.org/abs/2501.09849
作者: En-hui Yang,Shayan Mohajer Hamidi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The success of deep learning (DL) is often achieved with large models and high complexity during both training and post-training inferences, hindering training in resource-limited settings. To alleviate these issues, this paper introduces a new framework dubbed "coded deep learning" (CDL), which integrates information-theoretic coding concepts into the inner workings of DL, to significantly compress model weights and activations, reduce computational complexity at both training and post-training inference stages, and enable efficient model/data parallelism. Specifically, within CDL, (i) we first propose a novel probabilistic method for quantizing both model weights and activations, and its soft differentiable variant which offers an analytic formula for gradient calculation during training; (ii) both the forward and backward passes during training are executed over quantized weights and activations, eliminating most floating-point operations and reducing training complexity; (iii) during training, both weights and activations are entropy constrained so that they are compressible in an information-theoretic sense throughout training, thus reducing communication costs in model/data parallelism; and (iv) the trained model in CDL is by default in a quantized format with compressible quantized weights, reducing post-training inference and storage complexity. Additionally, a variant of CDL, namely relaxed CDL (R-CDL), is presented to further improve the trade-off between validation accuracy and compression, though it requires full precision in training, with the other advantageous features of CDL intact. Extensive empirical results show that CDL and R-CDL outperform state-of-the-art DNN compression algorithms in the literature.

[LG-27] pFedWN: A Personalized Federated Learning Framework for D2D Wireless Networks with Heterogeneous Data

链接: https://arxiv.org/abs/2501.09822
作者: Zhou Ni,Masoud Ghazikor,Morteza Hashemi
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 16 pages, 9 figures, 3 tables, submitted to Transactions on Networking

点击查看摘要

Abstract:Traditional Federated Learning (FL) approaches often struggle with data heterogeneity across clients, leading to suboptimal model performance for individual clients. To address this issue, Personalized Federated Learning (PFL) emerges as a solution to the challenges posed by non-independent and identically distributed (non-IID) and unbalanced data across clients. Furthermore, in most existing decentralized machine learning works, a perfect communication channel is considered for model parameter transmission between clients and servers. However, decentralized PFL over wireless links introduces new challenges, such as resource allocation and interference management. To overcome these challenges, we formulate a joint optimization problem that incorporates the underlying device-to-device (D2D) wireless channel conditions into a server-free PFL approach. The proposed method, dubbed pFedWN, optimizes the learning performance for each client while accounting for the variability in D2D wireless channels. To tackle the formulated problem, we divide it into two sub-problems: PFL neighbor selection and PFL weight assignment. The PFL neighbor selection is addressed through channel-aware neighbor selection within unlicensed spectrum bands such as ISM bands. Next, to assign PFL weights, we utilize the Expectation-Maximization (EM) method to evaluate the similarity between clients’ data and obtain optimal weight distribution among the chosen PFL neighbors. Empirical results show that pFedWN provides efficient and personalized learning performance with non-IID and unbalanced datasets. Furthermore, it outperforms the existing FL and PFL methods in terms of learning efficacy and robustness, particularly under dynamic and unpredictable wireless channel conditions.

[LG-28] BN-Pool: a Bayesian Nonparametric Approach to Graph Pooling

链接: https://arxiv.org/abs/2501.09821
作者: Daniele Castellana,Filippo Maria Bianchi
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:We introduce BN-Pool, the first clustering-based pooling method for Graph Neural Networks (GNNs) that adaptively determines the number of supernodes in a coarsened graph. By leveraging a Bayesian non-parametric framework, BN-Pool employs a generative model capable of partitioning graph nodes into an unbounded number of clusters. During training, we learn the node-to-cluster assignments by combining the supervised loss of the downstream task with an unsupervised auxiliary term, which encourages the reconstruction of the original graph topology while penalizing unnecessary proliferation of clusters. This adaptive strategy allows BN-Pool to automatically discover an optimal coarsening level, offering enhanced flexibility and removing the need to specify sensitive pooling ratios. We show that BN-Pool achieves superior performance across diverse benchmarks.

[LG-29] Graph Neural Networks for Travel Distance Estimation and Route Recommendation Under Probabilistic Hazards

链接: https://arxiv.org/abs/2501.09803
作者: Tong Liu,Hadi Meidani
类目: Machine Learning (cs.LG)
*备注: 17 pages, 11 figures

点击查看摘要

Abstract:Estimating the shortest travel time and providing route recommendations between different locations in a city or region can quantitatively measure the conditions of the transportation network during or after extreme events. One common approach is to use Dijkstra’s Algorithm, which produces the shortest path as well as the shortest distance. However, this option is computationally expensive when applied to large-scale networks. This paper proposes a novel fast framework based on graph neural networks (GNNs) which approximates the single-source shortest distance between pairs of locations, and subsequently predicts the single-source shortest path. We conduct multiple experiments on synthetic graphs of different sizes to demonstrate the feasibility and computational efficiency of the proposed model. In real-world case studies, we also applied the proposed method to flood risk analysis of coastal urban areas, calculating delays in evacuation to public shelters during hurricanes. The results indicate the accuracy and computational efficiency of the GNN model, and its potential for effective implementation in emergency planning and management.
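
For reference, the exact baseline the GNN is trained to approximate is plain Dijkstra; a compact version:

```python
import heapq

def dijkstra(adj, source):
    """Exact single-source shortest distances (the expensive baseline the
    GNN approximates). `adj[u]` maps each neighbor v to the edge weight."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                        # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

adj = {0: {1: 2.0, 2: 5.0}, 1: {2: 1.0, 3: 4.0}, 2: {3: 1.0}, 3: {}}
print(dijkstra(adj, 0))   # {0: 0.0, 1: 2.0, 2: 3.0, 3: 4.0}
```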

[LG-30] Multi-Head Self-Attending Neural Tucker Factorization

链接: https://arxiv.org/abs/2501.09776
作者: Yikai Hou,Peng Tang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quality-of-service (QoS) data exhibit dynamic temporal patterns that are crucial for accurately predicting missing values. These patterns arise from the evolving interactions between users and services, making it essential to capture the temporal dynamics inherent in such data for improved prediction performance. As the size and complexity of QoS datasets increase, existing models struggle to provide accurate predictions, highlighting the need for more flexible and dynamic methods to better capture the underlying patterns in large-scale QoS data. To address this issue, we introduce a neural network-based tensor factorization approach tailored for learning spatiotemporal representations of high-dimensional and incomplete (HDI) tensors, namely the Multi-head Self-attending Neural Tucker Factorization (MSNTucF). The model is elaborately designed for modeling the intricate nonlinear spatiotemporal feature interaction patterns hidden in real-world data with a two-fold idea. It first employs a neural network structure to generalize the traditional framework of Tucker factorization and then leverages a multi-head self-attending module to enforce nonlinear latent interaction learning. In empirical studies on two dynamic QoS datasets from real applications, the proposed MSNTucF model demonstrates superior performance compared to state-of-the-art benchmark models in estimating missing observations. This highlights its ability to learn non-linear spatiotemporal representations of HDI tensors.
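
As background, classical Tucker factorization reconstructs the tensor multilinearly from a small core and per-mode factor matrices; the paper replaces this fixed multilinear map with a neural, multi-head self-attending one. A numpy sketch of the classical case (sizes are illustrative):

```python
import numpy as np

# Classical Tucker reconstruction of a (user, service, time) QoS tensor
# from a core G and per-mode factors U, S, T: X_ijk = sum_abc G_abc U_ia S_jb T_kc.
I, J, K, r = 20, 15, 10, 4          # illustrative tensor sizes and rank
rng = np.random.default_rng(0)
G = rng.normal(size=(r, r, r))      # core tensor
U = rng.normal(size=(I, r))         # user factors
S = rng.normal(size=(J, r))         # service factors
T = rng.normal(size=(K, r))         # time factors

X_hat = np.einsum("abc,ia,jb,kc->ijk", G, U, S, T)
print(X_hat.shape)                  # (20, 15, 10)
```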

[LG-31] DADA: Dual Averaging with Distance Adaptation

链接: https://arxiv.org/abs/2501.10258
作者: Mohammad Moshtaghifar,Anton Rodomanov,Daniil Vankov,Sebastian Stich
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel universal gradient method for solving convex optimization problems. Our algorithm – Dual Averaging with Distance Adaptation (DADA) – is based on the classical scheme of dual averaging and dynamically adjusts its coefficients based on observed gradients and the distance between iterates and the starting point, eliminating the need for problem-specific parameters. DADA is a universal algorithm that simultaneously works for a broad spectrum of problem classes, provided the local growth of the objective function around its minimizer can be bounded. Particular examples of such problem classes are nonsmooth Lipschitz functions, Lipschitz-smooth functions, Hölder-smooth functions, functions with high-order Lipschitz derivative, quasi-self-concordant functions, and (L_0,L_1)-smooth functions. Crucially, DADA is applicable to both unconstrained and constrained problems, even when the domain is unbounded, without requiring prior knowledge of the number of iterations or desired accuracy.
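
A toy sketch of the dual-averaging-with-distance-adaptation idea on a simple quadratic (the paper's coefficient schedule is considerably more careful than this):

```python
import numpy as np

def dual_averaging(grad, x0, steps=200):
    """Dual averaging with a crude distance adaptation: each iterate is
    projected from the anchor x0 using the aggregated gradient, and the
    step scale tracks the largest observed distance from x0. A toy sketch
    of the scheme DADA refines, not the paper's exact coefficients."""
    x, z = x0.copy(), np.zeros_like(x0)
    g_sq, radius = 0.0, 1.0            # radius = initial distance estimate
    for _ in range(steps):
        g = grad(x)
        z += g                                           # dual (gradient) average
        g_sq += float(np.dot(g, g))
        radius = max(radius, float(np.linalg.norm(x - x0)))
        x = x0 - radius / np.sqrt(g_sq) * z              # primal step from anchor
    return x

f_grad = lambda x: 2.0 * (x - 3.0)                       # f(x) = ||x - 3||^2
print(dual_averaging(f_grad, np.zeros(2)))               # approx [3. 3.]
```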

[LG-32] Amortized Bayesian Mixture Models

链接: https://arxiv.org/abs/2501.10229
作者: Šimon Kucharský,Paul Christian Bürkner
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 34 pages, 17 figures

点击查看摘要

Abstract:Finite mixtures are a broad class of models useful in scenarios where observed data is generated by multiple distinct processes but without explicit information about the responsible process for each data point. Estimating Bayesian mixture models is computationally challenging due to issues such as high-dimensional posterior inference and label switching. Furthermore, traditional methods such as MCMC are applicable only if the likelihoods for each mixture component are analytically tractable. Amortized Bayesian Inference (ABI) is a simulation-based framework for estimating Bayesian models using generative neural networks. This allows the fitting of models without explicit likelihoods, and provides fast inference. ABI is therefore an attractive framework for estimating mixture models. This paper introduces a novel extension of ABI tailored to mixture models. We factorize the posterior into a distribution of the parameters and a distribution of (categorical) mixture indicators, which allows us to use a combination of generative neural networks for parameter inference, and classification networks for mixture membership identification. The proposed framework accommodates both independent and dependent mixture models, enabling filtering and smoothing. We validate and demonstrate our approach through synthetic and real-world datasets.

[LG-33] Provably Safeguarding a Classifier from OOD and Adversarial Samples: an Extreme Value Theory Approach

链接: https://arxiv.org/abs/2501.10202
作者: Nicolas Atienza,Christophe Labreuche,Johanne Cohen,Michele Sebag
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: under review

点击查看摘要

Abstract:This paper introduces a novel method, Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE), which transforms a classifier into an abstaining classifier, offering provable protection against out-of-distribution and adversarial samples. The approach is based on a Generalized Extreme Value (GEV) model of the training distribution in the classifier’s latent space, enabling the formal characterization of OOD samples. Interestingly, under mild assumptions, the GEV model also allows for formally characterizing adversarial samples. The abstaining classifier, which rejects samples based on their assessment by the GEV model, provably avoids OOD and adversarial samples. The empirical validation of the approach, conducted on various neural architectures (ResNet, VGG, and Vision Transformer) and medium and large-sized datasets (CIFAR-10, CIFAR-100, and ImageNet), demonstrates its frugality, stability, and efficiency compared to the state of the art.
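
A minimal sketch of the GEV-tail idea: fit a GEV to block maxima of in-distribution latent statistics and abstain above a high quantile. The block size, distance statistic, and threshold level here are illustrative simplifications of SPADE's construction:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
latent_train = rng.normal(size=(5000, 32))   # stand-in for latent features
center = latent_train.mean(axis=0)
dists = np.linalg.norm(latent_train - center, axis=1)

# Fit a GEV to block maxima of in-distribution distances.
maxima = dists.reshape(100, 50).max(axis=1)
c, loc, scale = genextreme.fit(maxima)
threshold = genextreme.ppf(0.99, c, loc=loc, scale=scale)

def abstain(z):
    """Reject (abstain on) samples whose latent distance is GEV-extreme."""
    return np.linalg.norm(z - center) > threshold

print(abstain(latent_train[0]), abstain(center + 100.0))   # False, True
```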

[LG-34] Contributions to the Decision Theoretic Foundations of Machine Learning and Robust Statistics under Weakly Structured Information

链接: https://arxiv.org/abs/2501.10195
作者: Christoph Jansen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Habilitation Thesis

点击查看摘要

Abstract:This habilitation thesis is cumulative and, therefore, is collecting and connecting research that I (together with several co-authors) have conducted over the last few years. Thus, the absolute core of the work is formed by the ten publications listed on page 5 under the name Contributions 1 to 10. The references to the complete versions of these articles are also found in this list, making them as easily accessible as possible for readers wishing to dive deep into the different research projects. The chapters following this thesis, namely Parts A to C and the concluding remarks, serve to place the articles in a larger scientific context, to (briefly) explain their respective content on a less formal level, and to highlight some interesting perspectives for future research in their respective contexts. Naturally, therefore, the following presentation has neither the level of detail nor the formal rigor that can (hopefully) be found in the papers. The purpose of the following text is to provide the reader with easy, high-level access to this interesting and important research field as a whole, thereby advertising it to a broader audience.

[LG-35] Double descent in quantum machine learning

链接: https://arxiv.org/abs/2501.10077
作者: Marie Kempkes,Aroosa Ijaz,Elies Gil-Fuster,Carlos Bravo-Prieto,Jakob Spiegelberg,Evert van Nieuwenburg,Vedran Dunjko
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The double descent phenomenon challenges traditional statistical learning theory by revealing scenarios where larger models do not necessarily lead to reduced performance on unseen data. While this counterintuitive behavior has been observed in a variety of classical machine learning models, particularly modern neural network architectures, it remains elusive within the context of quantum machine learning. In this work, we analytically demonstrate that quantum learning models can exhibit double descent behavior by drawing on insights from linear regression and random matrix theory. Additionally, our numerical experiments on quantum kernel methods across different real-world datasets and system sizes further confirm the existence of a test error peak, a characteristic feature of double descent. Our findings provide evidence that quantum models can operate in the modern, overparameterized regime without experiencing overfitting, thereby opening pathways to improved learning performance beyond traditional statistical learning theory.
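
The classical phenomenon is easy to reproduce with random-feature least squares, where test error peaks near the interpolation threshold (features ≈ samples) and falls again beyond it; the paper's contribution is establishing the analogue for quantum models. An illustrative classical demo (all sizes and the noise level are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, d = 100, 500, 20
w_true = rng.normal(size=d)
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y = X @ w_true + 0.5 * rng.normal(size=n)    # noisy training labels
yt = Xt @ w_true                              # noiseless test targets

for p in (20, 50, 90, 100, 110, 200, 800):    # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    F, Ft = np.tanh(X @ W), np.tanh(Xt @ W)
    beta, *_ = np.linalg.lstsq(F, y, rcond=None)   # min-norm solution
    # Expect the test MSE to spike near p = n and recover for large p.
    print(f"p={p:4d}  test MSE={np.mean((Ft @ beta - yt) ** 2):.2f}")
```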

[LG-36] Tracking student skills real-time through a continuous-variable dynamic Bayesian network

链接: https://arxiv.org/abs/2501.10050
作者: Hildo Bijl
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The field of Knowledge Tracing is focused on predicting the success rate of a student for a given skill. Modern methods like Deep Knowledge Tracing provide accurate estimates given enough data, but being based on neural networks they struggle to explain how these estimates are formed. More classical methods like Dynamic Bayesian Networks can do this, but they cannot give data on the accuracy of their estimates and often struggle to incorporate new observations in real-time due to their high computational load. This paper presents a novel method, Performance Distribution Tracing (PDT), in which the distribution of the success rate is traced live. It uses a Dynamic Bayesian Network with continuous random variables as nodes. By tracing the success rate distribution, there is always data available on the accuracy of any success rate estimation. In addition, it makes it possible to combine data from similar/related skills to come up with a more informed estimate of success rates. This makes it possible to predict exercise success rates, providing both explainability and an accuracy indication, even when an exercise requires a combination of different skills to solve. And through the use of the beta distribution functions as conjugate priors, all distributions are available in analytical form, allowing efficient online updates upon new observations. Experiments have shown that the resulting estimates generally feel sufficiently accurate to end-users such that they accept recommendations based on them.
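
The conjugate Beta update underlying this kind of distribution tracing fits in a few lines; PDT layers skill links and network structure on top of it:

```python
# Conjugate Beta update for tracing a success-rate *distribution*: each
# observed attempt updates the Beta pseudo-counts, and the posterior gives
# both an estimate and an accuracy indication, analytically.
class SkillTrace:
    def __init__(self, a: float = 1.0, b: float = 1.0):
        self.a, self.b = a, b          # Beta(a, b) prior over the success rate

    def observe(self, success: bool) -> None:
        if success:
            self.a += 1.0
        else:
            self.b += 1.0

    def estimate(self):
        mean = self.a / (self.a + self.b)
        var = self.a * self.b / ((self.a + self.b) ** 2 * (self.a + self.b + 1))
        return mean, var               # variance shrinks as evidence accumulates

s = SkillTrace()
for outcome in [True, True, False, True]:
    s.observe(outcome)
print(s.estimate())                    # Beta(4, 2): mean 0.667, small variance
```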

[LG-37] Statistical Inference for Sequential Feature Selection after Domain Adaptation

链接: https://arxiv.org/abs/2501.09933
作者: Duong Tan Loc,Nguyen Thang Loi,Vo Nguyen Le Duy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In high-dimensional regression, feature selection methods, such as sequential feature selection (SeqFS), are commonly used to identify relevant features. When data is limited, domain adaptation (DA) becomes crucial for transferring knowledge from a related source domain to a target domain, improving generalization performance. Although SeqFS after DA is an important task in machine learning, none of the existing methods can guarantee the reliability of its results. In this paper, we propose a novel method for testing the features selected by SeqFS-DA. The main advantage of the proposed method is its capability to control the false positive rate (FPR) below a significance level \alpha (e.g., 0.05). Additionally, a strategic approach is introduced to enhance the statistical power of the test. Furthermore, we provide extensions of the proposed method to SeqFS with model selection criteria including AIC, BIC, and adjusted R-squared. Extensive experiments are conducted on both synthetic and real-world datasets to validate the theoretical results and demonstrate the proposed method’s superior performance.

[LG-38] SBAMDT: Bayesian Additive Decision Trees with Adaptive Soft Semi-multivariate Split Rules

链接: https://arxiv.org/abs/2501.09900
作者: Stamatina Lamprinakou,Huiyan Sang,Bledar A. Konomi,Ligang Lu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Bayesian Additive Regression Trees [BART, Chipman et al., 2010] have gained significant popularity due to their remarkable predictive performance and ability to quantify uncertainty. However, standard decision tree models rely on recursive data splits at each decision node, using deterministic decision rules based on a single univariate feature. This approach limits their ability to effectively capture complex decision boundaries, particularly in scenarios involving multiple features, such as spatial domains, or when transitions are either sharp or smoothly varying. In this paper, we introduce a novel probabilistic additive decision tree model that employs a soft split rule. This method enables highly flexible splits that leverage both univariate and multivariate features, while also respecting the geometric properties of the feature domain. Notably, the probabilistic split rule adapts dynamically across decision nodes, allowing the model to account for varying levels of smoothness in the regression function. We demonstrate the utility of the proposed model through comparisons with existing tree-based models on synthetic datasets and a New York City education dataset.

[LG-39] CLAP-S: Support Set Based Adaptation for Downstream Fiber-optic Acoustic Recognition ICASSP2025

链接: https://arxiv.org/abs/2501.09877
作者: Jingchen Sun,Shaobo Han,Wataru Kohno,Changyou Chen
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Accepted to ICASSP 2025

点击查看摘要

Abstract:Contrastive Language-Audio Pretraining (CLAP) models have demonstrated unprecedented performance in various acoustic signal recognition tasks. Fiber-optic-based acoustic recognition is one of the most important downstream tasks and plays a significant role in environmental sensing. Adapting CLAP for fiber-optic acoustic recognition has become an active research area. As a non-conventional acoustic sensor, fiber-optic acoustic recognition presents a challenging, domain-specific, low-shot deployment environment with significant domain shifts due to unique frequency response and noise characteristics. To address these challenges, we propose a support-based adaptation method, CLAP-S, which linearly interpolates a CLAP Adapter with the Support Set, leveraging both implicit knowledge through fine-tuning and explicit knowledge retrieved from memory for cross-domain generalization. Experimental results show that our method delivers competitive performance on both laboratory-recorded fiber-optic ESC-50 datasets and a real-world fiber-optic gunshot-firework dataset. Our research also provides valuable insights for other downstream acoustic recognition tasks. The code and gunshot-firework dataset are available at this https URL.

[LG-40] Boosting the Accuracy of Stock Market Prediction via Multi-Layer Hybrid MTL Structure

链接: https://arxiv.org/abs/2501.09760
作者: Yuxi Hong
类目: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate stock market prediction provides great opportunities for informed decision-making, yet existing methods struggle with financial data’s non-linear, high-dimensional, and volatile characteristics. Advanced predictive models are needed to effectively address these complexities. This paper proposes a novel multi-layer hybrid multi-task learning (MTL) framework aimed at achieving more efficient stock market predictions. It combines a Transformer encoder to extract complex correspondences between various input features, a Bidirectional Gated Recurrent Unit (BiGRU) to capture long-term temporal relationships, and a Kolmogorov-Arnold Network (KAN) to enhance the learning process. Experimental evaluations indicate that the proposed learning structure achieves strong performance, with an MAE as low as 1.078, a MAPE as low as 0.012, and an R^2 as high as 0.98, when compared with other competitive networks.

信息检索

[IR-0] MechIR: A Mechanistic Interpretability Framework for Information Retrieval ECIR2025

链接: https://arxiv.org/abs/2501.10165
作者: Andrew Parry,Catherine Chen,Carsten Eickhoff,Sean MacAvaney
类目: Information Retrieval (cs.IR)
*备注: 5 pages, 2 figures, Accepted to ECIR 2025 as a Demo Paper

点击查看摘要

Abstract:Mechanistic interpretability is an emerging diagnostic approach for neural models that has gained traction in broader natural language processing domains. This paradigm aims to provide attribution to components of neural systems where causal relationships between hidden layers and output were previously uninterpretable. As the use of neural models in IR for retrieval and evaluation becomes ubiquitous, we need to ensure that we can interpret why a model produces a given output for both transparency and the betterment of systems. This work comprises a flexible framework for diagnostic analysis and intervention within these highly parametric neural systems specifically tailored for IR tasks and architectures. In providing such a framework, we look to facilitate further research in interpretable IR with a broader scope for practical interventions derived from mechanistic interpretability. We provide preliminary analysis and look to demonstrate our framework through an axiomatic lens to show its applications and ease of use for those IR practitioners inexperienced in this emerging paradigm.

[IR-1] A Worrying Reproducibility Study of Intent-Aware Recommendation Models

链接: https://arxiv.org/abs/2501.10143
作者: Faisal Shehzad,Maurizio Ferrari Dacrema,Dietmar Jannach
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Lately, we have observed a growing interest in intent-aware recommender systems (IARS). The promise of such systems is that they are capable of generating better recommendations by predicting and considering the underlying motivations and short-term goals of consumers. From a technical perspective, various sophisticated neural models were recently proposed in this emerging and promising area. In the broader context of complex neural recommendation models, a growing number of research works unfortunately indicates that (i) reproducing such works is often difficult and (ii) the true benefits of such models may be limited in reality, e.g., because the reported improvements were obtained through comparisons with untuned or weak baselines. In this work, we investigate if recent research in IARS is similarly affected by such problems. Specifically, we tried to reproduce five contemporary IARS models that were published in top-level outlets, and we benchmarked them against a number of traditional non-neural recommendation models. In two of the cases, running the provided code with the optimal hyperparameters reported in the paper did not yield the reported results. Worryingly, we find that all examined IARS approaches are consistently outperformed by at least one traditional model. These findings point to sustained methodological issues and to a pressing need for more rigorous scholarly practices.

[IR-2] Empirical Evaluation of Embedding Models in the Context of Text Classification in Document Review in Construction Delay Disputes

链接: https://arxiv.org/abs/2501.09859
作者: Fusheng Wei,Robert Neary,Han Qin,Qiang Mao,Jianping Zhang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Text embeddings are numerical representations of text data, where words, phrases, or entire documents are converted into vectors of real numbers. These embeddings capture semantic meanings and relationships between text elements in a continuous vector space. The primary goal of text embeddings is to enable the processing of text data by machine learning models, which require numerical input. Numerous embedding models have been developed for various applications. This paper presents our work in evaluating different embeddings through a comprehensive comparative analysis of four distinct models, focusing on their text classification efficacy. We employ both K-Nearest Neighbors (KNN) and Logistic Regression (LR) to perform binary classification tasks, specifically determining whether a text snippet is associated with ‘delay’ or ‘not delay’ within a labeled dataset. Our research explores the use of text snippet embeddings for training supervised text classification models to identify delay-related statements during the document review process of construction delay disputes. The results of this study highlight the potential of embedding models to enhance the efficiency and accuracy of document analysis in legal contexts, paving the way for more informed decision-making in complex investigative scenarios.
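
The classification setup itself is standard; a sketch with random placeholder embeddings (real ones would come from the four embedding models under study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Binary 'delay' vs 'not delay' classification on precomputed snippet
# embeddings, mirroring the paper's KNN/LR setup. The embedding matrix
# below is a random placeholder, and k=5 is our choice, not the paper's.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 384))        # 384-dim snippet embeddings
y = rng.integers(0, 2, size=400)       # 1 = 'delay', 0 = 'not delay'
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("LR", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))
```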

附件下载

点击下载今日全部论文列表