Arxiv今日论文 | 2025-01-15

本篇博文主要内容为 2025-01-15 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决如何评估大型语言模型（LLMs）在复杂策略游戏（如扑克）中的表现问题。扑克作为一种不完全信息游戏，要求玩家具备数学、推理、规划、策略以及对博弈论和人类心理的深刻理解，因此成为评估LLMs能力的理想场景。论文提出的解决方案是开发一个名为PokerBench的基准测试，该基准包含11,000个关键场景，分为翻牌前（pre-flop）和翻牌后（post-flop）两个阶段，并由专业扑克玩家协助开发。通过评估包括GPT-4、ChatGPT 3.5以及Llama和Gemma系列模型在内的多个先进LLMs，发现这些模型在未经微调的情况下表现不佳，但经过微调后性能显著提升。PokerBench的关键在于其能够快速、可靠地评估LLMs的扑克能力，并通过模型间的对战验证了基准分数与实际游戏胜率的相关性。此外，论文还指出简单的监督微调在优化游戏策略方面存在局限性，表明需要更先进的方法来有效训练LLMs在复杂游戏中的表现。

链接: https://arxiv.org/abs/2501.08328
作者: Richard Zhuang,Akshat Gupta,Richard Yang,Aniket Rahane,Zhengyu Li,Gopala Anumanchipalli
机构: 1. University of California, Berkeley(加州大学伯克利分校); 2. 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: AAAI 2025

点击查看摘要

Abstract:We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes Poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: \urlthis https URL.
zh

[NLP-1] Exploring Robustness of Multilingual LLM s on Real-World Noisy Data

【速读】：该论文旨在研究现实世界中的拼写错误对大型语言模型（LLMs）在不同自然语言处理（NLP）任务中性能的影响。具体来说，论文探讨了9个参数规模从0.2B到13B的语言模型在自然语言推理（NLI）、命名实体识别（NER）和意图分类（IC）三个任务中的表现。研究通过在6种不同语言上构建基于维基百科编辑历史的真实噪声词典，评估了这些模型在干净数据和噪声数据上的性能差异。结果表明，模型在干净数据和噪声数据上的平均性能差距在2.3到4.3个百分点之间。其中，mT5模型（特别是mT5 13B）在三个任务和六种语言中的四种中表现出更强的鲁棒性。解决方案的关键在于利用维基百科编辑历史构建真实噪声词典，并通过多语言和多任务的实验设计，系统评估了不同模型对拼写错误的鲁棒性。

链接: https://arxiv.org/abs/2501.08322
作者: Amirhossein Aliakbarzadeh,Lucie Flek,Akbar Karimi
机构: Conversational AI and Social Analytics (CAISA) Lab, University of Bonn, Germany(波恩大学); Lamarr Institute for Machine Learning and Artificial Intelligence, Germany(拉马尔机器学习和人工智能研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are trained on Web data that might contain spelling errors made by humans. But do they become robust to similar real-world noise? In this paper, we investigate the effect of real-world spelling mistakes on the performance of 9 language models, with parameters ranging from 0.2B to 13B, in 3 different NLP tasks, namely Natural Language Inference (NLI), Name Entity Recognition (NER), and Intent Classification (IC). We perform our experiments on 6 different languages and build a dictionary of real-world noise for them using the Wikipedia edit history. We show that the performance gap of the studied models on the clean and noisy test data averaged across all the datasets and languages ranges from 2.3 to 4.3 absolute percentage points. In addition, mT5 models, in general, show more robustness compared to BLOOM, Falcon, and BERT-like models. In particular, mT5 (13B), was the most robust on average overall, across the 3 tasks, and in 4 of the 6 languages.
zh

[NLP-2] Enhancing Automated Interpretability with Output-Centric Feature Descriptions

【速读】：该论文试图解决自动化可解释性管道（automated interpretability pipelines）在生成大型语言模型（LLMs）中特征的自然语言描述时，未能准确捕捉特征对模型输出的因果效应的问题。现有的方法主要依赖于激活特征的输入来生成描述，但这些描述无法充分反映特征激活如何影响模型输出。为了解决这一问题，论文提出了基于输出的高效方法，通过使用特征刺激后权重较高的词元（tokens）或直接将词汇“解嵌入”（unembedding）头应用于特征后的最高权重词元来生成特征描述。这些基于输出的描述方法能够更好地捕捉特征对模型输出的因果效应，而结合基于输入和基于输出的描述方法则能在输入和输出评估中取得最佳性能。此外，论文还展示了基于输出的描述方法可用于激活先前被认为“死亡”的特征。

链接: https://arxiv.org/abs/2501.08319
作者: Yoav Gur-Arieh,Roy Mayan,Chen Agassy,Atticus Geiger,Mor Geva
机构: Blavatnik School of Computer Science and AI, Tel Aviv University (特拉维夫大学布拉瓦特尼克计算机科学与人工智能学院); Pr(Ai)2R Group
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model’s representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary “unembedding” head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be “dead”.
zh

[NLP-3] MiniMax-01: Scaling Foundation Models with Lightning Attention

【速读】：该论文旨在解决大规模语言模型在处理长上下文时的计算效率和扩展性问题。解决方案的关键在于引入了闪电注意力机制（lightning attention）及其高效扩展技术，并结合专家混合模型（Mixture of Experts, MoE）来最大化计算能力。具体而言，模型包含32个专家，总参数量达到4560亿，每个token激活45.9亿参数。通过优化的并行策略和高效的计算-通信重叠技术，该模型能够在包含数百万token的上下文中进行高效的训练和推理。MiniMax-Text-01在训练时支持100万token的上下文窗口，推理时可外推至400万token，且成本可控。此外，MiniMax-VL-01通过5120亿视觉-语言token的持续训练构建，实验表明其在标准基准和内部基准上的性能与GPT-4o和Claude-3.5-Sonnet等顶尖模型相当，同时提供20-32倍的长上下文窗口。

链接: https://arxiv.org/abs/2501.08313
作者: MiniMax,Aonian Li,Bangwei Gong,Bo Yang,Boji Shan,Chang Liu,Cheng Zhu,Chunhao Zhang,Congchao Guo,Da Chen,Dong Li,Enwei Jiao,Gengxin Li,Guojun Zhang,Haohai Sun,Houze Dong,Jiadai Zhu,Jiaqi Zhuang,Jiayuan Song,Jin Zhu,Jingtao Han,Jingyang Li,Junbin Xie,Junhao Xu,Junjie Yan,Kaishun Zhang,Kecheng Xiao,Kexi Kang,Le Han,Leyang Wang,Lianfei Yu,Liheng Feng,Lin Zheng,Linbo Chai,Long Xing,Meizhi Ju,Mingyuan Chi,Mozhi Zhang,Peikai Huang,Pengcheng Niu,Pengfei Li,Pengyu Zhao,Qi Yang,Qidi Xu,Qiexiang Wang,Qin Wang,Qiuhui Li,Ruitao Leng,Shengmin Shi,Shuqi Yu,Sichen Li,Songquan Zhu,Tao Huang,Tianrun Liang,Weigao Sun,Weixuan Sun,Weiyu Cheng,Wenkai Li,Xiangjun Song,Xiao Su,Xiaodong Han,Xinjie Zhang,Xinzhu Hou,Xu Min,Xun Zou,Xuyang Shen,Yan Gong,Yingjie Zhu,Yipeng Zhou,Yiran Zhong,Yongyi Hu,Yuanxiang Fan,Yue Yu,Yufeng Yang,Yuhao Li,Yunan Huang,Yunji Li,Yunpeng Huang,Yunzhi Xu,Yuxin Mao,Zehan Li,Zekang Li,Zewei Tao,Zewen Ying,Zhaoyang Cong,Zhen Qin,Zhenhua Fan,Zhihang Yu,Zhuo Jiang,Zijia Wu
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: A technical report from MiniMax. The authors are listed in alphabetical order. We open-sourced our MiniMax-01 at this https URL

点击查看摘要

Abstract:We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at this https URL.
zh

[NLP-4] Everybody Likes to Sleep: A Computer-Assisted Comparison of Object Naming Data from 30 Languages

【速读】：该论文试图解决当前物体命名数据集（object naming datasets）缺乏透明性和结构高度特异性的问题。这些数据集通常由概念列表和图片配对组成，用于研究人类如何从视觉刺激中提取语义概念并选择物体名称。然而，现有数据集的结构不一致，难以进行跨语言和跨研究的比较。论文提出了一种多语言、计算机辅助的解决方案，通过将不同物体命名列表中的个体项目链接到统一的概念上，从而增强数据的透明性和可比性。该研究整合了来自10种不同语系的30种语言的17个物体命名数据集，并通过比较这些数据集中的重复概念以及与传统历史语言学和语言类型学中的基本词汇列表进行对比，展示了如何利用这一整合数据集进行跨语言物体命名研究。这一方法为未来的物体命名任务研究提供了基础和改进方向。

链接: https://arxiv.org/abs/2501.08312
作者: Alžběta Kučerová,Johann-Mattis List
机构: University of Passau (帕绍大学)
类目: Computation and Language (cs.CL)
备注: To appear in the Proceedings of the Global WordNet Conference 2025

点击查看摘要

Abstract:Object naming - the act of identifying an object with a word or a phrase - is a fundamental skill in interpersonal communication, relevant to many disciplines, such as psycholinguistics, cognitive linguistics, or language and vision research. Object naming datasets, which consist of concept lists with picture pairings, are used to gain insights into how humans access and select names for objects in their surroundings and to study the cognitive processes involved in converting visual stimuli into semantic concepts. Unfortunately, object naming datasets often lack transparency and have a highly idiosyncratic structure. Our study tries to make current object naming data transparent and comparable by using a multilingual, computer-assisted approach that links individual items of object naming lists to unified concepts. Our current sample links 17 object naming datasets that cover 30 languages from 10 different language families. We illustrate how the comparative dataset can be explored by searching for concepts that recur across the majority of datasets and comparing the conceptual spaces of covered object naming datasets with classical basic vocabulary lists from historical linguistics and linguistic typology. Our findings can serve as a basis for enhancing cross-linguistic object naming research and as a guideline for future studies dealing with object naming tasks.
zh

[NLP-5] A Survey on Pedophile Attribution Techniques for Online Platforms

【速读】：该论文试图解决社交媒体平台上匿名性带来的性侵者（sexual predators）难以识别和管理的问题。尽管匿名性和便捷的访问为用户提供了便利的沟通手段，但也使得保护弱势用户免受性侵者的侵害变得困难。论文的核心解决方案是通过自动化识别系统（automated identification system）来将性侵者与其发布的文本内容进行关联（attribution），从而实现更有效的识别和管理。论文进一步探讨了嫌疑人集的大小、文本长度对关联任务的影响，并回顾了常用的数据集、特征、分类技术以及性能评估方法。研究发现，尽管已有一些研究提出了减少在线性侵者风险的工具，但尚未有研究能够提供有效的嫌疑人关联方法。最后，论文还列出了若干开放的研究问题。

链接: https://arxiv.org/abs/2501.08296
作者: Hiba Fallatah,Ching Suen,Olga Ormandjieva
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures

点击查看摘要

Abstract:Reliance on anonymity in social media has increased its popularity on these platforms among all ages. The availability of public Wi-Fi networks has facilitated a vast variety of online content, including social media applications. Although anonymity and ease of access can be a convenient means of communication for their users, it is difficult to manage and protect its vulnerable users against sexual predators. Using an automated identification system that can attribute predators to their text would make the solution more attainable. In this survey, we provide a review of the methods of pedophile attribution used in social media platforms. We examine the effect of the size of the suspect set and the length of the text on the task of attribution. Moreover, we review the most-used datasets, features, classification techniques and performance measures for attributing sexual predators. We found that few studies have proposed tools to mitigate the risk of online sexual predators, but none of them can provide suspect attribution. Finally, we list several open research problems.
zh

[NLP-6] HALoGEN: Fantastic LLM Hallucinations and Where to Find Them

【速读】：该论文旨在解决生成式大语言模型（LLMs）在生成高质量和流畅文本时产生的幻觉（hallucinations）问题，即模型生成的陈述与已有世界知识或输入上下文不符。为了解决这一问题，论文提出了HALoGEN基准测试框架，该框架包括：（1）10,923个涵盖九个领域（如编程、科学归因和摘要生成）的提示（prompts），用于评估生成模型；（2）针对每个用例的自动高精度验证器，将LLM生成的内容分解为原子单元，并与高质量知识源进行比对验证。通过这一框架，论文评估了14个语言模型的约150,000个生成结果，发现即使表现最好的模型也存在大量幻觉（在某些领域中，生成的原子事实错误率高达86%）。此外，论文还提出了一种新的错误分类方法，将LLM幻觉分为三类：可能源于训练数据错误回忆的Type A错误、训练数据中错误知识的Type B错误，以及完全虚构的Type C错误。该框架为系统研究生成模型幻觉的原因提供了基础，并推动了可信赖大语言模型的发展。

链接: https://arxiv.org/abs/2501.08292
作者: Abhilasha Ravichander,Shrusti Ghela,David Wadden,Yejin Choi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.
zh

[NLP-7] AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

【速读】：该论文旨在解决全球南方地区（Global South）在仇恨言论（hate speech）和侮辱性语言（abusive language）识别与治理中的两个主要问题：一是缺乏有效的治理机制，二是过度依赖脱离上下文的关键词检测导致的审查问题。这些问题主要源于本地语言高质量数据的缺乏，以及本地社区在数据收集、标注和治理过程中的参与不足。为解决这些问题，论文提出了AfriHate，一个包含15种非洲语言的多语言仇恨言论和侮辱性语言数据集。该数据集的关键在于每个实例均由熟悉当地文化的母语者进行标注，确保了数据的文化相关性和准确性。此外，论文还探讨了数据集构建过程中的挑战，并提供了使用和不使用大语言模型（LLMs）的分类基线结果。

链接: https://arxiv.org/abs/2501.08284
作者: Shamsuddeen Hassan Muhammad,Idris Abdulmumin,Abinew Ali Ayele,David Ifeoluwa Adelani,Ibrahim Said Ahmad,Saminu Mohammad Aliyu,Nelson Odhiambo Onyango,Lilian D. A. Wanzare,Samuel Rutunda,Lukman Jibril Aliyu,Esubalew Alemneh,Oumaima Hourrane,Hagos Tesfahun Gebremichael,Elyas Abdi Ismail,Meriem Beloucif,Ebrahim Chekol Jibril,Andiswa Bukula,Rooweither Mabuya,Salomey Osei,Abigail Oppong,Tadesse Destaw Belay,Tadesse Kebede Guge,Tesfa Tegegne Asfaw,Chiamaka Ijeoma Chukwuneke,Paul Röttger,Seid Muhie Yimam,Nedjma Ousidhoum
机构: Imperial College London(帝国理工学院); Bayero University Kano(巴耶罗大学卡诺分校); DSFSI, University of Pretoria(比勒陀利亚大学 DSFSI); Bahir Dar University(巴赫达尔大学); Mila, McGill University & Canada CIFAR AI Chair(麦吉尔大学 Mila 研究所 & 加拿大 CIFAR AI 主席); Northeastern University(东北大学); Maseno University(马塞诺大学); Digital Umuganda(数字乌姆甘达); HausaNLP(豪萨语自然语言处理); Haramaya University(哈拉马亚大学); Al Akhawayn University(阿卡韦恩大学); Uppsala University(乌普萨拉大学); Istanbul Technical University(伊斯坦布尔技术大学); SADiLaR(南非数字语言资源); University of Deusto(德乌斯托大学); Independent Researcher(独立研究员); Instituto Politécnico Nacional(国立理工学院); Addis Ababa University(亚的斯亚贝巴大学); Lancaster University(兰卡斯特大学); Bocconi University(博科尼大学); University of Hamburg(汉堡大学); Cardiff University(卡迪夫大学); Wollo University(沃洛大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available on this https URL
zh

[NLP-8] Exploring Robustness of LLM s to Sociodemographically-Conditioned Paraphrasing

【速读】：该论文试图解决大型语言模型（LLMs）在不同语言风格和跨社会人口统计学维度上的可靠性问题。尽管LLMs在各种自然语言处理（NLP）任务中表现出色，但其在不同语言变体中的鲁棒性仍存在疑虑。论文通过扩展SocialIQA数据集，创建了基于社会人口统计学风格的不同释义集，以进行结构化可靠性测试，评估LLMs在生成具有社会人口统计学特征的释义（通过工程化提示）以及在现实世界复杂语言场景中的推理能力。解决方案的关键在于通过引入多样化的释义集，结合困惑度（perplexity）、可解释性（explainability）和ATOMIC性能等指标，进行细粒度的可靠性分析，从而揭示语言变体对模型性能的显著影响。

链接: https://arxiv.org/abs/2501.08276
作者: Pulkit Arora,Akbar Karimi,Lucie Flek
机构: Conversational AI and Social Analytics (CAISA) Lab, University of Bonn(波恩大学); Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习和人工智能研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive performance in various NLP tasks. However, there are concerns about their reliability in different domains of linguistic variations. Many works have proposed robustness evaluation measures for local adversarial attacks, but we need globally robust models unbiased to different language styles. We take a broader approach to explore a wider range of variations across sociodemographic dimensions to perform structured reliability tests on the reasoning capacity of language models. We extend the SocialIQA dataset to create diverse paraphrased sets conditioned on sociodemographic styles. The assessment aims to provide a deeper understanding of LLMs in (a) their capability of generating demographic paraphrases with engineered prompts and (b) their reasoning capabilities in real-world, complex language scenarios. We also explore measures such as perplexity, explainability, and ATOMIC performance of paraphrases for fine-grained reliability analysis of LLMs on these sets. We find that demographic-specific paraphrasing significantly impacts the performance of language models, indicating that the subtleties of language variations remain a significant challenge. The code and dataset will be made available for reproducibility and future research.
zh

[NLP-9] Comparative Analysis of Efficient Adapter-Based Fine-Tuning of State-of-the-Art Transformer Models

【速读】：该论文旨在研究不同适配器架构（adapter architectures）在监督学习任务中的有效性，具体包括SuperGLUE基准测试中的二元分类任务以及Kaggle上的多类新闻分类任务。论文通过比较三种Transformer模型（DistilBERT、ELECTRA和BART）在传统微调（fine-tuning）和九种最先进的适配器架构下的分类性能和计算复杂度，探讨适配器架构是否能够在显著减少训练时间的同时，达到与微调相当或更好的性能。研究结果表明，适配器架构在多种任务中表现优异，能够作为微调的高效且灵活的替代方案。该研究为在不同自然语言处理（NLP）应用中选择和实现适配器提供了有价值的见解和指导。

链接: https://arxiv.org/abs/2501.08271
作者: Saad Mashkoor Siddiqui,Mohammad Ali Sheikh,Muhammad Aleem,Kajol R Singh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this work, we investigate the efficacy of various adapter architectures on supervised binary classification tasks from the SuperGLUE benchmark as well as a supervised multi-class news category classification task from Kaggle. Specifically, we compare classification performance and time complexity of three transformer models, namely DistilBERT, ELECTRA, and BART, using conventional fine-tuning as well as nine state-of-the-art (SoTA) adapter architectures. Our analysis reveals performance differences across adapter architectures, highlighting their ability to achieve comparable or better performance relative to fine-tuning at a fraction of the training time. Similar results are observed on the new classification task, further supporting our findings and demonstrating adapters as efficient and flexible alternatives to fine-tuning. This study provides valuable insights and guidelines for selecting and implementing adapters in diverse natural language processing (NLP) applications.
zh

[NLP-10] Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

【速读】：该论文旨在解决长上下文语言模型（Long-Context Language Models, LCLMs）在检索增强生成（Retrieval-Augmented Generation, RAG）中的性能评估问题。现有基准测试（如LOFT）往往通过提供过于简化的上下文，高估了LCLMs的实际表现。为此，作者提出了一个新的基准测试ICR^2（In-Context Retrieval and Reasoning），通过引入由强检索器检索的混淆段落，评估LCLMs在更现实场景中的表现。解决方案的关键在于提出了三种方法来提升LCLMs的性能：(1) 检索后生成微调（retrieve-then-generate fine-tuning），(2) 检索注意力探测（retrieval-attention-probing），利用注意力头在解码过程中过滤和去噪长上下文，(3) 联合检索头与生成头的训练。这些方法显著提升了Mistral-7B模型在LOFT和ICR^2基准上的表现，甚至在某些任务上超越了GPT-4-Turbo。

链接: https://arxiv.org/abs/2501.08248
作者: Yifu Qiu,Varun Embar,Yizhe Zhang,Navdeep Jaitly,Shay B. Cohen,Benjamin Han
机构: University of Edinburgh(爱丁堡大学); Apple(苹果)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly – a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.
zh

[NLP-11] ASTRID – An Automated and Scalable TRIaD for the Evaluation of RAG -based Clinical Question Answering Systems

【速读】：该论文试图解决在临床问答（Clinical QA）系统中，现有的自动检索增强生成（Retrieval Augmented Generation, RAG）评估指标在临床和对话场景中表现不佳的问题。当前的自动评估指标难以准确反映模型响应的真实性，而依赖人工评估则成本高、难以扩展，且不利于RAG系统的持续迭代开发。为解决这些问题，论文提出了ASTRID（Automated and Scalable TRIaD）评估框架，该框架包含三个关键指标：上下文相关性（Context Relevance, CR）、拒绝准确性（Refusal Accuracy, RA）和对话忠实度（Conversational Faithfulness, CF）。其中，CF是专门设计用于在不惩罚对话元素的情况下，更好地捕捉模型响应与知识库的忠实度。通过在白内障手术随访等真实场景中验证，论文表明CF能够比现有指标更好地预测人工评估的忠实度评分，并且ASTRID框架的三个指标与临床医生对不适当、有害或无益响应的评估具有一致性。此外，论文还展示了这些指标在多种大语言模型（LLMs）上的表现与人工评估高度一致，证明了其在LLM驱动的自动评估流程中的潜力。

链接: https://arxiv.org/abs/2501.08208
作者: Mohita Chowdhury,Yajie Vera He,Aisling Higham,Ernest Lim
机构: Ufonia Limited; University of York (约克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive potential in clinical question answering (QA), with Retrieval Augmented Generation (RAG) emerging as a leading approach for ensuring the factual accuracy of model responses. However, current automated RAG metrics perform poorly in clinical and conversational use cases. Using clinical human evaluations of responses is expensive, unscalable, and not conducive to the continuous iterative development of RAG systems. To address these challenges, we introduce ASTRID - an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG - consisting of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). Our novel evaluation metric, CF, is designed to better capture the faithfulness of a model’s response to the knowledge base without penalising conversational elements. To validate our triad, we curate a dataset of over 200 real-world patient questions posed to an LLM-based QA agent during surgical follow-up for cataract surgery - the highest volume operation in the world - augmented with clinician-selected questions for emergency, clinical, and non-clinical out-of-domain scenarios. We demonstrate that CF can predict human ratings of faithfulness better than existing definitions for conversational use cases. Furthermore, we show that evaluation using our triad consisting of CF, RA, and CR exhibits alignment with clinician assessment for inappropriate, harmful, or unhelpful responses. Finally, using nine different LLMs, we demonstrate that the three metrics can closely agree with human evaluations, highlighting the potential of these metrics for use in LLM-driven automated evaluation pipelines. We also publish the prompts and datasets for these experiments, providing valuable resources for further research and development.
zh

[NLP-12] ArithmAttack: Evaluating Robustness of LLM s to Noisy Context in Math Problem Solving

【速读】：该论文旨在研究大语言模型（LLMs）在面对含有额外噪声（如标点符号）的输入时的鲁棒性（robustness）。尽管LLMs在数学问题解决任务中表现出色，但其对噪声输入的鲁棒性尚未得到充分研究。为此，作者提出了ArithmAttack方法，通过在不添加或删除上下文中的单词的情况下引入噪声（如标点符号），来评估LLMs的鲁棒性。该方法易于实现且不会导致信息丢失。实验在GSM8K和MultiArith数据集上对包括LLama3、Mistral和Mathstral在内的七种LLMs进行了评估，结果表明所有模型在面对此类噪声时均表现出脆弱性，且噪声越多，性能下降越明显。解决方案的关键在于通过ArithmAttack方法系统地引入噪声，从而揭示LLMs在噪声环境下的性能变化。

链接: https://arxiv.org/abs/2501.08203
作者: Zain Ul Abedin,Shahzeb Qamar,Lucie Flek,Akbar Karimi
机构: Conversational AI and Social Analytics (CAISA) Lab, University of Bonn, Germany(波恩大学); Lamarr Institute for Machine Learning and Artificial Intelligence, Germany(拉马尔机器学习和人工智能研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) have shown impressive capabilities in math problem-solving tasks, their robustness to noisy inputs is not well-studied. In this work, we propose ArithmAttack to examine how robust the LLMs are when they encounter noisy prompts that contain extra noise in the form of punctuation marks. While being easy to implement, ArithmAttack does not cause any information loss since words are not added or deleted from the context. We evaluate the robustness of seven LLMs, including LLama3, Mistral, and Mathstral, on noisy GSM8K and MultiArith datasets. Our experiments suggest that all the studied models show vulnerability to such noise, with more noise leading to poorer performances.
zh

[NLP-13] CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

【速读】：该论文试图解决大语言模型（LLMs）在生成代码时存在的安全隐患问题，特别是功能正确但存在漏洞的代码难以被开发者识别的情况。当前的评估基准（如CyberSecEval和SecurityEval）由于规范不清晰且不切实际，无法同时准确评估代码的功能性和安全性。为解决这一问题，论文提出了CWEval，一种新颖的结果驱动评估框架，旨在通过高质量的任务规范和结果驱动的测试预言机（test oracles）来同时评估代码的功能性和安全性。CWEval结合了CWEval-bench，一个多语言、安全关键的编码基准，提供了对LLM生成代码的严格实证安全评估，克服了以往基准的不足。通过评估，CWEval揭示了LLM生成的功能正确但不安全的代码比例较高，并指出了以往评估的严重不准确性，从而为安全代码生成领域做出了重要贡献。

链接: https://arxiv.org/abs/2501.08200
作者: Jinjun Peng,Leyi Cui,Kele Huang,Junfeng Yang,Baishakhi Ray
机构: Department of Computer Science, Columbia University (哥伦比亚大学计算机科学系)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: to be published in LLM4Code 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge, which poses considerable security risks of using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to solve it but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This framework not only assesses code functionality but also its security simultaneously with high-quality task specifications and outcome-driven test oracles which provides high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation on LLM-generated code, overcoming previous benchmarks’ shortcomings. Through our evaluations, CWEval reveals a notable portion of functional but insecure code produced by LLMs, and shows a serious inaccuracy of previous evaluations, ultimately contributing significantly to the field of secure code generation. We open-source our artifact at: this https URL .
zh

[NLP-14] OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

【速读】：该论文试图解决中文大语言模型（LLMs）在预训练过程中面临的高质量中文数据集稀缺的问题。这一问题限制了中文LLMs的性能表现。为解决这一挑战，论文提出了OpenCSG中文语料库（OpenCSG Chinese Corpus），该语料库包含多个高质量数据集，专门设计用于LLMs的预训练、后训练和微调。关键解决方案在于构建了四个具有不同特点的数据集：Fineweb-edu-chinese和Fineweb-edu-chinese-v2专注于从多样化的中文网络资源中筛选高质量内容；Cosmopedia-chinese提供合成、教科书风格的数据，适用于知识密集型训练；Smoltalk-chinese则强调风格多样化的聊天格式数据。这些数据集通过高质量文本、跨领域的广泛覆盖以及可扩展、可复现的数据处理流程，显著提升了中文LLMs在C-Eval等任务中的性能表现。

链接: https://arxiv.org/abs/2501.08197
作者: Yijiong Yu,Ziyun Dai,Zekun Wang,Wei Wang,Ran Chen,Ji Pei
机构: 未知
类目: Computation and Language (cs.CL)
备注: The datasets are available on this https URL ; The code is on this https URL

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.
zh

[NLP-15] A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following

【速读】：该论文试图解决在生命科学领域中，单细胞RNA测序（scRNA-seq）数据分析过程中存在的效率低下和操作不直观的问题。传统的分析工具在处理这种复杂的基因表达数据时，往往需要研究人员具备较高的技术门槛，且操作过程繁琐，限制了研究的深入和广泛性。为解决这一问题，论文提出了InstructCell，一种多模态AI助手，利用自然语言作为媒介，实现更直接和灵活的单细胞数据分析。其解决方案的关键在于构建了一个全面的多模态指令数据集，该数据集将基于文本的指令与来自不同组织和物种的scRNA-seq数据配对，并在此基础上开发了一种多模态细胞语言架构，能够同时解释和处理这两种模态。通过这种方式，研究人员可以使用简单的自然语言命令完成关键任务，如细胞类型注释、条件性伪细胞生成和药物敏感性预测，从而降低技术门槛，促进更深入的生物学洞察。

链接: https://arxiv.org/abs/2501.08187
作者: Yin Fang,Xinle Deng,Kangwei Liu,Ningyu Zhang,Jingyang Qian,Penghui Yang,Xiaohui Fan,Huajun Chen
机构: 1. Zhejiang University (浙江大学); 2. Alibaba-Zhejiang University Joint Institute of Frontier Technologies (阿里巴巴-浙江大学前沿技术联合研究院); 3. Zhejiang Lab (之江实验室); 4. College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院); 5. School of Software Technology, Zhejiang University (浙江大学软件学院); 6. Institute of Artificial Intelligence, Zhejiang Lab (之江实验室人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
备注: 37 pages; 13 figures; Code: this https URL Models: this https URL , this https URL

点击查看摘要

Abstract:Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the “language of cellular biology”, capturing intricate gene expression patterns at the single-cell level. However, interacting with this “language” through conventional tools is often inefficient and unintuitive, posing challenges for researchers. To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks-such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction-using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.
zh

[NLP-16] Potential and Perils of Large Language Models as Judges of Unstructured Textual Data

【速读】：该论文试图解决大语言模型（LLMs）在处理和总结非结构化文本数据（如开放式调查问卷）时，其生成的摘要是否能准确反映原始数据中的主题和情感的问题。随着组织越来越多地依赖这些强大的AI系统来分析文本反馈，确保LLMs输出的摘要与原始数据中的真实主题一致变得至关重要。论文的核心解决方案是采用LLMs作为“评判模型”（judge models），通过Anthropic Claude模型生成主题摘要，并使用Amazon的Titan Express、Nova Pro和Meta的Llama作为评判模型来评估这些摘要的主题一致性。研究通过Cohen’s kappa、Spearman’s rho和Krippendorff’s alpha等统计方法，将LLM评判结果与人类评估结果进行比较，验证了LLM评判模型在可扩展性方面与传统人类评估方法的可比性。研究结果表明，尽管LLMs在评判方面提供了可扩展的解决方案，但人类在检测细微的、特定于上下文的差异方面仍具有优势。

链接: https://arxiv.org/abs/2501.08167
作者: Rewina Bedemariam,Natalie Perez,Sreyoshi Bhaduri,Satya Kapoor,Alex Gil,Elizabeth Conjar,Ikkei Itoku,David Theil,Aman Chadha,Naumaan Nayyar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 11 pages, 1 appendix

点击查看摘要

Abstract:Rapid advancements in large language models have unlocked remarkable capabilities when it comes to processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises, can we trust LLMs to accurately represent the perspectives contained within these text based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for organizations. This research investigates the effectiveness of LLMs as judge models to evaluate the thematic alignment of summaries generated by other LLMs. We utilized an Anthropic Claude model to generate thematic summaries from open-ended survey responses, with Amazon’s Titan Express, Nova Pro, and Meta’s Llama serving as LLM judges. The LLM-as-judge approach was compared to human evaluations using Cohen’s kappa, Spearman’s rho, and Krippendorff’s alpha, validating a scalable alternative to traditional human centric evaluation methods. Our findings reveal that while LLMs as judges offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances. This research contributes to the growing body of knowledge on AI assisted text analysis. We discuss limitations and provide recommendations for future research, emphasizing the need for careful consideration when generalizing LLM judge models across various contexts and use cases.
zh

[NLP-17] Refusal Behavior in Large Language Models : A Nonlinear Perspective

【速读】：该论文旨在探讨大型语言模型（LLMs）中的拒绝行为（refusal behavior），即模型在面对有害、不道德或不适当的提示时选择不回应的情况，以确保其符合伦理标准。研究通过分析来自三种架构家族的六个LLMs，挑战了拒绝行为是线性现象的假设。关键解决方案是采用降维技术（dimensionality reduction techniques），包括主成分分析（PCA）、t-分布随机邻域嵌入（t-SNE）和均匀流形逼近与投影（UMAP），揭示了拒绝机制具有非线性和多维特性，且这些特性因模型架构和层次的不同而有所差异。这些发现强调了非线性可解释性在改进对齐研究和指导更安全的AI部署策略中的重要性。

链接: https://arxiv.org/abs/2501.08145
作者: Fabian Hildebrandt,Andreas Maier,Patrick Krauss,Achim Schilling
机构: FAU Erlangen-Nürnberg (埃尔朗根-纽伦堡大学); University Hospital Erlangen (埃尔朗根大学医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Refusal behavior in large language models (LLMs) enables them to decline responding to harmful, unethical, or inappropriate prompts, ensuring alignment with ethical standards. This paper investigates refusal behavior across six LLMs from three architectural families. We challenge the assumption of refusal as a linear phenomenon by employing dimensionality reduction techniques, including PCA, t-SNE, and UMAP. Our results reveal that refusal mechanisms exhibit nonlinear, multidimensional characteristics that vary by model architecture and layer. These findings highlight the need for nonlinear interpretability to improve alignment research and inform safer AI deployment strategies.
zh

[NLP-18] In-situ graph reasoning and knowledge expansion using Graph-PReFLexOR

链接: https://arxiv.org/abs/2501.08120
作者: Markus J. Buehler
机构: 未知
类目: Artificial Intelligence (cs.AI); Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-19] Consistency of Responses and Continuations Generated by Large Language Models on Social Media

【速读】：该论文旨在探讨大型语言模型（LLMs）在处理社交媒体内容时的情感一致性和语义连贯性问题。具体而言，研究通过分析Twitter和Reddit上关于气候变化的讨论，评估了两种开源模型（Gemma和Llama）在情感过渡、情感强度模式以及语义相似性方面的表现。研究的关键解决方案包括使用延续任务和回应任务来评估模型生成内容的情感一致性和语义连贯性。研究发现，尽管两种模型在语义连贯性上表现良好，但在情感处理上存在差异：Gemma倾向于放大负面情绪（尤其是愤怒），同时保持一定的积极情绪（如乐观）；而Llama在更广泛的情感谱系中表现出更好的情感保持能力。此外，两种模型生成的回应情感强度普遍低于人类撰写的内容，并在回应任务中表现出对积极情绪的偏向。这些发现为LLMs在社交媒体环境中部署及人机交互设计提供了重要见解。

链接: https://arxiv.org/abs/2501.08102
作者: Wenlu Fan,Yuqi Zhu,Chenyang Wang,Bin Wang,Wentao Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using two open-source models: Gemma and Llama. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic similarity between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: Gemma shows a tendency toward negative emotion amplification, particularly anger, while maintaining certain positive emotions like optimism. Llama demonstrates superior emotional preservation across a broader spectrum of affects. Both models systematically generate responses with attenuated emotional intensity compared to human-authored content and show a bias toward positive emotions in response tasks. Additionally, both models maintain strong semantic similarity with original texts, though performance varies between continuation and response tasks. These findings provide insights into LLMs’ emotional and semantic processing capabilities, with implications for their deployment in social media contexts and human-AI interaction design.
zh

[NLP-20] Dynamic Multimodal Sentiment Analysis: Leverag ing Cross-Modal Attention for Enabled Classification

【速读】：该论文旨在解决多模态情感分析（multimodal sentiment analysis）中的情感分类问题，通过整合文本、音频和视觉数据来提升情感检测的准确性。其核心解决方案在于探索三种特征融合策略——晚期融合（late stage fusion）、早期融合（early stage fusion）和多头注意力机制（multi-headed attention），并在基于Transformer的架构中进行实验验证。研究结果表明，早期融合策略显著优于晚期融合，准确率达到71.87%，而多头注意力机制仅带来边际提升，准确率为72.39%。这表明在情感分类过程中，早期整合多模态数据能够更有效地捕捉模态间的复杂交互，而注意力机制在当前框架下的作用有限。未来研究将集中于优化特征融合技术、引入时序数据以及探索动态特征加权，以进一步提升模型性能。

链接: https://arxiv.org/abs/2501.08085
作者: Hui Lee,Singh Suniljit,Yong Siang Ong
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper explores the development of a multimodal sentiment analysis model that integrates text, audio, and visual data to enhance sentiment classification. The goal is to improve emotion detection by capturing the complex interactions between these modalities, thereby enabling more accurate and nuanced sentiment interpretation. The study evaluates three feature fusion strategies – late stage fusion, early stage fusion, and multi-headed attention – within a transformer-based architecture. Experiments were conducted using the CMU-MOSEI dataset, which includes synchronized text, audio, and visual inputs labeled with sentiment scores. Results show that early stage fusion significantly outperforms late stage fusion, achieving an accuracy of 71.87%, while the multi-headed attention approach offers marginal improvement, reaching 72.39%. The findings suggest that integrating modalities early in the process enhances sentiment classification, while attention mechanisms may have limited impact within the current framework. Future work will focus on refining feature fusion techniques, incorporating temporal data, and exploring dynamic feature weighting to further improve model performance.
zh

[NLP-21] Exploring Narrative Clustering in Large Language Models : A Layerwise Analysis of BERT

【速读】：该论文旨在探究基于Transformer的大型语言模型BERT的内部机制，特别是其在处理叙事内容和作者风格时的分层聚类能力。研究通过使用GPT-4生成的数据集，分析了BERT各层的激活模式，揭示了其在语义内容和风格特征上的处理差异。关键解决方案包括使用主成分分析（PCA）和多维尺度分析（MDS）等降维技术，发现BERT在后期层中对叙事内容表现出强烈的聚类能力，形成紧凑且独特的聚类，而对特定作者的风格特征则表现出较弱的聚类能力。这些发现表明BERT在处理语言信息时更优先考虑语义内容而非风格特征，为理解Transformer模型的表征能力和处理层次提供了新的见解。

链接: https://arxiv.org/abs/2501.08053
作者: Awritrojit Banerjee,Achim Schilling,Patrick Krauss
机构: FAU Erlangen-Nürnberg (埃尔朗根-纽伦堡大学); University Hospital Erlangen (埃尔朗根大学医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2408.03062 , arXiv:2408.04270 , arXiv:2307.01577

点击查看摘要

Abstract:This study investigates the internal mechanisms of BERT, a transformer-based large language model, with a focus on its ability to cluster narrative content and authorial style across its layers. Using a dataset of narratives developed via GPT-4, featuring diverse semantic content and stylistic variations, we analyze BERT’s layerwise activations to uncover patterns of localized neural processing. Through dimensionality reduction techniques such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), we reveal that BERT exhibits strong clustering based on narrative content in its later layers, with progressively compact and distinct clusters. While strong stylistic clustering might occur when narratives are rephrased into different text types (e.g., fables, sci-fi, kids’ stories), minimal clustering is observed for authorial style specific to individual writers. These findings highlight BERT’s prioritization of semantic content over stylistic features, offering insights into its representational capabilities and processing hierarchy. This study contributes to understanding how transformer models like BERT encode linguistic information, paving the way for future interdisciplinary research in artificial intelligence and cognitive neuroscience.
zh

[NLP-22] READ: Reinforcement-based Adversarial Learning for Text Classification with Limited Labeled Data

【速读】：该论文试图解决在文本分类任务中，预训练模型（如BERT）需要大量标注数据才能取得优异性能的问题。由于获取标注数据通常成本高且耗时，而收集未标注数据则相对便宜，因此论文提出了一种结合强化学习（Reinforcement Learning）和半监督对抗学习（Semi-supervised Adversarial Learning）的新方法READ（Reinforcement-based Adversarial learning）。该方法的关键在于利用未标注数据集，通过强化学习生成多样化的合成文本，并通过对抗学习提升模型的泛化能力。实验结果表明，READ在多个数据集上优于现有的最先进方法。

链接: https://arxiv.org/abs/2501.08035
作者: Rohit Sharma,Shanu Kumar,Avinash Kumar
机构: Microsoft Corporation, India (微软公司, 印度)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained transformer models such as BERT have shown massive gains across many text classification tasks. However, these models usually need enormous labeled data to achieve impressive performances. Obtaining labeled data is often expensive and time-consuming, whereas collecting unlabeled data using some heuristics is relatively much cheaper for any task. Therefore, this paper proposes a method that encapsulates reinforcement learning-based text generation and semi-supervised adversarial learning approaches in a novel way to improve the model’s performance. Our method READ, Reinforcement-based Adversarial learning, utilizes an unlabeled dataset to generate diverse synthetic text through reinforcement learning, improving the model’s generalization capability using adversarial learning. Our experimental results show that READ outperforms the existing state-of-art methods on multiple datasets.
zh

[NLP-23] riAdaptLoRA: Brain-Inspired Triangular Adaptive Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

【速读】：该论文旨在解决在大规模语言模型（LLMs）微调过程中，全参数微调（full fine-tuning）虽然性能优越但计算和资源成本高昂的问题。现有的参数高效微调（Parameter-Efficient Fine-Tuning, PEFT）方法，如LoRA，虽然通过减少可训练参数数量来应对这一挑战，但在秩调整效率和任务特定适应性方面仍存在不足。为此，论文提出了一种基于神经科学原理的新型PEFT框架——三角自适应低秩适应（Triangular Adaptive Low-Rank Adaptation, TriAdaptLoRA）。其关键创新包括：1）将变换矩阵三角分解为下三角和上三角部分以最大化参数利用率；2）基于归一化Frobenius范数的参数重要性度量，以实现高效适应；3）通过动态阈值控制的自适应秩增长策略，允许在训练步骤中灵活分配参数。实验表明，TriAdaptLoRA在多种自然语言理解和生成任务中均优于现有PEFT方法，表现出更高的性能、稳定性以及更低的计算开销，特别是在线性阈值驱动的秩增长条件下。

链接: https://arxiv.org/abs/2501.08008
作者: Yao Liang,Yuwei Wang,Yi Zeng
机构: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China(中国科学院自动化研究所脑认知与脑机融合实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学人工智能学院); Center for Long-term Artificial Intelligence, Beijing 100190, China(长期人工智能研究中心); University of Chinese Academy of Sciences, Beijing 100083, China(中国科学院大学); Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Shanghai, 200031, China(中国科学院脑认知与类脑智能技术重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The fine-tuning of Large Language Models (LLMs) is pivotal for achieving optimal performance across diverse downstream tasks. However, while full fine-tuning delivers superior results, it entails significant computational and resource costs. Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, address these challenges by reducing the number of trainable parameters, but they often struggle with rank adjustment efficiency and task-specific adaptability. We propose Triangular Adaptive Low-Rank Adaptation (TriAdaptLoRA), a novel PEFT framework inspired by neuroscience principles, which dynamically optimizes the allocation of trainable parameters. TriAdaptLoRA introduces three key innovations: 1) a triangular split of transformation matrices into lower and upper triangular components to maximize parameter utilization, 2) a parameter importance metric based on normalized Frobenius norms for efficient adaptation, and 3) an adaptive rank-growth strategy governed by dynamic thresholds, allowing flexible parameter allocation across training steps. Experiments conducted on a variety of natural language understanding and generation tasks demonstrate that TriAdaptLoRA consistently outperforms existing PEFT methods. It achieves superior performance, enhanced stability, and reduced computational overhead, particularly under linear threshold-driven rank growth. These results highlight its efficacy as a scalable and resource-efficient solution for fine-tuning LLMs.
zh

[NLP-24] Formalising lexical and syntactic diversity for data sampling in French

【速读】：该论文旨在解决数据集中多样性（diversity）采样的问题，特别是在创建数据集时如何有效地增加样本的多样性。由于寻找最优多样性样本的计算成本较高，作者提出了一种启发式方法，相较于随机采样，能够显著提高样本的多样性。此外，论文还探讨了不同类型的多样性（如词汇多样性（lexical diversity）和句法多样性（syntactic diversity））之间是否存在相关性，目的是通过低成本的词汇多样性采样来间接获取高成本的句法多样性样本。研究发现，不同数据集和多样性度量版本之间的相关性存在波动，这表明任意选择的多样性度量可能无法充分捕捉数据集中的多样性相关特性。因此，解决方案的关键在于提出一种高效的启发式采样方法，并揭示多样性度量选择对结果的影响。

链接: https://arxiv.org/abs/2501.08003
作者: Louis Estève,Manon Scholivet,Agata Savary
机构: Université Paris-Saclay (巴黎萨克雷大学); CNRS (法国国家科学研究中心); LISN (Laboratoire Interdisciplinaire des Sciences du Numérique, 数字科学跨学科实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diversity is an important property of datasets and sampling data for diversity is useful in dataset creation. Finding the optimally diverse sample is expensive, we therefore present a heuristic significantly increasing diversity relative to random sampling. We also explore whether different kinds of diversity – lexical and syntactic – correlate, with the purpose of sampling for expensive syntactic diversity through inexpensive lexical diversity. We find that correlations fluctuate with different datasets and versions of diversity measures. This shows that an arbitrarily chosen measure may fall short of capturing diversity-related properties of datasets.
zh

[NLP-25] “Wait did you mean the doctor?”: Collecting a Dialogue Corpus for Topical Analysis

【速读】：该论文试图解决的问题是如何在自然对话中识别和分析话题的组织结构，以及人们如何识别当前话题。由于现有文献中关于非正式对话中话题组织的描述较少，且分析对话话题需要足够长的对话数据以包含多个话题和话题转换类型，这类数据的收集和标注较为复杂。论文提出的解决方案是通过开发一个消息工具来进行对话收集实验，旨在构建一个适合话题分析的语料库。该解决方案的关键在于使用专门设计的工具来收集和标注长对话数据，以便进行深入的话题分析。

链接: https://arxiv.org/abs/2501.07947
作者: Amandine Decker(LORIA, UL, CNRS, SEMAGRAMME, GU),Vincent Tourneur(LORIA, UL, CNRS, SEMAGRAMME),Maxime Amblard(SEMAGRAMME, LORIA),Ellen Breitholtz(GU)
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Dialogue is at the core of human behaviour and being able to identify the topic at hand is crucial to take part in conversation. Yet, there are few accounts of the topical organisation in casual dialogue and of how people recognise the current topic in the literature. Moreover, analysing topics in dialogue requires conversations long enough to contain several topics and types of topic shifts. Such data is complicated to collect and annotate. In this paper we present a dialogue collection experiment which aims to build a corpus suitable for topical analysis. We will carry out the collection with a messaging tool we developed.
zh

[NLP-26] Gandalf the Red: Adaptive Security for LLM s

【速读】：该论文试图解决当前大型语言模型（LLM）应用中针对提示攻击（prompt attacks）防御评估的两个关键问题：一是对抗行为的动态性（dynamic nature of adversarial behavior），二是防御措施对合法用户可用性（usability）的负面影响。论文提出了D-SEC（Dynamic Security Utility Threat Model）模型，该模型明确区分攻击者与合法用户，模拟多步交互，并以可优化的形式严格表达安全性与可用性之间的权衡。此外，论文引入了Gandalf平台，这是一个众包、游戏化的红队（red-teaming）平台，旨在生成真实且自适应的攻击数据集。通过Gandalf，论文收集并发布了包含279k个提示攻击的数据集，并结合良性用户数据进行分析，揭示了安全性与可用性之间的相互作用。研究表明，集成在LLM中的防御措施（如系统提示）即使不阻止请求，也可能降低可用性。论文进一步证明，限制应用领域、深度防御（defense-in-depth）和自适应防御是构建安全且实用的LLM应用的有效策略。

链接: https://arxiv.org/abs/2501.07927
作者: Niklas Pfister,Václav Volhejn,Manuel Knott,Santiago Arias,Julia Bazińska,Mykhailo Bichurin,Alan Commike,Janet Darling,Peter Dienes,Matthew Fiedler,David Haber,Matthias Kraft,Marco Lancini,Max Mathys,Damián Pascual-Ortiz,Jakub Podolak,Adrià Romero-López,Kyriacos Shiarlis,Andreas Signer,Zsolt Terek,Athanasios Theocharis,Daniel Timbrell,Samuel Trautwein,Samuel Watts,Natalie Wu,Mateo Rojas-Carulla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Niklas Pfister, Václav Volhejn and Manuel Knott contributed equally

点击查看摘要

Abstract:Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and rigorously expresses the security-utility in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attack datasets. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications. Code is available at \hrefthis https URL\textttthis https URL.
zh

[NLP-27] Exploring Aviation Incident Narratives Using Topic Modeling and Clustering Techniques

【速读】：该论文旨在通过分析航空事故报告中的叙述内容，识别潜在的主题和语义关系，并评估概率连接，从而全面理解航空事故的成因。研究利用美国国家运输安全委员会（NTSB）的数据集，应用了多种自然语言处理（NLP）技术，包括潜在狄利克雷分配（LDA）、非负矩阵分解（NMF）、潜在语义分析（LSA）、概率潜在语义分析（pLSA）和K-means聚类。这些技术的核心在于从复杂的文本数据中提取有价值的信息，揭示事故叙述中的潜在模式和主题结构。通过比较不同主题建模技术的表现，研究发现LDA在一致性值上表现最佳（0.597），其次是pLSA（0.583）、LSA（0.542）和NMF（0.437）。K-means聚类进一步揭示了事故叙述中的共性和独特见解。该研究为航空安全提供了新的见解，并为未来研究奠定了基础，包括探索时间模式、整合更多数据集以及开发预测模型以早期识别安全问题。

链接: https://arxiv.org/abs/2501.07924
作者: Aziida Nanyonga,Hassan Wasswa,Ugur Turhan,Keith Joiner,Graham Wild
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aviation safety is a global concern, requiring detailed investigations into incidents to understand contributing factors comprehensively. This study uses the National Transportation Safety Board (NTSB) dataset. It applies advanced natural language processing (NLP) techniques, including Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), and K-means clustering. The main objectives are identifying latent themes, exploring semantic relationships, assessing probabilistic connections, and cluster incidents based on shared characteristics. This research contributes to aviation safety by providing insights into incident narratives and demonstrating the versatility of NLP and topic modelling techniques in extracting valuable information from complex datasets. The results, including topics identified from various techniques, provide an understanding of recurring themes. Comparative analysis reveals that LDA performed best with a coherence value of 0.597, pLSA of 0.583, LSA of 0.542, and NMF of 0.437. K-means clustering further reveals commonalities and unique insights into incident narratives. In conclusion, this study uncovers latent patterns and thematic structures within incident narratives, offering a comparative analysis of multiple-topic modelling techniques. Future research avenues include exploring temporal patterns, incorporating additional datasets, and developing predictive models for early identification of safety issues. This research lays the groundwork for enhancing the understanding and improvement of aviation safety by utilising the wealth of information embedded in incident narratives.
zh

[NLP-28] Aviation Safety Enhancement via NLP Deep Learning: Classifying Flight Phases in ATSB Safety Reports

【速读】：该论文旨在解决航空安全报告中不同飞行阶段（flight phases）的自动分类问题，以提高安全事件分析的精确性和效率。研究采用了自然语言处理（Natural Language Processing, NLP）和深度学习模型，包括长短期记忆网络（LSTM）、卷积神经网络（CNN）、双向长短期记忆网络（BLSTM）和简单循环神经网络（sRNN），对澳大利亚运输安全局（ATSB）的安全报告进行分析。其中，LSTM模型在准确率、精确率、召回率和F1分数上表现最佳，分别达到87%、88%、87%和88%。这些结果表明，NLP与深度学习技术的结合能够有效自动化安全事件分析，从而为航空安全提供更有针对性的措施和更高效的报告处理流程。

链接: https://arxiv.org/abs/2501.07923
作者: Aziida Nanyonga,Hassan Wasswa,Graham Wild
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: NLP, Aviation Safety, ATSB, Deep learning, Flight phase. arXiv admin note: substantial text overlap with arXiv:2501.01694

点击查看摘要

Abstract:Aviation safety is paramount, demanding precise analysis of safety occurrences during different flight phases. This study employs Natural Language Processing (NLP) and Deep Learning models, including LSTM, CNN, Bidirectional LSTM (BLSTM), and simple Recurrent Neural Networks (sRNN), to classify flight phases in safety reports from the Australian Transport Safety Bureau (ATSB). The models exhibited high accuracy, precision, recall, and F1 scores, with LSTM achieving the highest performance of 87%, 88%, 87%, and 88%, respectively. This performance highlights their effectiveness in automating safety occurrence analysis. The integration of NLP and Deep Learning technologies promises transformative enhancements in aviation safety analysis, enabling targeted safety measures and streamlined report handling.
zh

[NLP-29] GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism

【速读】：该论文试图解决传统混合专家网络（Mixture-of-Experts, MoE）中专家模型独立运行的问题，探讨是否通过专家模型之间的互联可以提升MoE网络的性能。为此，作者提出了GRAPHMOE，一种基于伪图MoE网络（Pseudo GraphMoE）的自反思机制，旨在增强语言模型的认知深度。解决方案的关键在于引入了一种循环路由策略（recurrent routing strategy），通过模拟迭代思维步骤，促进专家节点之间的信息流动。此外，GRAPHMOE采用了低秩适应技术（Low-Rank Adaptation, LoRA）来实现其架构，并在多个基准数据集上进行了广泛实验，结果表明GRAPHMOE在性能上优于其他基于LoRA的模型，达到了当前最先进的水平（state-of-the-art, SOTA）。该研究还探索了一种新的循环路由策略，可能为提升语言模型的推理能力提供新的思路。

链接: https://arxiv.org/abs/2501.07890
作者: Chen Tang,Bo Lv,Zifan Zheng,Bohao Yang,Kun Zhao,Ning Liao,Xiaoxing Wang,Feiyu Xiong,Zhiyu Li,Nayu Liu,Jingchi Jiang
机构: Institute for Advanced Algorithms Research, Shanghai(上海高级算法研究所); Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所); The University of Manchester(曼彻斯特大学); The University of Pittsburgh(匹兹堡大学); National Key Laboratory of Smart Farm Technologies and Systems, Harbin Institute of Technology(哈尔滨工业大学智能农场技术与系统国家重点实验室); University of Sydney(悉尼大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Traditional Mixture-of-Experts (MoE) networks benefit from utilizing multiple smaller expert models as opposed to a single large network. However, these experts typically operate independently, leaving a question open about whether interconnecting these models could enhance the performance of MoE networks. In response, we introduce GRAPHMOE, a novel method aimed at augmenting the cognitive depth of language models via a self-rethinking mechanism constructed on Pseudo GraphMoE networks. GRAPHMOE employs a recurrent routing strategy to simulate iterative thinking steps, thereby facilitating the flow of information among expert nodes. We implement the GRAPHMOE architecture using Low-Rank Adaptation techniques (LoRA) and conduct extensive experiments on various benchmark datasets. The experimental results reveal that GRAPHMOE outperforms other LoRA based models, achieving state-of-the-art (SOTA) performance. Additionally, this study explores a novel recurrent routing strategy that may inspire further advancements in enhancing the reasoning capabilities of language models.
zh

[NLP-30] Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision

【速读】：该论文探讨了在不可靠的监督下，语言模型（LM）的后训练（post-training）是否仍然有效的问题。随着语言模型能力的提升，监督任务的复杂性增加，导致监督的可靠性下降。论文通过模拟不可靠的演示和比较反馈，发现监督微调（SFT）在不可靠监督下仍有一定效果，但常用的基于人类反馈的强化学习（RLHF）算法DPO无法在SFT基础上进一步提升模型性能。为解决这一问题，论文提出了迭代标签精炼（ILR）作为RLHF的替代方案。ILR通过使用比较反馈来决定是否用模型生成的替代方案替换人类演示，从而改进SFT数据，并在更新后的数据上重新训练模型。实验表明，在数学、编程和安全指令遵循等任务中，SFT+ILR在不可靠监督下的表现优于SFT+DPO。研究结果表明，在复杂任务中，当人类监督不可靠时，RLHF可能不再是利用人类比较反馈的最佳方式，而应将反馈用于改进训练数据而非持续训练模型。

链接: https://arxiv.org/abs/2501.07886
作者: Yaowen Ye,Cassidy Laidlaw,Jacob Steinhardt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 10 figures

点击查看摘要

Abstract:Language model (LM) post-training relies on two stages of human supervision: task demonstrations for supervised finetuning (SFT), followed by preference comparisons for reinforcement learning from human feedback (RLHF). As LMs become more capable, the tasks they are given become harder to supervise. Will post-training remain effective under unreliable supervision? To test this, we simulate unreliable demonstrations and comparison feedback using small LMs and time-constrained humans. We find that in the presence of unreliable supervision, SFT still retains some effectiveness, but DPO (a common RLHF algorithm) fails to improve the model beyond SFT. To address this, we propose iterative label refinement (ILR) as an alternative to RLHF. ILR improves the SFT data by using comparison feedback to decide whether human demonstrations should be replaced by model-generated alternatives, then retrains the model via SFT on the updated data. SFT+ILR outperforms SFT+DPO on several tasks with unreliable supervision (math, coding, and safe instruction-following). Our findings suggest that as LMs are used for complex tasks where human supervision is unreliable, RLHF may no longer be the best use of human comparison feedback; instead, it is better to direct feedback towards improving the training data rather than continually training the model. Our code and data are available at this https URL.
zh

[NLP-31] Continual Learning with Embedding Layer Surgery and Task-wise Beam Search using Whisper

【速读】：该论文旨在解决多语言自动语音识别（Multilingual ASR）模型在持续学习（Continual Learning, CL）过程中面临的灾难性遗忘（Catastrophic Forgetting, CF）问题。具体来说，现有的CL方法在添加新语言时，忽略了解码器中的词嵌入查找表（token embedding lookup table）的适配问题，而这一部分对CF有显著影响。论文提出了一种称为“嵌入层手术”（Embedding Layer Surgery）的解决方案，即为每个新语言创建独立的词嵌入副本，并在转录相应新语言时选择其中一个副本来替换旧语言的嵌入。然而，这种方法可能导致语言识别（LID）错误，进而引发错误的ASR嵌入选择。为此，论文进一步提出了任务导向的束搜索（Task-wise Beam Search）机制，以自我纠正此类错误。通过在Common Voice数据集上对10种未见语言进行实验，结果表明，该方法在不影响未见语言的平均词错误率（AWER）的情况下，将预训练语言的平均词错误率从14.2%降低至11.9%，优于现有的经验回放（Experience Replay）方法。

链接: https://arxiv.org/abs/2501.07875
作者: Chin Yuen Kwok,Jia Qi Yip,Eng Siong Chng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published in 2024 IEEE Spoken Language Technology Workshop

点击查看摘要

Abstract:Current Multilingual ASR models only support a fraction of the world’s languages. Continual Learning (CL) aims to tackle this problem by adding new languages to pre-trained models while avoiding the loss of performance on existing languages, also known as Catastrophic Forgetting (CF). However, existing CL methods overlook the adaptation of the token embedding lookup table at the decoder, despite its significant contribution to CF. We propose Embedding Layer Surgery where separate copies of the token embeddings are created for each new languages, and one of the copies is selected to replace the old languages embeddings when transcribing the corresponding new language. Unfortunately, this approach means LID errors also cause incorrect ASR embedding selection. Our Task-wise Beam Search allows self-correction for such mistakes. By adapting Whisper to 10 hours of data for each of 10 unseen languages from Common Voice, results show that our method reduces the Average WER (AWER) of pre-trained languages from 14.2% to 11.9% compared with Experience Replay, without compromising the AWER of the unseen languages.
zh

[NLP-32] ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding

【速读】：该论文旨在解决检索增强生成（Retrieval-Augmented Generation, RAG）系统在复杂多步推理任务中的局限性。具体问题包括：缺乏解释性、过程奖励模型（Process Reward Model, PRM）训练数据的偏差、PRM评分中的早期步骤偏差，以及推理潜力的后训练优化不足。为解决这些问题，论文提出了“通过可信过程奖励的检索增强推理”（ReARTeR）框架。该框架的关键在于：在测试时引入可信过程奖励，通过过程奖励模型进行精确的标量评分，并通过过程解释模型（Process Explanation Model, PEM）生成自然语言解释以优化推理步骤；在后训练阶段，利用蒙特卡洛树搜索（Monte Carlo Tree Search）结合可信过程奖励收集高质量的步骤级偏好数据，并通过迭代偏好优化进行优化。ReARTeR通过离策略偏好学习、平衡标注方法以及基于时间差的前瞻搜索策略，分别解决了PRM与PEM之间的不对齐、PRM训练数据偏差和早期步骤偏差等核心挑战。实验结果表明，ReARTeR在多步推理基准测试中显著提升了RAG系统的推理能力。

链接: https://arxiv.org/abs/2501.07861
作者: Zhongxiang Sun,Qipeng Wang,Weijie Yu,Xiaoxue Zang,Kai Zheng,Jun Xu,Xiao Zhang,Song Yang,Han Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems for Large Language Models (LLMs) hold promise in knowledge-intensive tasks but face limitations in complex multi-step reasoning. While recent methods have integrated RAG with chain-of-thought reasoning or test-time search using Process Reward Models (PRMs), these approaches encounter challenges such as a lack of explanations, bias in PRM training data, early-step bias in PRM scores, and insufficient post-training optimization of reasoning potential. To address these issues, we propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR), a framework that enhances RAG systems’ reasoning capabilities through post-training and test-time scaling. At test time, ReARTeR introduces Trustworthy Process Rewarding via a Process Reward Model for accurate scalar scoring and a Process Explanation Model (PEM) for generating natural language explanations, enabling step refinement. During post-training, it utilizes Monte Carlo Tree Search guided by Trustworthy Process Rewarding to collect high-quality step-level preference data, optimized through Iterative Preference Optimization. ReARTeR addresses three core challenges: (1) misalignment between PRM and PEM, tackled through off-policy preference learning; (2) bias in PRM training data, mitigated by balanced annotation methods and stronger annotations for challenging examples; and (3) early-step bias in PRM, resolved through a temporal-difference-based look-ahead search strategy. Experimental results on multi-step reasoning benchmarks demonstrate significant improvements, underscoring ReARTeR’s potential to advance the reasoning capabilities of RAG systems.
zh

[NLP-33] Optimizing Language Models for Grammatical Acceptability: A Comparative Study of Fine-Tuning Techniques

【速读】：该论文探讨了如何通过微调（Fine-Tuning, FT）Open Pre-trained Transformer (OPT-125M) 模型来提升其在语法可接受性任务（使用CoLA数据集）中的表现。研究的关键在于比较了多种微调方法，包括传统的Vanilla-Fine-Tuning (VFT)、基于模式的微调（Pattern-Based-Fine-Tuning, PBFT）以及参数高效的微调技术（Parameter-Efficient Fine-Tuning, PEFT），如低秩适应（Low-Rank Adaptation, LoRA）。研究结果表明，虽然VFT在准确率上表现最佳（81.2%），但LoRA在减少内存使用和迭代时间方面显著提升了计算效率（减少超过50%），并在PBFT情况下提高了准确率。尽管上下文蒸馏（Context Distillation, CD）在计算效率上表现良好，但其准确率较低（约31%）。这些发现有助于通过降低计算门槛，使更多人能够使用大型语言模型（LLM）。

链接: https://arxiv.org/abs/2501.07853
作者: Shobhit Ratan,Farley Knight,Ghada Jerfel,Sze Chung Ho
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study explores the fine-tuning (FT) of the Open Pre-trained Transformer (OPT-125M) for grammatical acceptability tasks using the CoLA dataset. By comparing Vanilla-Fine-Tuning (VFT), Pattern-Based-Fine-Tuning (PBFT), and Parameter-Efficient Fine-Tuning techniques (PEFT) like Low-Rank Adaptation (LoRA), we demonstrate significant improvements in computational efficiency while maintaining high accuracy. Our experiments reveal that while VFT achieves the highest accuracy (81.2%), LoRA enhancing FT by reducing memory usage and iteration time by more than 50%, and increases accuracy in PBFT case. Context Distillation (CD), though computationally efficient, underperformed with accuracy around 31%. Our findings contribute to democratizing access to large language models (LLM) by reducing computational barriers.
zh

[NLP-34] Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLM s Reasoning

【速读】：该论文试图解决大语言模型（LLMs）在处理需要理解和推断文本序列中不同信息之间关系的推理任务时所面临的挑战，特别是在涉及多步过程的逻辑推理和多跳问答任务中。这些任务要求模型理解实体之间的隐含关系并利用上下文中的多跳连接，而现有的大语言模型在这些任务上表现不佳。论文提出的解决方案关键在于通过从上下文中构建显式图（explicit graphs），并利用这些图来增强大语言模型的推理能力。具体来说，论文提出了“基于图的推理”（Reasoning with Graphs, RwG）方法，通过将上下文中的隐含知识结构化为图，从而提升大语言模型在逻辑推理和多跳问答任务中的表现。实验结果表明，该方法在提升推理性能方面具有显著效果。

链接: https://arxiv.org/abs/2501.07845
作者: Haoyu Han,Yaochen Xie,Hui Liu,Xianfeng Tang,Sreyashi Nag,William Headden,Hui Liu,Yang Li,Chen Luo,Shuiwang Ji,Qi He,Jiliang Tang
机构: Michigan State University(密歇根州立大学); Amazon(亚马逊); Texas A&M University(德州农工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable success across a wide range of tasks; however, they still encounter challenges in reasoning tasks that require understanding and inferring relationships between distinct pieces of information within text sequences. This challenge is particularly pronounced in tasks involving multi-step processes, such as logical reasoning and multi-hop question answering, where understanding implicit relationships between entities and leveraging multi-hop connections in the given context are crucial. Graphs, as fundamental data structures, explicitly represent pairwise relationships between entities, thereby offering the potential to enhance LLMs’ reasoning capabilities. External graphs have proven effective in supporting LLMs across multiple tasks. However, in many reasoning tasks, no pre-existing graph structure is provided. Can we structure implicit knowledge derived from context into graphs to assist LLMs in reasoning? In this paper, we propose Reasoning with Graphs (RwG) by first constructing explicit graphs from the context and then leveraging these graphs to enhance LLM reasoning performance on reasoning tasks. Extensive experiments demonstrate the effectiveness of the proposed method in improving both logical reasoning and multi-hop question answering tasks.
zh

[NLP-35] Social Media Data Mining With Natural Language Processing on Public Dream Contents

【速读】：该论文旨在探讨COVID-19大流行对心理健康的影响，特别是通过分析Reddit r/Dreams社区中用户分享的梦境内容来揭示大流行期间人们潜意识中的心理变化。研究的关键解决方案包括使用统计方法评估从大流行前到大流行后梦境内容的积极性、消极性和中立性的变化，并通过微调LLaMA 3.1-8B模型（LLaMA 3.1-8B model）对梦境内容进行精确的情感分类。这一方法使得研究者能够深入分析梦境内容中的情感模式，从而揭示大流行对公众心理健康的潜在影响，并探讨梦境作为公共福祉指标的作用。

链接: https://arxiv.org/abs/2501.07839
作者: Howard Hua,Joe Yu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 16 pages, 6 figures

点击查看摘要

Abstract:The COVID-19 pandemic has significantly transformed global lifestyles, enforcing physical isolation and accelerating digital adoption for work, education, and social interaction. This study examines the pandemic’s impact on mental health by analyzing dream content shared on the Reddit r/Dreams community. With over 374,000 subscribers, this platform offers a rich dataset for exploring subconscious responses to the pandemic. Using statistical methods, we assess shifts in dream positivity, negativity, and neutrality from the pre-pandemic to post-pandemic era. To enhance our analysis, we fine-tuned the LLaMA 3.1-8B model with labeled data, enabling precise sentiment classification of dream content. Our findings aim to uncover patterns in dream content, providing insights into the psychological effects of the pandemic and its influence on subconscious processes. This research highlights the profound changes in mental landscapes and the role of dreams as indicators of public well-being during unprecedented times.
zh

[NLP-36] Real-time Verification and Refinement of Language Model Text Generation

【速读】：该论文试图解决大语言模型（LLMs）在生成自然语言任务中有时会产生事实性错误的问题。尽管已有许多研究致力于识别和修正这些错误，但这些方法通常需要在模型生成完整响应（从第一个到最后一个token）后才能进行验证，导致部署效率较低。此外，论文指出，一旦LLMs在早期生成错误的token，后续token也更容易出现事实性错误。为此，论文提出了一种名为Streaming-VR（流式验证与修正）的新方法，其关键创新在于能够在token生成过程中实时进行验证和修正，类似于流式处理。通过这种方式，Streaming-VR能够在LLM构建响应的同时，实时检查和修正每个token子集，从而显著提高事实准确性，并比现有修正方法更高效。

链接: https://arxiv.org/abs/2501.07824
作者: Joonho Ko,Jinheon Baek,Sung Ju Hwang
机构: KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks. However, a critical challenge remains in that they sometimes generate factually incorrect answers. To address this, while many previous work has focused on identifying errors in their generation and further refining them, they are slow in deployment since they are designed to verify the response from LLMs only after their entire generation (from the first to last tokens) is done. Further, we observe that once LLMs generate incorrect tokens early on, there is a higher likelihood that subsequent tokens will also be factually incorrect. To this end, in this work, we propose Streaming-VR (Streaming Verification and Refinement), a novel approach designed to enhance the efficiency of verification and refinement of LLM outputs. Specifically, the proposed Streaming-VR enables on-the-fly verification and correction of tokens as they are being generated, similar to a streaming process, ensuring that each subset of tokens is checked and refined in real-time by another LLM as the LLM constructs its response. Through comprehensive evaluations on multiple datasets, we demonstrate that our approach not only enhances the factual accuracy of LLMs, but also offers a more efficient solution compared to prior refinement methods.
zh

[NLP-37] A Multi-Encoder Frozen-Decoder Approach for Fine-Tuning Large Language Models

【速读】：该论文探讨了在多任务设置中冻结解码器（freezing decoders）对模型性能的影响，旨在减少部署开销并增强模型在新任务上的可移植性。研究通过在AlexaTM模型上进行单任务和多任务的微调实验，发现冻结解码器在自然语言输出任务中非常有效，并且能够减轻多语言任务中的灾难性遗忘（catastrophic forgetting）问题。关键解决方案在于，冻结解码器不仅可以加速训练和减少灾难性遗忘，还能通过与更大的模型结合，在结构化和问答任务中保持甚至提升性能，从而使其适用于更广泛的任务类型。

链接: https://arxiv.org/abs/2501.07818
作者: Kaustubh D. Dhole
机构: Emory University (埃默里大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Among parameter-efficient fine-tuning methods, freezing has emerged as a popular strategy for speeding up training, reducing catastrophic forgetting, and improving downstream performance. We investigate the impact of freezing the decoder in a multi-task setup comprising diverse natural language tasks, aiming to reduce deployment overhead and enhance portability to novel tasks. Our experiments, conducted by fine-tuning both individual and multi-task setups on the AlexaTM model, reveal that freezing decoders is highly effective for tasks with natural language outputs and mitigates catastrophic forgetting in multilingual tasks. However, we find that pairing frozen decoders with a larger model can effectively maintain or even enhance performance in structured and QA tasks, making it a viable strategy for a broader range of task types.
zh

[NLP-38] Agent -Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models

【速读】：该论文试图解决的问题是缺乏一个系统框架来表征和比较大语言模型（LLMs）中的提示技术（prompting techniques），并理解这些技术与多代理系统（multi-agent systems）之间的关系。论文提出了线性上下文（linear contexts）和非线性上下文（non-linear contexts）的概念，分别指代单一连续交互序列和分支或多路径交互序列。基于这些概念，论文提出了一个以代理为中心的提示技术投影框架，旨在揭示提示策略与多代理系统之间的深层联系。解决方案的关键在于通过这一框架提出三个假设：（1）非线性提示技术的结果可以预测等效多代理系统的结果；（2）多代理系统架构可以通过模拟等效交互模式的单一LLM提示技术来复制；（3）这些等效性为生成合成训练数据提供了新的方法。这一视角为提示技术和多代理领域的研究成果提供了系统性的交叉融合，并为未来LLM系统的设计和训练提供了新的方向。

链接: https://arxiv.org/abs/2501.07815
作者: Dhruv Dhamani,Mary Lou Maher
机构: University of North Carolina, Charlotte(北卡罗来纳大学夏洛特分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 8 pages, 5 figures. Accepted at ICAART 2025. Derived from an early draft at 2312.17601 . arXiv admin note: substantial text overlap with arXiv:2312.17601

点击查看摘要

Abstract:Recent advances in prompting techniques and multi-agent systems for Large Language Models (LLMs) have produced increasingly complex approaches. However, we lack a framework for characterizing and comparing prompting techniques or understanding their relationship to multi-agent LLM systems. This position paper introduces and explains the concepts of linear contexts (a single, continuous sequence of interactions) and non-linear contexts (branching or multi-path) in LLM systems. These concepts enable the development of an agent-centric projection of prompting techniques, a framework that can reveal deep connections between prompting strategies and multi-agent systems. We propose three conjectures based on this framework: (1) results from non-linear prompting techniques can predict outcomes in equivalent multi-agent systems, (2) multi-agent system architectures can be replicated through single-LLM prompting techniques that simulate equivalent interaction patterns, and (3) these equivalences suggest novel approaches for generating synthetic training data. We argue that this perspective enables systematic cross-pollination of research findings between prompting and multi-agent domains, while providing new directions for improving both the design and training of future LLM systems.
zh

[NLP-39] alk to Right Specialists: Routing and Planning in Multi-agent System for Question Answering

【速读】：该论文试图解决当前基于检索增强生成（RAG）的智能体在处理跨领域查询时存在的局限性，特别是单一领域知识源导致的幻觉或不准确响应问题。此外，论文还探讨了将多个知识库集成到统一RAG系统中的挑战，如检索开销增加和数据主权问题。解决方案的关键在于提出了RopMura系统，该系统通过引入高效的路由机制（router）和规划机制（planner）来克服这些限制。路由机制能够根据知识边界智能选择最相关的智能体，而规划机制则将复杂的多跳查询分解为可管理的步骤，从而协调跨领域响应。实验结果表明，RopMura能够有效处理单跳和多跳查询，路由机制确保单跳查询的精确响应，而路由与规划机制的结合则实现了复杂查询的准确多步解析。

链接: https://arxiv.org/abs/2501.07813
作者: Feijie Wu,Zitao Li,Fei Wei,Yaliang Li,Bolin Ding,Jing Gao
机构: Purdue University(普渡大学); Alibaba Group(阿里巴巴集团)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Work In Progress

点击查看摘要

Abstract:Leveraging large language models (LLMs), an agent can utilize retrieval-augmented generation (RAG) techniques to integrate external knowledge and increase the reliability of its responses. Current RAG-based agents integrate single, domain-specific knowledge sources, limiting their ability and leading to hallucinated or inaccurate responses when addressing cross-domain queries. Integrating multiple knowledge bases into a unified RAG-based agent raises significant challenges, including increased retrieval overhead and data sovereignty when sensitive data is involved. In this work, we propose RopMura, a novel multi-agent system that addresses these limitations by incorporating highly efficient routing and planning mechanisms. RopMura features two key components: a router that intelligently selects the most relevant agents based on knowledge boundaries and a planner that decomposes complex multi-hop queries into manageable steps, allowing for coordinating cross-domain responses. Experimental results demonstrate that RopMura effectively handles both single-hop and multi-hop queries, with the routing mechanism enabling precise answers for single-hop queries and the combined routing and planning mechanisms achieving accurate, multi-step resolutions for complex queries.
zh

[NLP-40] Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

【速读】：该论文试图解决当前图像金字塔（Image Pyramids）在处理多尺度图像时使用相同的大规模模型所导致的高计算成本问题。为了解决这一问题，作者提出了一种新颖的网络架构，称为参数倒置图像金字塔网络（Parameter-Inverted Image Pyramid Networks, PIIP）。PIIP的关键在于使用预训练模型（如ViTs或CNNs）作为分支来处理多尺度图像，其中高分辨率图像由较小的网络分支处理，以平衡计算成本和性能。此外，作者还提出了一种跨分支特征交互机制，以整合来自不同空间尺度的信息。通过在各种感知模型和多模态大语言模型（如LLaVA）上的实验验证，PIIP在降低计算成本的同时，显著提升了目标检测、分割、图像分类和多模态理解等任务的性能。

链接: https://arxiv.org/abs/2501.07783
作者: Zhaokai Wang,Xizhou Zhu,Xue Yang,Gen Luo,Hao Li,Changyao Tian,Wenhan Dou,Junqi Ge,Lewei Lu,Yu Qiao,Jifeng Dai
机构: Shanghai Jiao Tong University(上海交通大学); Tsinghua University(清华大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); The Chinese University of Hong Kong(香港中文大学); Sensetime(商汤科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at this https URL.
zh

[NLP-41] Large Language Models for Knowledge Graph Embedding Techniques Methods and Challenges: A Survey

【速读】：该论文主要探讨了如何在不同类型的知识图谱嵌入（Knowledge Graph Embedding, KGE）场景中应用大语言模型（Large Language Models, LLMs），以提升处理效果。随着LLMs在自然语言处理（Natural Language Processing, NLP）领域的卓越表现，它们被越来越多地应用于多模态KGE和开放KGE等任务中。论文的关键解决方案包括：1）对不同类型的KGE场景进行分类，以便更好地比较各种方法；2）提供方法的表格概述及其源代码链接，便于直接比较；3）讨论这些方法的主要应用场景，并提出该新兴研究领域的若干前瞻性发展方向。通过这些措施，论文旨在为LLMs在KGE相关任务中的应用提供系统化的指导和参考。

链接: https://arxiv.org/abs/2501.07766
作者: Bingchen Liu,Xin Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have attracted a lot of attention in various fields due to their superior performance, aiming to train hundreds of millions or more parameters on large amounts of text data to understand and generate natural language. As the superior performance of LLMs becomes apparent, they are increasingly being applied to knowledge graph embedding (KGE) related tasks to improve the processing results. As a deep learning model in the field of Natural Language Processing (NLP), it learns a large amount of textual data to predict the next word or generate content related to a given text. However, LLMs have recently been invoked to varying degrees in different types of KGE related scenarios such as multi-modal KGE and open KGE according to their task characteristics. In this paper, we investigate a wide range of approaches for performing LLMs-related tasks in different types of KGE scenarios. To better compare the various approaches, we summarize each KGE scenario in a classification. In addition to the categorization methods, we provide a tabular overview of the methods and their source code links for a more direct comparison. In the article we also discuss the applications in which the methods are mainly used and suggest several forward-looking directions for the development of this new research area.
zh

[NLP-42] A Heterogeneous Multimodal Graph Learning Framework for Recognizing User Emotions in Social Networks

【速读】：该论文旨在解决社交媒体平台中用户情感预测的问题，特别是如何利用多模态用户生成内容（multimodal user-generated content）来提升情感计算的准确性。尽管情感计算（Affective Computing）领域已取得显著进展，但社交媒体中影响用户情感的多样化因素仍相对缺乏深入研究，且现有方法中缺乏基于深度学习的情感预测模型。为此，论文提出了一种基于异质图学习（heterogeneous graph learning）的个性化情感预测新方法，并设计了HMG-Emo框架。该框架通过深度学习提取特征，并结合动态上下文融合模块（dynamic context fusion module），能够自适应地整合社交媒体数据中的多模态信息。实验结果表明，HMG-Emo在情感预测任务中优于现有基于手工特征的方法，验证了图神经网络（graph neural network）在该领域的优越性。该研究强调了利用先进深度学习技术解决情感计算中尚未充分探索问题的重要性。

链接: https://arxiv.org/abs/2501.07746
作者: Sree Bhattacharyya,Shuhua Yang,James Z. Wang
机构: College of Information Sciences and Technology, The Pennsylvania State University, University Park (信息科学与技术学院, 宾夕法尼亚州立大学, 大学公园)
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid expansion of social media platforms has provided unprecedented access to massive amounts of multimodal user-generated content. Comprehending user emotions can provide valuable insights for improving communication and understanding of human behaviors. Despite significant advancements in Affective Computing, the diverse factors influencing user emotions in social networks remain relatively understudied. Moreover, there is a notable lack of deep learning-based methods for predicting user emotions in social networks, which could be addressed by leveraging the extensive multimodal data available. This work presents a novel formulation of personalized emotion prediction in social networks based on heterogeneous graph learning. Building upon this formulation, we design HMG-Emo, a Heterogeneous Multimodal Graph Learning Framework that utilizes deep learning-based features for user emotion recognition. Additionally, we include a dynamic context fusion module in HMG-Emo that is capable of adaptively integrating the different modalities in social media data. Through extensive experiments, we demonstrate the effectiveness of HMG-Emo and verify the superiority of adopting a graph neural network-based approach, which outperforms existing baselines that use rich hand-crafted features. To the best of our knowledge, HMG-Emo is the first multimodal and deep-learning-based approach to predict personalized emotions within online social networks. Our work highlights the significance of exploiting advanced deep learning techniques for less-explored problems in Affective Computing.
zh

[NLP-43] Advancing Student Writing Through Automated Syntax Feedback

【速读】：该论文旨在解决学生在掌握英语句法（syntax）细节时面临的挑战，特别是如何通过句法反馈（syntax feedback）提升学生的句法能力。研究的关键解决方案是引入了一个专门设计的数据集 Essay-Syntax-Instruct，并通过微调（fine-tuning）多个大型语言模型（LLMs），如 GPT3.5-Turbo、Llama-2-7b-chat-hf、Llama-2-13b-chat-hf 和 Mistral-7B-Instruct-v0.2，来增强这些模型在句法改进任务中的表现。研究结果表明，经过微调的 LLMs 在识别和纠正句法错误方面表现出显著提升，从而为学生的语言学习提供了有效的工具。这一研究不仅展示了数据集在提升 LLMs 句法处理能力方面的有效性，还为利用先进语言模型支持语言学习提供了新的方向。

链接: https://arxiv.org/abs/2501.07740
作者: Kamyar Zeinalipour,Mehak Mehak,Fatemeh Parsamotamed,Marco Maggini,Marco Gori
机构: 未知
类目: Computation and Language (cs.CL)
备注: This paper has been accepted for presentation at AIEER 2024

点击查看摘要

Abstract:This study underscores the pivotal role of syntax feedback in augmenting the syntactic proficiency of students. Recognizing the challenges faced by learners in mastering syntactic nuances, we introduce a specialized dataset named Essay-Syntax-Instruct designed to enhance the understanding and application of English syntax among these students. Leveraging the capabilities of Large Language Models (LLMs) such as GPT3.5-Turbo, Llama-2-7b-chat-hf, Llama-2-13b-chat-hf, and Mistral-7B-Instruct-v0.2, this work embarks on a comprehensive fine-tuning process tailored to the syntax improvement task. Through meticulous evaluation, we demonstrate that the fine-tuned LLMs exhibit a marked improvement in addressing syntax-related challenges, thereby serving as a potent tool for students to identify and rectify their syntactic errors. The findings not only highlight the effectiveness of the proposed dataset in elevating the performance of LLMs for syntax enhancement but also illuminate a promising path for utilizing advanced language models to support language acquisition efforts. This research contributes to the broader field of language learning technology by showcasing the potential of LLMs in facilitating the linguistic development of Students.
zh

[NLP-44] Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech

【速读】：该论文旨在探讨卷积神经网络（CNN）中全连接层（FC layer）在语音合成任务中如何编码与语言学相关的信息。尽管已有研究关注卷积层在计算机视觉和音频领域的潜在空间与输出之间的对应关系，但全连接层在声学和语言学信息表示方面的作用尚未得到充分研究。论文提出了两种技术来探索全连接层：实验1将权重矩阵作为卷积层的输入，实验2通过操纵全连接层来研究符号化表示在CNN中的编码方式。关键解决方案在于利用全连接层输出的特征图和时间结构化的权重矩阵，展示学习权重的分布如何在潜在变量之间系统性地变化，并通过操纵全连接层来影响输出。最终，论文展示了一种能够输出单一语音片段的全连接层操纵技术，揭示了生成式CNN（ciwGAN）中的词汇特定潜在代码在全连接层权重中共享词汇不变性子词汇表示，表明ciwGAN以语言学原则的方式编码词汇信息。

链接: https://arxiv.org/abs/2501.07726
作者: Bruno Ferenc Šegedin,Gasper Beguš
机构: Brown University (布朗大学); University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Interpretability work on the convolutional layers of CNNs has primarily focused on computer vision, but some studies also explore correspondences between the latent space and the output in the audio domain. However, it has not been thoroughly examined how acoustic and linguistic information is represented in the fully connected (FC) layer that bridges the latent space and convolutional layers. The current study presents the first exploration of how the FC layer of CNNs for speech synthesis encodes linguistically relevant information. We propose two techniques for exploration of the fully connected layer. In Experiment 1, we use weight matrices as inputs into convolutional layers. In Experiment 2, we manipulate the FC layer to explore how symbolic-like representations are encoded in CNNs. We leverage the fact that the FC layer outputs a feature map and that variable-specific weight matrices are temporally structured to (1) demonstrate how the distribution of learned weights varies between latent variables in systematic ways and (2) demonstrate how manipulating the FC layer while holding constant subsequent model parameters affects the output. We ultimately present an FC manipulation that can output a single segment. Using this technique, we show that lexically specific latent codes in generative CNNs (ciwGAN) have shared lexically invariant sublexical representations in the FC-layer weights, showing that ciwGAN encodes lexical information in a linguistically principled manner.
zh

[NLP-45] ESURF: Simple and Effective EDU Segmentation

【速读】：该论文旨在解决将文本分割为基本话语单元（Elemental Discourse Units, EDUs）的问题，这是话语解析（discourse parsing）中的一项基础任务。论文提出了一种基于词汇和字符n-gram特征（lexical and character n-gram features）的随机森林分类（random forest classification）方法，用于识别EDU边界并进行分割。该方法的关键在于其简单性，尽管方法简单，但在分割任务中表现优于其他方法，并且在一个先进的话语解析器中也有优异表现。这表明词汇和字符n-gram特征在识别基本话语元素中的重要性，为话语分析提供了潜在的高效训练方法。

链接: https://arxiv.org/abs/2501.07723
作者: Mohammadreza Sediqin,Shlomo Engelson Argamon
机构: Illinois Institute of Technology(伊利诺伊理工学院); Touro University(图罗大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Segmenting text into Elemental Discourse Units (EDUs) is a fundamental task in discourse parsing. We present a new simple method for identifying EDU boundaries, and hence segmenting them, based on lexical and character n-gram features, using random forest classification. We show that the method, despite its simplicity, outperforms other methods both for segmentation and within a state of the art discourse parser. This indicates the importance of such features for identifying basic discourse elements, pointing towards potentially more training-efficient methods for discourse analysis.
zh

[NLP-46] LLM ic: Romanian Foundation Language Model

【速读】：该论文试图解决低资源语言（low-resource languages）在大型语言模型（Large Language Models, LLMs）中表现不佳的问题，特别是针对罗马尼亚语（Romanian Language）。由于低资源语言在训练语料库中的代表性有限，开放模型（open models）通常在这些语言上的表现较差。论文提出的解决方案是开发一个专门为罗马尼亚语设计的双语基础语言模型（bilingual foundation language model），称为LLMic。其关键步骤包括语料库构建（corpus construction）、架构选择（architecture selection）和超参数优化（hyper-parameter optimization）。通过预训练（pretraining）和微调（fine-tuning），LLMic在罗马尼亚语任务上表现优异，特别是在英语到罗马尼亚语的翻译任务中，超越了现有解决方案。这一成果为罗马尼亚语社区提供了高效的大规模处理能力，且使用的模型规模较小。

链接: https://arxiv.org/abs/2501.07721
作者: Vlad-Andrei Bădoiu,Mihai-Valentin Dumitru,Alexandru M. Gherghescu,Alexandru Agache,Costin Raiciu
机构: University Politehnica of Bucharest(布加勒斯特理工大学); Broadcom Inc.(博通公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks with commercial models leading the way. While open models usually operate at a smaller scale, they maintain competitiveness through specialization and fine-tuning. However, a significant challenge persists: open models often underperform in low-resource languages due to limited representation in the training corpus. In this paper, we present LLMic, a bilingual foundation language model designed specifically for the Romanian Language. We document the complete process of pretraining a foundation model for a low-resource language, including corpus construction, architecture selection, and hyper-parameter optimization. Our evaluation demonstrates that LLMic can be specialized for tasks in the target language, achieving results comparable to other much larger open models. We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks. This opens the path for efficient large-scale processing for the Romanian language community, using the much smaller LLMic model
zh

[NLP-47] Entailed Between the Lines: Incorporating Implication into NLI

【速读】：该论文试图解决自然语言推理（Natural Language Inference, NLI）模型在处理隐含蕴含（implied entailment）时的不足问题。隐含蕴含指的是文本中未明确表达但通过上下文或背景知识可以推断出的逻辑关系。现有的NLI模型和数据集在处理这类隐含蕴含时表现不佳，难以准确识别和区分隐含与显式蕴含。论文的关键解决方案是提出了一个扩展的NLI任务，即隐含NLI（Implied NLI, INLI），并引入了INLI数据集。通过在该数据集上对大型语言模型（LLMs）进行微调，模型能够更好地识别和理解各种隐含蕴含，并将这种理解能力推广到其他数据集和领域。

链接: https://arxiv.org/abs/2501.07719
作者: Shreya Havaldar,Hamidreza Alvari,Alex Fabrikant,John Palowitch,Mohammad Javad Hosseini,Senaka Buthpitiya
机构: University of Pennsylvania(宾夕法尼亚大学); Google Deepmind(谷歌 Deepmind)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Much of human communication depends on implication, conveying meaning beyond literal words to express a wider range of thoughts, intentions, and feelings. For models to better understand and facilitate human communication, they must be responsive to the text’s implicit meaning. We focus on Natural Language Inference (NLI), a core tool for many language tasks, and find that state-of-the-art NLI models and datasets struggle to recognize a range of cases where entailment is implied, rather than explicit from the text. We formalize implied entailment as an extension of the NLI task and introduce the Implied NLI dataset (INLI) to help today’s LLMs both recognize a broader variety of implied entailments and to distinguish between implicit and explicit entailment. We show how LLMs fine-tuned on INLI understand implied entailment and can generalize this understanding across datasets and domains.
zh

[NLP-48] Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles

【速读】：该论文旨在解决挪威语新闻文章的高质量摘要生成问题，特别是为生成式语言模型（Generative Language Models, LLMs）的抽象摘要能力提供一个基准测试数据集。解决方案的关键在于创建了一个包含挪威语新闻文章及其对应摘要的数据集，每篇文章都提供了三种由母语为挪威语的人撰写的高质量摘要，并且这些摘要以挪威语的两种书面形式——Bokmål和Nynorsk提供。通过该数据集，论文评估了现有的挪威语开放LLMs的摘要生成能力，并通过人工评估比较了人工撰写和模型生成的摘要。结果表明，该数据集为挪威语摘要生成提供了一个具有挑战性的基准。

链接: https://arxiv.org/abs/2501.07718
作者: Samia Touileb,Vladislav Mikhailov,Marie Kroka,Lilja Øvrelid,Erik Velldal
机构: University of Bergen(卑尔根大学); University of Oslo(奥斯陆大学)
类目: Computation and Language (cs.CL)
备注: Accepted at NoDaLiDa2025

点击查看摘要

Abstract:We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian. The dataset is intended for benchmarking the abstractive summarisation capabilities of generative language models. Each document in the dataset is provided with three different candidate gold-standard summaries written by native Norwegian speakers, and all summaries are provided in both of the written variants of Norwegian – Bokmål and Nynorsk. The paper describes details on the data creation effort as well as an evaluation of existing open LLMs for Norwegian on the dataset. We also provide insights from a manual human evaluation, comparing human-authored to model-generated summaries. Our results indicate that the dataset provides a challenging LLM benchmark for Norwegian summarisation capabilities
zh

[NLP-49] A Survey of Early Exit Deep Neural Networks in NLP

【速读】：该论文试图解决深度神经网络（DNNs）在资源受限应用中的高计算需求问题，以及如何应对现实数据集中样本复杂度不一的情况。解决方案的关键在于采用早期退出策略（early exit strategies），通过在DNN的不同层附加分类器，使得简单样本可以在较早的层被分类，从而加速整体推理过程。这种方法不仅减少了推理延迟，还提高了模型对抗对抗性攻击的鲁棒性。论文全面综述了早期退出方法及其在自然语言处理（NLP）中的应用。

链接: https://arxiv.org/abs/2501.07670
作者: Divya Jyoti Bajpai,Manjesh Kumar Hanawal
机构: Department of IEOR, IIT Bombay (印度理工学院孟买分校工业工程与运筹学系)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have grown increasingly large in size to achieve state of the art performance across a wide range of tasks. However, their high computational requirements make them less suitable for resource-constrained applications. Also, real-world datasets often consist of a mixture of easy and complex samples, necessitating adaptive inference mechanisms that account for sample difficulty. Early exit strategies offer a promising solution by enabling adaptive inference, where simpler samples are classified using the initial layers of the DNN, thereby accelerating the overall inference process. By attaching classifiers at different layers, early exit methods not only reduce inference latency but also improve the model robustness against adversarial attacks. This paper presents a comprehensive survey of early exit methods and their applications in NLP.
zh

[NLP-50] Enhancing Talent Employment Insights Through Feature Extraction with LLM Finetuning

【速读】：该论文旨在解决从非结构化职位描述中提取复杂和细微职位特征的问题，特别是传统解析工具在处理这些特征时存在的局限性。论文通过使用AdeptID提供的120万条职位描述数据集，开发了一个强大的处理流程，能够识别和分类诸如远程工作可用性、薪酬结构、教育要求和工作经验偏好等变量。解决方案的关键在于结合语义分块（semantic chunking）、检索增强生成（retrieval-augmented generation, RAG）以及微调DistilBERT模型，从而显著提高了对常被误标或忽略的变量（如非薪酬类补偿和推断的远程工作类别）的识别能力。该方法为劳动力市场分析提供了更准确和可操作的见解，展示了大型语言模型（LLMs）在该领域的潜力。

链接: https://arxiv.org/abs/2501.07663
作者: Karishma Thakrar,Nick Young
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper explores the application of large language models (LLMs) to extract nuanced and complex job features from unstructured job postings. Using a dataset of 1.2 million job postings provided by AdeptID, we developed a robust pipeline to identify and classify variables such as remote work availability, remuneration structures, educational requirements, and work experience preferences. Our methodology combines semantic chunking, retrieval-augmented generation (RAG), and fine-tuning DistilBERT models to overcome the limitations of traditional parsing tools. By leveraging these techniques, we achieved significant improvements in identifying variables often mislabeled or overlooked, such as non-salary-based compensation and inferred remote work categories. We present a comprehensive evaluation of our fine-tuned models and analyze their strengths, limitations, and potential for scaling. This work highlights the promise of LLMs in labor market analytics, providing a foundation for more accurate and actionable insights into job data.
zh

[NLP-51] GPT as a Monte Carlo Language Tree: A Probabilistic Perspective

【速读】：该论文试图解决大语言模型（LLMs）在自然语言处理（NLP）任务中如何通过预测下一个词元（token）来学习大规模网络爬取数据集中的潜在分布的问题。尽管这种潜在分布建模机制被广泛使用，但其定量理解和分析仍然不足。论文提出了一种新颖的视角，即将任何语言数据集表示为一个蒙特卡洛语言树（Monte Carlo Language Tree，简称“Data-Tree”），其中每个节点代表一个词元，每条边代表词元之间的转移概率，每个序列都有唯一的路径。类似地，任何GPT类语言模型也可以被扁平化为另一个蒙特卡洛语言树（简称“GPT-Tree”）。通过实验，论文发现不同GPT模型在相同数据集上训练后，其GPT-Tree在可视化上表现出显著的结构相似性，且更大的模型更接近Data-Tree。超过87%的GPT输出词元可以通过Data-Tree召回。这些发现表明，LLMs的推理过程更可能是概率模式匹配，而非形式推理，因为每个模型推断似乎都是从Data-Tree中找到具有最大概率的上下文模式。此外，论文还深入探讨了LLMs中的幻觉（hallucination）、链式思维推理（Chain-of-Thought reasoning）和词元偏差（token bias）等问题。

链接: https://arxiv.org/abs/2501.07641
作者: Kun-Peng Ning,Jia-Yu Yao,Yu-Yang Liu,Mu-Nan Ning,Li Yuan
机构: School of Electronic and Computer Engineering, Peking University (北京大学电子与计算机工程学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), such as GPT, are considered to learn the latent distributions within large-scale web-crawl datasets and accomplish natural language processing (NLP) tasks by predicting the next token. However, this mechanism of latent distribution modeling lacks quantitative understanding and analysis. In this paper, we propose a novel perspective that any language dataset can be represented by a Monte Carlo Language Tree (abbreviated as Data-Tree''), where each node denotes a token, each edge denotes a token transition probability, and each sequence has a unique path. Any GPT-like language model can also be flattened into another Monte Carlo Language Tree (abbreviated as GPT-Tree’'). Our experiments show that different GPT models trained on the same dataset exhibit significant structural similarity in GPT-Tree visualization, and larger models converge more closely to the Data-Tree. More than 87% GPT output tokens can be recalled by Data-Tree. These findings may confirm that the reasoning process of LLMs is more likely to be probabilistic pattern-matching rather than formal reasoning, as each model inference seems to find a context pattern with maximum probability from the Data-Tree. Furthermore, we provide deeper insights into issues such as hallucination, Chain-of-Thought (CoT) reasoning, and token bias in LLMs.
zh

[NLP-52] Optimize Incompatible Parameters through Compatibility-aware Knowledge Integration AAAI’25 AAAI

【速读】：该论文试图解决深度神经网络（Deep Neural Networks, DNNs）中存在的参数不兼容性问题，这些不兼容参数可能导致模型性能下降，尤其是在面对特定且多变的数据分布时。现有的研究主要集中在通过移除这些参数或合并多个预训练模型的输出来解决问题，但这些方法要么侧重于效率而非性能，要么需要额外的计算和存储资源来支持推理。本文提出了一种名为“兼容性感知知识集成”（Compatibility-aware Knowledge Integration, CKI）的解决方案，其核心在于通过评估多个模型的参数兼容性（Parameter Compatibility Assessment）并将这些知识整合到一个模型中（Parameter Splicing），从而直接优化不兼容参数，提升模型性能，而无需增加推理成本。实验结果表明，CKI能够在多种任务和设置下有效优化不兼容参数，突破原始模型的训练限制。

链接: https://arxiv.org/abs/2501.07596
作者: Zheqi Lv,Keming Ye,Zishu Wei,Qi Tian,Shengyu Zhang,Wenqiao Zhang,Wenjie Wang,Kun Kuang,Tat-Seng Chua,Fei Wu
机构: 1. Zhejiang University(浙江大学); 2. University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校); 3. National University of Singapore(新加坡国立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Published on AAAI’25: The Annual AAAI Conference on Artificial Intelligence

点击查看摘要

Abstract:Deep neural networks have become foundational to advancements in multiple domains, including recommendation systems, natural language processing, and so on. Despite their successes, these models often contain incompatible parameters that can be underutilized or detrimental to model performance, particularly when faced with specific, varying data distributions. Existing research excels in removing such parameters or merging the outputs of multiple different pretrained models. However, the former focuses on efficiency rather than performance, while the latter requires several times more computing and storage resources to support inference. In this paper, we set the goal to explicitly improve these incompatible parameters by leveraging the complementary strengths of different models, thereby directly enhancing the models without any additional parameters. Specifically, we propose Compatibility-aware Knowledge Integration (CKI), which consists of Parameter Compatibility Assessment and Parameter Splicing, which are used to evaluate the knowledge content of multiple models and integrate the knowledge into one model, respectively. The integrated model can be used directly for inference or for further fine-tuning. We conduct extensive experiments on various datasets for recommendation and language tasks, and the results show that Compatibility-aware Knowledge Integration can effectively optimize incompatible parameters under multiple tasks and settings to break through the training limit of the original model without increasing the inference cost.
zh

[NLP-53] Optimizing Speech Multi-View Feature Fusion through Conditional Computation ICASSP2025

【速读】：该论文试图解决自监督学习（SSL）特征与传统频谱特征（如FBanks）在更新方向上存在冲突的问题，这种冲突影响了模型在多视图语音表示任务中的收敛速度和性能。为了解决这一问题，作者提出了一种基于条件计算（conditional computation）的广义特征融合框架，该框架包括一个梯度敏感的选通网络（gradient-sensitive gating network）和一个多阶段丢弃策略（multi-stage dropout strategy）。通过这一框架，作者成功缓解了特征冲突，增强了模型对多视图输入特征的鲁棒性，并在MUSTC数据集上的多个语音翻译任务中实现了与频谱模型相当的性能，同时加速了模型收敛。

链接: https://arxiv.org/abs/2501.08057
作者: Weiqiao Shan,Yuhao Zhang,Yuchen Han,Bei Li,Xiaofeng Zhao,Yuang Li,Min Zhang,Hao Yang,Tong Xiao,Jingbo Zhu
机构: 1School of Computer Science and Engineering, Northeastern University, Shenyang, China (东北大学计算机科学与工程学院, 沈阳, 中国); 2The Chinese University of Hong Kong, Shenzhen, China (香港中文大学, 深圳, 中国); 3Meituan, Beijing, China (美团, 北京, 中国); 4Huawei Translation Services Center, Beijing, China (华为翻译服务中心, 北京, 中国); 5NiuTrans Research, Shenyang, China (NiuTrans研究, 沈阳, 中国)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: ICASSP 2025

点击查看摘要

Abstract:Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MUSTC dataset.
zh

计算机视觉

[CV-0] DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models

【速读】：该论文旨在解决如何通过学习人类与物体交互（Human-Object Interaction, HOI）的动态模式来提升人工智能在日常生活中的应用能力。现有研究主要关注静态场景下的人类-物体模式（如接触、空间关系、方向等），而对随时间变化的人类-物体交互模式（即人类和物体的运动）的研究相对较少。为此，论文提出了一种名为“动态可供性”（Dynamic Affordance）的新概念，通过学习给定3D物体网格的动态可供性，建模人类运动和人类引导的物体姿态在交互过程中的分布。

解决方案的关键在于利用预训练的视频扩散模型（video diffusion model），从合成的2D视频中学习3D动态可供性。具体而言，论文提出了一种流程，首先从3D物体生成2D HOI视频，然后将其提升为3D以生成4D HOI样本。在生成多样化的4D HOI样本后，论文训练了一个名为DAViD的模型，该模型基于低秩适应（Low-Rank Adaptation, LoRA）模块，结合了预训练的人类运动扩散模型（MDM）和带有人类姿态引导的物体姿态扩散模型。通过扩展运动扩散模型以处理多物体交互，论文展示了其流程在结合物体使用概念方面的优势。实验结果表明，DAViD在生成带有HOI的人类运动方面优于基线模型。

链接: https://arxiv.org/abs/2501.08333
作者: Hyeonwoo Kim,Sangwon Beak,Hanbyul Joo
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Understanding the ability of humans to use objects is crucial for AI to improve daily life. Existing studies for learning such ability focus on human-object patterns (e.g., contact, spatial relation, orientation) in static situations, and learning Human-Object Interaction (HOI) patterns over time (i.e., movement of human and object) is relatively less explored. In this paper, we introduce a novel type of affordance named Dynamic Affordance. For a given input 3D object mesh, we learn dynamic affordance which models the distribution of both (1) human motion and (2) human-guided object pose during interactions. As a core idea, we present a method to learn the 3D dynamic affordance from synthetically generated 2D videos, leveraging a pre-trained video diffusion model. Specifically, we propose a pipeline that first generates 2D HOI videos from the 3D object and then lifts them into 3D to generate 4D HOI samples. Once we generate diverse 4D HOI samples on various target objects, we train our DAViD, where we present a method based on the Low-Rank Adaptation (LoRA) module for pre-trained human motion diffusion model (MDM) and an object pose diffusion model with human pose guidance. Our motion diffusion model is extended for multi-object interactions, demonstrating the advantage of our pipeline with LoRA for combining the concepts of object usage. Through extensive experiments, we demonstrate our DAViD outperforms the baselines in generating human motion with HOIs.
zh

[CV-1] MangaNinja: Line Art Colorization with Precise Reference Following

【速读】：该论文旨在解决参考引导的线稿上色（reference-guided line art colorization）任务中的精确细节转录问题。解决方案的关键在于引入了两个创新设计：首先，通过一个补丁混洗模块（patch shuffling module）来促进参考彩色图像与目标线稿之间的对应关系学习；其次，采用点驱动控制方案（point-driven control scheme）以实现细粒度的颜色匹配。这些设计使得模型在处理复杂案例、跨角色上色和多参考协调等现有算法难以应对的场景时表现出色，实验结果表明该模型在精确上色方面优于当前的其他解决方案。

链接: https://arxiv.org/abs/2501.08332
作者: Zhiheng Liu,Ka Leong Cheng,Xi Chen,Jie Xiao,Hao Ouyang,Kai Zhu,Yu Liu,Yujun Shen,Qifeng Chen,Ping Luo
机构: HKU(香港大学); HKUST(香港科技大学); Tongyi Lab(通义实验室); Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page and code: this https URL

点击查看摘要

Abstract:Derived from diffusion models, MangaNinjia specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases, cross-character colorization, multi-reference harmonization, beyond the reach of existing algorithms.
zh

[CV-2] Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

【速读】：该论文旨在解决视频扩散模型（video diffusion models）中运动控制的问题。传统的视频扩散模型通过随机噪声生成视频，缺乏对运动的具体控制。本文提出了一种通过结构化潜在噪声采样来实现运动控制的方法。其关键解决方案是引入了一种新颖的噪声扭曲算法（noise warping algorithm），该算法能够实时运行，并通过光流场（optical flow fields）生成相关的扭曲噪声，从而替代随机的时间高斯噪声，同时保持空间高斯性。这种方法不需要改变扩散模型的架构或训练流程，仅通过对训练视频进行预处理来生成结构化噪声。通过这种方式，本文实现了局部物体运动控制、全局相机运动控制以及运动传递等多种用户友好的运动控制功能，并在保持每帧像素质量的同时，有效提升了时间连贯性。实验和用户研究表明，该方法在视频扩散模型中的运动控制方面具有显著优势。

链接: https://arxiv.org/abs/2501.08331
作者: Ryan Burgert,Yuancheng Xu,Wenqi Xian,Oliver Pilarski,Pascal Clausen,Mingming He,Li Ma,Yitong Deng,Lingxiao Li,Mohsen Mousavi,Michael Ryoo,Paul Debevec,Ning Yu
机构: Netflix Eyeline Studios; Netflix; Stony Brook University(石溪大学); University of Maryland(马里兰大学); Stanford University(斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. This is achieved by just a change in data: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design, requiring no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, while preserving the spatial Gaussianity. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead, and provide a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models. Video results are available on our webpage: this https URL source code and model checkpoints are available on GitHub: this https URL.
zh

[CV-3] Predicting 4D Hand Trajectory from Monocular Videos

【速读】：该论文试图解决从单目视频中推断连贯的四维（4D）手部轨迹的问题。当前基于视频的手部姿态重建方法主要关注利用相邻帧改进逐帧的三维（3D）姿态，而忽略了在空间中研究一致的四维手部轨迹。尽管这些方法利用了额外的时间线索，但由于标注视频数据的稀缺性，它们通常表现不如基于图像的方法。为了解决这些问题，作者重新利用了一种最先进的基于图像的Transformer模型，使其能够处理多帧并直接预测连贯的轨迹。关键解决方案包括引入两种轻量级的注意力层：跨视图自注意力（cross-view self-attention）用于融合时间信息，全局跨注意力（global cross-attention）用于引入更大的空间上下文。该方法在推断四维手部轨迹时与真实轨迹相似，同时保持了强大的二维重投影对齐。该方法在全局轨迹精度上显著优于现有方法，同时在单图像姿态估计方面与最先进的方法相当。

链接: https://arxiv.org/abs/2501.08329
作者: Yufei Ye,Yao Feng,Omid Taheri,Haiwen Feng,Shubham Tulsiani,Michael J. Black
机构: Carnegie Mellon University(卡内基梅隆大学); Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present HaPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space. Despite the additional temporal cues, they generally underperform compared to image-based methods due to the scarcity of annotated video data. To address these issues, we repurpose a state-of-the-art image-based transformer to take in multiple frames and directly predict a coherent trajectory. We introduce two types of lightweight attention layers: cross-view self-attention to fuse temporal information, and global cross-attention to bring in larger spatial context. Our method infers 4D hand trajectories similar to the ground truth while maintaining strong 2D reprojection alignment. We apply the method to both egocentric and allocentric videos. It significantly outperforms existing methods in global trajectory accuracy while being comparable to the state-of-the-art in single-image pose estimation. Project website: this https URL
zh

[CV-4] Omni-RGPT : Unifying Image and Video Region-level Understanding via Token Marks

【速读】：该论文旨在解决多模态大语言模型在图像和视频的区域级理解（region-level comprehension）问题。为了实现跨时空维度的一致区域表示，论文提出了Token Mark机制，通过一组标记（tokens）在视觉特征空间中突出目标区域。这些标记通过区域提示（如边界框或掩码）直接嵌入到空间区域中，并同时融入文本提示中，从而在视觉和文本标记之间建立直接联系。此外，为了在不依赖轨迹（tracklets）的情况下支持鲁棒的视频理解，论文引入了一个辅助任务，利用标记的一致性来引导Token Mark，确保视频中区域的稳定解释。论文还引入了一个大规模的区域级视频指令数据集（RegVID-300k）。Omni-RGPT在图像和视频的常识推理基准测试中取得了最先进的成果，并在描述和指代表达理解任务中表现出色。

链接: https://arxiv.org/abs/2501.08326
作者: Miran Heo,Min-Hung Chen,De-An Huang,Sifei Liu,Subhashree Radhakrishnan,Seon Joo Kim,Yu-Chiang Frank Wang,Ryo Hachiuma
机构: NVIDIA; Yonsei University(延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
zh

[CV-5] GameFactory: Creating New Games with Generative Interactive Videos

【速读】：该论文试图解决现有基于视频的游戏生成方法在场景泛化（scene generalization）方面的不足，这些方法通常局限于固定风格和场景的游戏，难以生成全新且多样化的游戏内容。为解决这一问题，论文提出了GameFactory框架，其关键解决方案包括：利用预训练的视频扩散模型（video diffusion models）来处理开放域视频数据，并通过多阶段训练策略（multi-phase training strategy）将游戏风格学习与动作控制解耦，从而在保持开放域泛化能力的同时实现动作可控性。此外，论文还发布了GF-Minecraft数据集，并扩展了框架以支持自回归动作可控的游戏视频生成，能够生成无限长度的交互式游戏视频。实验结果表明，GameFactory能够有效生成开放域、多样化且动作可控的游戏视频，推动了AI驱动的游戏生成技术的发展。

链接: https://arxiv.org/abs/2501.08325
作者: Jiwen Yu,Yiran Qin,Xintao Wang,Pengfei Wan,Di Zhang,Xihui Liu
机构: The University of Hong Kong (香港大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative game engines have the potential to revolutionize game development by autonomously creating new content and reducing manual workload. However, existing video-based game generation methods fail to address the critical challenge of scene generalization, limiting their applicability to existing games with fixed styles and scenes. In this paper, we present GameFactory, a framework focused on exploring scene generalization in game video generation. To enable the creation of entirely new and diverse games, we leverage pre-trained video diffusion models trained on open-domain video data. To bridge the domain gap between open-domain priors and small-scale game dataset, we propose a multi-phase training strategy that decouples game style learning from action control, preserving open-domain generalization while achieving action controllability. Using Minecraft as our data source, we release GF-Minecraft, a high-quality and diversity action-annotated video dataset for research. Furthermore, we extend our framework to enable autoregressive action-controllable game video generation, allowing the production of unlimited-length interactive game videos. Experimental results demonstrate that GameFactory effectively generates open-domain, diverse, and action-controllable game videos, representing a significant step forward in AI-driven game generation. Our dataset and project page are publicly available at \urlthis https URL.
zh

[CV-6] Diffusion Adversarial Post-Training for One-Step Video Generation

【速读】：该论文试图解决扩散模型（diffusion models）在图像和视频生成过程中迭代生成速度慢且计算成本高的问题。尽管现有的蒸馏方法（distillation approaches）在图像领域展示了一步生成（one-step generation）的潜力，但这些方法仍然存在显著的生成质量下降问题。为此，论文提出了一种对抗性后训练（Adversarial Post-Training, APT）方法，该方法在扩散预训练的基础上，通过对抗训练进一步提升一步生成的质量。关键改进包括模型架构和训练流程的优化，以及引入近似的R1正则化目标（R1 regularization objective）。实验结果表明，经过对抗性后训练的模型Seaweed-APT能够在单次前向评估步骤中实时生成2秒、1280x720分辨率、24fps的视频，并且能够一步生成1024px分辨率的图像，其生成质量与当前最先进的方法相当。

链接: https://arxiv.org/abs/2501.08316
作者: Shanchuan Lin,Xin Xia,Yuxi Ren,Ceyuan Yang,Xuefeng Xiao,Lu Jiang
机构: ByteDance Seed(字节跳动种子)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The diffusion models are widely used for image and video generation, but their iterative generation process is slow and expansive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
zh

[CV-7] Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers

【速读】：该论文旨在解决自主系统在动态环境中导航时的语义未来预测问题。为了解决这一问题，论文提出了FUTURIST方法，该方法采用了一种统一且高效的视觉序列变换器（transformer）架构。关键解决方案包括：1）引入多模态掩码视觉建模目标（multimodal masked visual modeling objective）和一种新颖的掩码机制（masking mechanism），这些设计使得模型能够有效地整合来自不同模态的可见信息，从而提高预测准确性；2）提出了一种无需变分自编码器（VAE-free）的分层标记化（hierarchical tokenization）过程，降低了计算复杂度，简化了训练流程，并支持高分辨率多模态输入的端到端训练。通过在Cityscapes数据集上的验证，FUTURIST在短期和中期的未来语义分割任务中展示了最先进的性能。

链接: https://arxiv.org/abs/2501.08303
作者: Efstathios Karypidis,Ioannis Kakogeorgiou,Spyros Gidaris,Nikos Komodakis
机构: Archimedes/Athena RC; valeo.ai; National Technical University of雅典国立技术大学; University of Crete(克里特大学); IACM-Forth
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code at this https URL .
zh

[CV-8] LayerAnimate: Layer-specific Control for Animation

【速读】：该论文旨在解决现有视频生成方法在处理动画时缺乏对单个动画层的细粒度控制的问题。现有的方法通常将动画视为一个整体的数据域，无法独立操作前景和背景元素。为此，论文提出了LayerAnimate，一种新颖的架构方法，通过在视频扩散模型（video diffusion model）中增强对单个动画层的控制，使用户能够独立操作不同层中的前景和背景元素。解决方案的关键在于引入了一个数据管理流程（data curation pipeline），该流程包括自动元素分割（automated element segmentation）、运动状态层次合并（motion-state hierarchical merging）和运动一致性优化（motion coherence refinement），以应对特定层数据不足的挑战。通过定量和定性比较以及用户研究，LayerAnimate在动画质量、控制精度和可用性方面均优于现有方法，为专业动画师和业余爱好者提供了理想的工具。

链接: https://arxiv.org/abs/2501.08295
作者: Yuxue Yang,Lue Fan,Zuzen Lin,Feng Wang,Zhaoxiang Zhang
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院); NLPR & MAIS, Institute of Automation, Chinese Academy of Science(中国科学院自动化研究所模式识别与机器学习实验室); Tianjin University(天津大学); CreateAI(CreateAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Animated video separates foreground and background elements into layers, with distinct processes for sketching, refining, coloring, and in-betweening. Existing video generation methods typically treat animation as a monolithic data domain, lacking fine-grained control over individual layers. In this paper, we introduce LayerAnimate, a novel architectural approach that enhances fine-grained control over individual animation layers within a video diffusion model, allowing users to independently manipulate foreground and background elements in distinct layers. To address the challenge of limited layer-specific data, we propose a data curation pipeline that features automated element segmentation, motion-state hierarchical merging, and motion coherence refinement. Through quantitative and qualitative comparisons, and user study, we demonstrate that LayerAnimate outperforms current methods in terms of animation quality, control precision, and usability, making it an ideal tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-specific animation applications and creative flexibility. Our code is available at this https URL.
zh

[CV-9] VINGS-Mono: Visual-Inertial Gaussian Splatting Monocular SLAM in Large Scenes

【速读】：该论文旨在解决在大型场景中使用单目视觉惯性（Visual-Inertial）SLAM（Simultaneous Localization and Mapping）系统时面临的挑战，特别是在处理大规模城市环境中的动态物体和全局一致性问题上。解决方案的关键在于提出了VINGS-Mono框架，该框架包含四个主要组件：VIO前端（VIO Front End）、2D高斯地图（2D Gaussian Map）、NVS闭环检测（NVS Loop Closure）和动态物体消除器（Dynamic Eraser）。VIO前端通过密集束调整（dense bundle adjustment）和不确定性估计（uncertainty estimation）提取场景几何和位姿信息，进而构建和维护2D高斯地图。2D高斯地图中的采样光栅化器（Sample-based Rasterizer）、分数管理器（Score Manager）和位姿优化（Pose Refinement）共同提升了地图构建速度和定位精度。NVS闭环检测模块创新性地利用高斯泼溅（Gaussian Splatting）的新视角合成（Novel View Synthesis）能力进行闭环检测和地图校正，确保大规模场景的全局一致性。动态物体消除器则有效处理了现实户外场景中不可避免的动态物体问题。实验表明，VINGS-Mono在定位性能上与视觉惯性里程计（Visual-Inertial Odometry）相当，并显著优于现有的高斯泼溅/神经辐射场（GS/NeRF）SLAM方法，同时在地图构建和渲染质量上表现优异。

链接: https://arxiv.org/abs/2501.08286
作者: Ke Wu,Zicheng Zhang,Muer Tie,Ziqing Ai,Zhongxue Gan,Wenchao Ding
机构: Academy for Engineering and Technology, Fudan University (复旦大学工程与技术学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM framework designed for large scenes. The framework comprises four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. In the VIO Front End, RGB frames are processed through dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. Based on this output, the mapping module incrementally constructs and maintains a 2D Gaussian map. Key components of the 2D Gaussian Map include a Sample-based Rasterizer, Score Manager, and Pose Refinement, which collectively improve mapping speed and localization accuracy. This enables the SLAM system to handle large-scale urban environments with up to 50 million Gaussian ellipsoids. To ensure global consistency in large-scale scenes, we design a Loop Closure module, which innovatively leverages the Novel View Synthesis (NVS) capabilities of Gaussian Splatting for loop closure detection and correction of the Gaussian map. Additionally, we propose a Dynamic Eraser to address the inevitable presence of dynamic objects in real-world outdoor scenes. Extensive evaluations in indoor and outdoor environments demonstrate that our approach achieves localization performance on par with Visual-Inertial Odometry while surpassing recent GS/NeRF SLAM methods. It also significantly outperforms all existing methods in terms of mapping and rendering quality. Furthermore, we developed a mobile app and verified that our framework can generate high-quality Gaussian maps in real time using only a smartphone camera and a low-frequency IMU sensor. To the best of our knowledge, VINGS-Mono is the first monocular Gaussian SLAM method capable of operating in outdoor environments and supporting kilometer-scale large scenes.
zh

[CV-10] Can Bayesian Neural Networks Explicitly Model Input Uncertainty?

【速读】：该论文试图解决机器学习模型输入中存在噪声或不确定性（uncertainty）的问题，并探讨贝叶斯神经网络（Bayesian Neural Networks, BNNs）及其近似方法是否能够有效处理输入不确定性。论文通过构建一个双输入（均值和标准差）的贝叶斯神经网络，评估了不同方法（如集成方法（Ensembles）、蒙特卡洛丢弃法（MC-Dropout）和Flipout）在输入不确定性估计中的表现。研究结果表明，只有部分近似贝叶斯神经网络的不确定性估计方法能够有效建模输入不确定性，特别是集成方法和Flipout。解决方案的关键在于通过实验验证不同方法在处理输入不确定性时的有效性，并识别出适合的近似贝叶斯方法。

链接: https://arxiv.org/abs/2501.08285
作者: Matias Valdenegro-Toro,Marco Zullich
机构: Department of Artificial Intelligence, University of Groningen (格罗宁根大学人工智能系)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 11 figures, VISAPP 2025 camera ready

点击查看摘要

Abstract:Inputs to machine learning models can have associated noise or uncertainties, but they are often ignored and not modelled. It is unknown if Bayesian Neural Networks and their approximations are able to consider uncertainty in their inputs. In this paper we build a two input Bayesian Neural Network (mean and standard deviation) and evaluate its capabilities for input uncertainty estimation across different methods like Ensembles, MC-Dropout, and Flipout. Our results indicate that only some uncertainty estimation methods for approximate Bayesian NNs can model input uncertainty, in particular Ensembles and Flipout.
zh

[CV-11] LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

【速读】：该论文试图解决多模态大语言模型（MLLMs）在处理时空定位（spatial-temporal localization）时面临的挑战。具体来说，现有方法难以同时有效地处理时间和空间定位，主要原因有两个：一是时空定位引入了大量的坐标组合，使得语言和视觉坐标表示的对齐变得复杂；二是在视频特征压缩过程中，编码细粒度的时间和空间信息本身具有难度。

为解决这些问题，论文提出了LLaVA-ST模型，其关键解决方案包括：1）语言对齐位置嵌入（Language-Aligned Positional Embedding），将文本坐标特殊标记嵌入视觉空间，简化细粒度时空对应关系的对齐；2）时空打包器（Spatial-Temporal Packer），将时间和空间分辨率特征压缩解耦为两个独立的点对区域注意力处理流；3）ST-Align数据集，包含430万训练样本，用于细粒度时空多模态理解。此外，论文还提出了渐进式训练管道，通过从粗到细的顺序对齐视觉和文本特征，并引入了ST-Align基准来评估时空交错细粒度理解任务。LLaVA-ST在11个需要细粒度时间、空间或时空交错多模态理解的基准测试中表现优异。

链接: https://arxiv.org/abs/2501.08282
作者: Hongyu Li,Jinyu Chen,Ziyu Wei,Shaofei Huang,Tianrui Hui,Jialin Gao,Xiaoming Wei,Si Liu
机构: School of Artificial Intelligence, Beihang University (北京航空航天大学人工智能学院); School of Computer Science and Information Engineering, Hefei University of Technology (合肥工业大学计算机与信息工程学院); Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, a MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-align, we present a progressive training pipeline that aligns the visual and textual feature through sequential coarse-to-fine this http URL, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG) , Event Localization and Captioning (ELC) and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data and benchmark will be released at Our code, data and benchmark will be released at this https URL .
zh

[CV-12] SmartEraser: Remove Anything from Images using Masked-Region Guidance

【速读】：该论文旨在解决现有物体移除（object removal）方法中存在的上下文信息缺失问题。传统方法采用“掩码-修复”（mask-and-inpaint）范式，即在输入中排除掩码区域，依赖未掩码区域来修复缺失部分。然而，这种方法缺乏对掩码区域的上下文信息，导致性能不稳定。论文提出了一种名为SmartEraser的新方法，其核心是引入“掩码区域引导”（Masked-Region Guidance）范式。该范式在输入中保留掩码区域，并将其作为移除过程的引导信息。这一方法具有两个关键优势：(a) 引导模型准确识别待移除物体，防止其在输出中重新生成；(b) 由于用户掩码通常超出物体本身，有助于在最终结果中保留周围上下文。此外，论文还提出了Syn4Removal数据集，利用实例分割数据将物体复制粘贴到图像上作为移除目标，并以原始图像作为真实值。实验结果表明，SmartEraser在复杂场景中显著优于现有方法，尤其在处理具有复杂结构的场景时表现出色。

链接: https://arxiv.org/abs/2501.08279
作者: Longtao Jiang,Zhendong Wang,Jianmin Bao,Wengang Zhou,Dongdong Chen,Lei Shi,Dong Chen,Houqiang Li
机构: University of Science and Technology of China(中国科学技术大学); Microsoft Research Asia(微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project at: this https URL

点击查看摘要

Abstract:Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that SmartEraser significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions.
zh

[CV-13] AI Driven Water Segmentation with deep learning models for Enhanced Flood Monitoring

【速读】：该论文旨在解决洪水灾害的快速和准确检测与监测问题，以减轻因气候变化导致的洪水频率增加所带来的生命和经济损失。解决方案的关键在于利用深度学习模型（如UNet、ResNet和DeepLabv3）进行像素级的水体分割，从而自动识别和隔离图像中的洪水区域。研究通过创建一个新的数据集，结合无人机、实地观测和社交媒体图像，增强了模型的鲁棒性。这些模型在不同环境条件和地理位置的性能得到了测试，并讨论了各自的优势和局限性。通过预测图像分割掩码，该方法显著减少了处理时间，相比传统的半自动化方法，能够更快速生成洪水地图，为应急响应团队提供关键数据，减少生命和经济损失。此外，研究还提出了未来研究方向，包括多模态数据源的集成和专门针对洪水检测任务的深度学习架构的开发。

链接: https://arxiv.org/abs/2501.08266
作者: Sanjida Afrin Mou(1),Tasfia Noor Chowdhury(2),Adib Ibn Mannan(3),Sadia Nourin Mim(4),Lubana Tarannum(5),Tasrin Noman(6),Jamal Uddin Ahamed((1) Department of Mechatronics amp; Industrial Engineering, Chittagong University of Engineering amp; Technology (CUET), Chattogram, Bangladesh (2) Department of Mechanical Engineering, Chittagong University of Engineering amp; Technology (CUET), Chattogram, Bangladesh)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Flooding is a major natural hazard causing significant fatalities and economic losses annually, with increasing frequency due to climate change. Rapid and accurate flood detection and monitoring are crucial for mitigating these impacts. This study compares the performance of three deep learning models UNet, ResNet, and DeepLabv3 for pixelwise water segmentation to aid in flood detection, utilizing images from drones, in field observations, and social media. This study involves creating a new dataset that augments wellknown benchmark datasets with flood-specific images, enhancing the robustness of the models. The UNet, ResNet, and DeepLab v3 architectures are tested to determine their effectiveness in various environmental conditions and geographical locations, and the strengths and limitations of each model are also discussed here, providing insights into their applicability in different scenarios by predicting image segmentation masks. This fully automated approach allows these models to isolate flooded areas in images, significantly reducing processing time compared to traditional semi-automated methods. The outcome of this study is to predict segmented masks for each image effected by a flood disaster and the validation accuracy of these models. This methodology facilitates timely and continuous flood monitoring, providing vital data for emergency response teams to reduce loss of life and economic damages. It offers a significant reduction in the time required to generate flood maps, cutting down the manual processing time. Additionally, we present avenues for future research, including the integration of multimodal data sources and the development of robust deep learning architectures tailored specifically for flood detection tasks. Overall, our work contributes to the advancement of flood management strategies through innovative use of deep learning technologies.
zh

[CV-14] owards an End-to-End (E2E) Adversarial Learning and Application in the Physical World

【速读】：该论文试图解决传统基于补丁（patch-based）的对抗攻击在从数字域（digital domain）转移到物理域（physical domain）时性能下降的问题，特别是由于对抗补丁在数字域和物理域之间的可转移性（transferability）有限。为了解决这一问题，论文提出了物理域对抗补丁学习增强（Physical-domain Adversarial Patch Learning Augmentation, PAPLA）框架，这是一种新颖的端到端（end-to-end, E2E）框架，通过使用投影仪将对抗学习从数字域转换到物理域。PAPLA框架的关键在于直接在物理域中进行对抗学习，从而避免了数字域到物理域的可转移性问题，并在多种场景下（包括实验室环境和真实户外环境）验证了其攻击成功率优于传统的数字学习-物理应用（DL-PA）方法。此外，论文还分析了环境因素（如投影表面颜色、投影仪强度、环境光、目标物体与相机的距离和角度）对投影补丁效果的影响，并展示了在真实户外环境中对停车车辆和停车标志进行攻击的可行性。

链接: https://arxiv.org/abs/2501.08258
作者: Dudi Biton,Jacob Shams,Koda Satoru,Asaf Shabtai,Yuval Elovici,Ben Nassi
机构: Ben-Gurion University of the Negev(本古里安大学); Fujitsu Limited(富士通株式会社)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The traditional learning process of patch-based adversarial attacks, conducted in the digital domain and then applied in the physical domain (e.g., via printed stickers), may suffer from reduced performance due to adversarial patches’ limited transferability from the digital domain to the physical domain. Given that previous studies have considered using projectors to apply adversarial attacks, we raise the following question: can adversarial learning (i.e., patch generation) be performed entirely in the physical domain with a projector? In this work, we propose the Physical-domain Adversarial Patch Learning Augmentation (PAPLA) framework, a novel end-to-end (E2E) framework that converts adversarial learning from the digital domain to the physical domain using a projector. We evaluate PAPLA across multiple scenarios, including controlled laboratory settings and realistic outdoor environments, demonstrating its ability to ensure attack success compared to conventional digital learning-physical application (DL-PA) methods. We also analyze the impact of environmental factors, such as projection surface color, projector strength, ambient light, distance, and angle of the target object relative to the camera, on the effectiveness of projected patches. Finally, we demonstrate the feasibility of the attack against a parked car and a stop sign in a real-world outdoor environment. Our results show that under specific conditions, E2E adversarial learning in the physical domain eliminates the transferability issue and ensures evasion by object detectors. Finally, we provide insights into the challenges and opportunities of applying adversarial learning in the physical domain and explain where such an approach is more effective than using a sticker.
zh

[CV-15] Continual Deep Active Learning for Medical Imaging: Replay-Base Architecture for Context Adaptation

【速读】：该论文旨在解决医学影像分析中深度学习模型在新环境下的适应性和泛化能力不足，以及特定任务中标注数据不足的问题。解决方案的关键在于结合持续学习（Continual Learning, CL）和主动学习（Active Learning, AL）的方法，提出了一种名为“基于回放的上下文适应架构”（Replay-Base Architecture for Context Adaptation, RBACA）的新框架。该框架通过CL的复习机制（rehearsal method）从多样化的上下文中持续学习，并通过AL组件选择最具信息量的样本进行标注，从而减少标注工作量。此外，论文还提出了一种新的评估指标IL-Score，用于同时评估迁移学习、遗忘和最终模型性能。实验结果表明，RBACA在心脏图像的分割和诊断任务中，优于未使用CAL的基线框架和现有的CAL方法，且在不同内存大小和标注预算下均表现出色。

链接: https://arxiv.org/abs/2501.08245
作者: Rui Daniel,M. Rita Verdelho,Catarina Barata,Carlos Santiago
机构: Instituto de Sistemas e Robótica, Instituto Superior Técnico - Universidade de Lisboa, Portugal (系统与机器人研究所, 里斯本高等技术学院 - 里斯本大学, 葡萄牙)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep Learning for medical imaging faces challenges in adapting and generalizing to new contexts. Additionally, it often lacks sufficient labeled data for specific tasks requiring significant annotation effort. Continual Learning (CL) tackles adaptability and generalizability by enabling lifelong learning from a data stream while mitigating forgetting of previously learned knowledge. Active Learning (AL) reduces the number of required annotations for effective training. This work explores both approaches (CAL) to develop a novel framework for robust medical image analysis. Based on the automatic recognition of shifts in image characteristics, Replay-Base Architecture for Context Adaptation (RBACA) employs a CL rehearsal method to continually learn from diverse contexts, and an AL component to select the most informative instances for annotation. A novel approach to evaluate CAL methods is established using a defined metric denominated IL-Score, which allows for the simultaneous assessment of transfer learning, forgetting, and final model performance. We show that RBACA works in domain and class-incremental learning scenarios, by assessing its IL-Score on the segmentation and diagnosis of cardiac images. The results show that RBACA outperforms a baseline framework without CAL, and a state-of-the-art CAL method across various memory sizes and annotation budgets. Our code is available in this https URL .
zh

[CV-16] A Feature-Level Ensemble Model for COVID-19 Identification in CXR Images using Choquet Integral and Differential Evolution Optimization

【速读】：该论文旨在解决COVID-19诊断中RT-PCR（逆转录聚合酶链反应）存在的假阴性问题，提出了一种基于深度学习（Deep Learning）的诊断系统，通过整合预训练的深度卷积神经网络（Deep Convolutional Neural Networks, DCNNs）来精确识别胸部X光（Chest X-ray, CXR）图像中的COVID-19病例。解决方案的关键在于采用集成学习（ensemble learning）框架，结合Choquet积分（Choquet integral）来捕捉不同DCNNs之间的非线性交互作用，并使用Sugeno-λ测度理论（Sugeno-λ measure theory）推导模糊测度（fuzzy measures），通过差分进化（Differential Evolution）估计模糊密度（fuzzy densities）。此外，论文开发了一个基于TensorFlow的Choquet操作层，以实现高效的特征向量聚合。实验结果表明，该集成模型在COVIDx数据集上的三分类和二元分类任务中分别达到了98%和99.50%的准确率，显著优于其组件模型（如DenseNet-201、Inception-v3和Xception）及许多先前的方法。

链接: https://arxiv.org/abs/2501.08241
作者: Amir Reza Takhsha,Maryam Rastgarpour,Mozhgan Naderi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The COVID-19 pandemic has profoundly impacted billions globally. It challenges public health and healthcare systems due to its rapid spread and severe respiratory effects. An effective strategy to mitigate the COVID-19 pandemic involves integrating testing to identify infected individuals. While RT-PCR is considered the gold standard for diagnosing COVID-19, it has some limitations such as the risk of false negatives. To address this problem, this paper introduces a novel Deep Learning Diagnosis System that integrates pre-trained Deep Convolutional Neural Networks (DCNNs) within an ensemble learning framework to achieve precise identification of COVID-19 cases from Chest X-ray (CXR) images. We combine feature vectors from the final hidden layers of pre-trained DCNNs using the Choquet integral to capture interactions between different DCNNs that a linear approach cannot. We employed Sugeno- \lambda measure theory to derive fuzzy measures for subsets of networks to enable aggregation. We utilized Differential Evolution to estimate fuzzy densities. We developed a TensorFlow-based layer for Choquet operation to facilitate efficient aggregation, due to the intricacies involved in aggregating feature vectors. Experimental results on the COVIDx dataset show that our ensemble model achieved 98% accuracy in three-class classification and 99.50% in binary classification, outperforming its components-DenseNet-201 (97% for three-class, 98.75% for binary), Inception-v3 (96.25% for three-class, 98.50% for binary), and Xception (94.50% for three-class, 98% for binary)-and surpassing many previous methods.
zh

[CV-17] Efficient Deep Learning-based Forward Solvers for Brain Tumor Growth Models

【速读】：该论文试图解决胶质母细胞瘤（Glioblastoma）治疗中的模型校准问题，特别是在放疗规划中通过基于偏微分方程（PDE）的模型模拟患者特异性肿瘤行为时，优化方法（如蒙特卡罗采样和进化算法）的高计算需求成为瓶颈。为了解决这一问题，论文提出了一种利用神经前向求解器（neural forward solver）结合基于梯度的优化方法，以显著减少校准时间。该方案的关键在于需要一个高度精确且完全可微分的前向模型。论文研究了多种架构，包括增强的TumorSurrogate、改进的nnU-Net和3D Vision Transformer（ViT），其中优化的TumorSurrogate在肿瘤轮廓匹配和肿瘤细胞浓度的体素级预测方面表现最佳，显著提升了前向求解器的性能。

链接: https://arxiv.org/abs/2501.08226
作者: Zeineb Haouari,Jonas Weidner,Ivan Ezhov,Aswathi Varma,Daniel Rueckert,Bjoern Menze,Benedikt Wiestler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Glioblastoma, a highly aggressive brain tumor, poses major challenges due to its poor prognosis and high morbidity rates. Partial differential equation-based models offer promising potential to enhance therapeutic outcomes by simulating patient-specific tumor behavior for improved radiotherapy planning. However, model calibration remains a bottleneck due to the high computational demands of optimization methods like Monte Carlo sampling and evolutionary algorithms. To address this, we recently introduced an approach leveraging a neural forward solver with gradient-based optimization to significantly reduce calibration time. This approach requires a highly accurate and fully differentiable forward model. We investigate multiple architectures, including (i) an enhanced TumorSurrogate, (ii) a modified nnU-Net, and (iii) a 3D Vision Transformer (ViT). The optimized TumorSurrogate achieved the best overall results, excelling in both tumor outline matching and voxel-level prediction of tumor cell concentration. It halved the MSE relative to the baseline model and achieved the highest Dice score across all tumor cell concentration thresholds. Our study demonstrates significant enhancement in forward solver performance and outlines important future research directions.
zh

[CV-18] FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

【速读】：该论文旨在解决交互式图像编辑（Interactive Image Editing）中的两个主要问题：一是现有方法通常依赖于大量训练样本，二是需要额外的参考编码器来学习真实世界的动态和视觉一致性。为了解决这些问题，论文将任务重新定义为图像到视频生成（Image-to-Video Generation）问题，从而继承强大的视频扩散先验（Video Diffusion Priors），以减少训练成本并确保时间一致性。具体而言，论文提出了FramePainter作为该任务的高效实现，其基于Stable Video Diffusion初始化，仅使用轻量级的稀疏控制编码器（Sparse Control Encoder）来注入编辑信号。此外，论文还提出了匹配注意力机制（Matching Attention），以扩大感受野并增强编辑图像与源图像之间的密集对应关系。通过这些关键创新，FramePainter在多种编辑信号下表现出色，显著优于现有方法，并在训练数据较少的情况下实现了高度无缝和连贯的图像编辑。

链接: https://arxiv.org/abs/2501.08225
作者: Yabo Zhang,Xinpeng Zhou,Yihan Zeng,Hang Xu,Hui Li,Wangmeng Zuo
机构: Harbin Institute of Technology(哈尔滨工业大学); Huawei Noah’s Ark Lab(华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, so necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, so that inherit powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various of editing signals: it domainantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, \eg, automatically adjust the reflection of the cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, \eg, transform the clownfish into shark-like shape. Our code will be available at this https URL.
zh

[CV-19] EmoNeXt: an Adapted ConvNeXt for Facial Emotion Recognition

【速读】：该论文旨在解决面部表情识别（Facial Expression Recognition, FER）中的准确性问题。面部表情在人类交流中扮演着重要角色，能够表达广泛的情感。随着人工智能和计算机视觉的进步，深度神经网络已成为面部情感识别的有效工具。论文提出的解决方案是EmoNeXt，一种基于改进的ConvNeXt架构的深度学习框架。关键创新点包括：1）集成空间变换网络（Spatial Transformer Network, STN）以聚焦于面部特征丰富的区域；2）引入Squeeze-and-Excitation模块以捕捉通道间的依赖关系；3）提出自注意力正则化项，促使模型生成紧凑的特征向量。通过在FER2013数据集上的实验，EmoNeXt在情感分类准确率上优于现有的最先进深度学习模型。

链接: https://arxiv.org/abs/2501.08199
作者: Yassine El Boudouri,Amine Bohi
机构: CESI LINEACT Laboratory (CESI LINEACT 实验室); UR 7527 (UR 7527)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures and 2 tables. 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP), Poitiers, France

点击查看摘要

Abstract:Facial expressions play a crucial role in human communication serving as a powerful and impactful means to express a wide range of emotions. With advancements in artificial intelligence and computer vision, deep neural networks have emerged as effective tools for facial emotion recognition. In this paper, we propose EmoNeXt, a novel deep learning framework for facial expression recognition based on an adapted ConvNeXt architecture network. We integrate a Spatial Transformer Network (STN) to focus on feature-rich regions of the face and Squeeze-and-Excitation blocks to capture channel-wise dependencies. Moreover, we introduce a self-attention regularization term, encouraging the model to generate compact feature vectors. We demonstrate the superiority of our model over existing state-of-the-art deep learning models on the FER2013 dataset regarding emotion classification accuracy.
zh

[CV-20] Self-supervised Deep Hyperspectral Inpainting with the Plug and Play and Deep Image Prior Models

【速读】：该论文旨在解决高光谱图像（Hyperspectral images）在处理过程中可能受到噪声、失真或数据丢失等问题的影响，导致图像质量下降的问题。为了解决这一问题，论文提出了一种名为LRS-PnP-DIP(1-Lip)的收敛保证算法。该算法的关键在于结合了低秩（low-rank）和稀疏（sparse）模型，进一步挖掘了数据的内在结构，超越了传统且有时限制性较强的子空间联合模型。通过稳定性分析，该算法在温和假设下保证了收敛性，这对于其在实际场景中的应用至关重要。实验结果表明，该算法在视觉和定量上都表现出色，达到了最先进的修复效果。

链接: https://arxiv.org/abs/2501.08195
作者: Shuo Li,Mehrdad Yaghoobi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 31 pages, 9 Figures, 7 Tables. arXiv admin note: text overlap with arXiv:2306.08128

点击查看摘要

Abstract:Hyperspectral images are typically composed of hundreds of narrow and contiguous spectral bands, each containing information regarding the material composition of the imaged scene. However, these images can be affected by various sources of noise, distortions, or data loss, which can significantly degrade their quality and usefulness. This paper introduces a convergent guaranteed algorithm, LRS-PnP-DIP(1-Lip), which successfully addresses the instability issue of DHP that has been reported before. The proposed algorithm extends the successful joint low-rank and sparse model to further exploit the underlying data structures beyond the conventional and sometimes restrictive unions of subspace models. A stability analysis guarantees the convergence of the proposed algorithm under mild assumptions , which is crucial for its application in real-world scenarios. Extensive experiments demonstrate that the proposed solution consistently delivers visually and quantitatively superior inpainting results, establishing state-of-the-art performance.
zh

[CV-21] A Critical Synthesis of Uncertainty Quantification and Foundation Models in Monocular Depth Estimation

【速读】：该论文试图解决单目深度估计（monocular depth estimation）中度量深度估计（metric depth estimation）的可靠性和安全性问题，特别是在实际应用中如何减少关键错误并提高模型的可信度。解决方案的关键在于将五种不同的不确定性量化方法（uncertainty quantification methods）与当前最先进的DepthAnythingV2基础模型（foundation model）进行融合。通过这种方法，论文评估了这些方法在四个不同数据集上的表现，发现使用高斯负对数似然损失（Gaussian Negative Log-Likelihood Loss, GNLL）进行微调是一种特别有前景的方法，能够在保持预测性能和计算效率的同时，提供可靠的不确定性估计。这一研究为未来改进模型性能和可解释性奠定了基础，并为将不确定性量化与基础模型结合应用于其他关键任务（如语义分割和姿态估计）提供了新的研究方向。

链接: https://arxiv.org/abs/2501.08188
作者: Steven Landgraf,Rongjun Qin,Markus Ulrich
机构: Institute of Photogrammetry and Remote Sensing (IPF), Karlsruhe Institute of Technology (KIT), Germany (摄影测量与遥感研究所，卡尔斯鲁厄理工学院，德国); The Ohio State University, Columbus, Ohio, United States (俄亥俄州立大学，哥伦布，俄亥俄州，美国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While recent foundation models have enabled significant breakthroughs in monocular depth estimation, a clear path towards safe and reliable deployment in the real-world remains elusive. Metric depth estimation, which involves predicting absolute distances, poses particular challenges, as even the most advanced foundation models remain prone to critical errors. Since quantifying the uncertainty has emerged as a promising endeavor to address these limitations and enable trustworthy deployment, we fuse five different uncertainty quantification methods with the current state-of-the-art DepthAnythingV2 foundation model. To cover a wide range of metric depth domains, we evaluate their performance on four diverse datasets. Our findings identify fine-tuning with the Gaussian Negative Log-Likelihood Loss (GNLL) as a particularly promising approach, offering reliable uncertainty estimates while maintaining predictive performance and computational efficiency on par with the baseline, encompassing both training and inference time. By fusing uncertainty quantification and foundation models within the context of monocular depth estimation, this paper lays a critical foundation for future research aimed at improving not only model performance but also its explainability. Extending this critical synthesis of uncertainty quantification and foundation models into other crucial tasks, such as semantic segmentation and pose estimation, presents exciting opportunities for safer and more reliable machine vision systems.
zh

[CV-22] CG-MER: A Card Game-based Multimodal dataset for Emotion Recognition

【速读】：该论文旨在解决情感计算（affective computing）领域中情感识别研究的资源不足问题，特别是缺乏多模态数据集的问题。为此，作者提出了一种新颖且全面的法语多模态数据集，涵盖面部表情、语音和手势三种主要模态，为情感识别提供了更全面的视角。该数据集通过让参与者在卡片游戏会话中表达多种情感来构建，并具有扩展性，未来可结合自然语言处理（Natural Language Processing, NLP）等更多模态。这一数据集为深入研究情感识别及其与数字技术之间的复杂关系提供了重要资源。

链接: https://arxiv.org/abs/2501.08182
作者: Nessrine Farhat,Amine Bohi,Leila Ben Letaifa,Rim Slama
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 8 pages, 2 figures and 4 tables. Sixteenth International Conference on Machine Vision (ICMV 2023), Yerevan, Armenia

点击查看摘要

Abstract:The field of affective computing has seen significant advancements in exploring the relationship between emotions and emerging technologies. This paper presents a novel and valuable contribution to this field with the introduction of a comprehensive French multimodal dataset designed specifically for emotion recognition. The dataset encompasses three primary modalities: facial expressions, speech, and gestures, providing a holistic perspective on emotions. Moreover, the dataset has the potential to incorporate additional modalities, such as Natural Language Processing (NLP) to expand the scope of emotion recognition research. The dataset was curated through engaging participants in card game sessions, where they were prompted to express a range of emotions while responding to diverse questions. The study included 10 sessions with 20 participants (9 females and 11 males). The dataset serves as a valuable resource for furthering research in emotion recognition and provides an avenue for exploring the intricate connections between human emotions and digital technologies.
zh

[CV-23] D2-DPM: Dual Denoising for Quantized Diffusion Probabilistic Models AAAI2025

【速读】：该论文旨在解决扩散模型（Diffusion Models）在图像生成中由于去噪过程冗长和评分估计网络计算密集而导致的可扩展性问题，特别是在低延迟和资源受限的场景中。现有的后训练量化（Post-training Quantization, PTQ）方法虽然能够在不重新训练的情况下压缩和加速扩散模型，但会引入额外的量化噪声，导致均值和方差偏差。论文提出的解决方案D2-DPM（Dual Denoising Mechanism）通过双重去噪机制精确缓解量化噪声对噪声估计网络的不利影响。具体而言，D2-DPM将量化噪声对采样方程的影响分解为均值偏差和方差偏差两部分：均值偏差改变采样方程的漂移系数，影响轨迹趋势；方差偏差则放大扩散系数，影响采样轨迹的收敛性。D2-DPM在每个时间步对量化噪声进行去噪，然后通过逆扩散迭代对噪声样本进行去噪。实验结果表明，D2-DPM在生成质量上优于全精度模型，FID（Fréchet Inception Distance）降低了1.42，同时实现了3.99倍的压缩和11.67倍的比特操作加速。

链接: https://arxiv.org/abs/2501.08180
作者: Qian Zeng,Jie Song,Han Zheng,Hao Jiang,Mingli Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, acceptted by AAAI2025

点击查看摘要

Abstract:Diffusion models have achieved cutting-edge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first unravel the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D2-DPM achieves superior generation quality, yielding a 1.42 lower FID than the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.
zh

[CV-24] Object-Centric 2D Gaussian Splatting: Background Removal and Occlusion-Aware Pruning for Compact Object Models ICPR

【速读】：该论文旨在解决当前高斯泼溅（Gaussian Splatting）方法在重建整个场景时无法针对特定对象进行重建的问题，导致计算成本高且不适用于对象特定的应用场景。论文提出了一种新颖的方法，利用对象掩码（object masks）实现针对性的重建，从而生成以对象为中心的模型。此外，论文引入了一种遮挡感知的剪枝策略（occlusion-aware pruning strategy），在不影响质量的前提下最小化高斯分布的数量。该方法能够重建紧凑的对象模型，生成的对象中心高斯和网格表示比基线方法小96%，训练速度快71%，同时保持竞争性的质量。这些表示可直接用于下游应用，如外观编辑和物理模拟，无需额外处理。

链接: https://arxiv.org/abs/2501.08174
作者: Marcel Rogge,Didier Stricker
机构: Augmented Vision, University of Kaiserslautern-Landau (凯泽斯劳滕-兰道大学); Department of Augmented Vision, Deutsches Forschungszentrum fuer Kuenstliche Intelligenz (德国人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICPRAM 2025 ( this https URL )

点击查看摘要

Abstract:Current Gaussian Splatting approaches are effective for reconstructing entire scenes but lack the option to target specific objects, making them computationally expensive and unsuitable for object-specific applications. We propose a novel approach that leverages object masks to enable targeted reconstruction, resulting in object-centric models. Additionally, we introduce an occlusion-aware pruning strategy to minimize the number of Gaussians without compromising quality. Our method reconstructs compact object models, yielding object-centric Gaussian and mesh representations that are up to 96% smaller and up to 71% faster to train compared to the baseline while retaining competitive quality. These representations are immediately usable for downstream applications such as appearance editing and physics simulation without additional processing.
zh

[CV-25] Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features

【速读】：该论文旨在解决如何评估多模态模型（multimodal models）在分析和解释图像方面的能力问题。为了解决这一问题，作者设计了一个基准测试（benchmark），重点关注七个关键的视觉方面：主体对象（main object）、附加对象（additional objects）、背景（background）、细节（detail）、主色调（dominant colors）、风格（style）和视角（viewpoint）。通过使用由多样化文本提示生成的14,580张图像数据集，评估了七种领先的多模态模型在准确识别和描述这些视觉方面的能力。该基准测试的关键在于提供了一个系统化的评估框架，揭示了这些模型在全面图像理解任务中的优势和不足，从而为开发者和研究人员在选择和优化多模态模型时提供了重要参考。

链接: https://arxiv.org/abs/2501.08170
作者: Evgenii Evstafev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 2 tables, 2 charts

点击查看摘要

Abstract:This article introduces a benchmark designed to evaluate the capabilities of multimodal models in analyzing and interpreting images. The benchmark focuses on seven key visual aspects: main object, additional objects, background, detail, dominant colors, style, and viewpoint. A dataset of 14,580 images, generated from diverse text prompts, was used to assess the performance of seven leading multimodal models. These models were evaluated on their ability to accurately identify and describe each visual aspect, providing insights into their strengths and weaknesses for comprehensive image understanding. The findings of this benchmark have significant implications for the development and selection of multimodal models for various image analysis tasks.
zh

[CV-26] Revolutionizing Communication with Deep Learning and XAI for Enhanced Arabic Sign Language Recognition

【速读】：该论文旨在解决阿拉伯手语（Arabic Sign Language, ArSL）识别问题，通过集成先进的深度学习模型（如MobileNetV3、ResNet50和EfficientNet-B2）并结合可解释人工智能（Explainable AI, XAI）技术，提升模型的识别准确性和可解释性。关键解决方案包括：1）采用复杂的数据增强方法以缓解类别不平衡问题；2）实施分层5折交叉验证（stratified 5-fold cross-validation）以提高模型的泛化能力；3）使用Grad-CAM（Gradient-weighted Class Activation Mapping）技术增强模型决策的透明度。通过这些创新，EfficientNet-B2在ArSL2018和RGB阿拉伯字母手语（AASL）数据集上分别达到了99.48%和98.99%的峰值准确率，为医疗、教育和包容性通信技术等领域的应用提供了新的基准。

链接: https://arxiv.org/abs/2501.08169
作者: Mazen Balat,Rewaa Awaad,Ahmed B. Zaky,Salah A. Aly
机构: CS & IT Department, Egypt-Japanese University of Science & Technology, Alexandria, Egypt(埃及-日本科学技术大学计算机科学与信息技术系); Faculty of Computing and Data Science, Badya University, Giza, Egypt(巴迪亚大学计算与数据科学学院); Computer Science Section, Faculty of Science, Fayoum University, Fayoum, Egypt(法尤姆大学理学院计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 13 pages, 25 figures, 16 tables

点击查看摘要

Abstract:This study introduces an integrated approach to recognizing Arabic Sign Language (ArSL) using state-of-the-art deep learning models such as MobileNetV3, ResNet50, and EfficientNet-B2. These models are further enhanced by explainable AI (XAI) techniques to boost interpretability. The ArSL2018 and RGB Arabic Alphabets Sign Language (AASL) datasets are employed, with EfficientNet-B2 achieving peak accuracies of 99.48% and 98.99%, respectively. Key innovations include sophisticated data augmentation methods to mitigate class imbalance, implementation of stratified 5-fold cross-validation for better generalization, and the use of Grad-CAM for clear model decision transparency. The proposed system not only sets new benchmarks in recognition accuracy but also emphasizes interpretability, making it suitable for applications in healthcare, education, and inclusive communication technologies.
zh

[CV-27] Energy Backdoor Attack to Deep Neural Networks

【速读】：该论文旨在解决深度学习（DL）加速器在边缘和移动部署中的能量攻击（energy attacks）问题，特别是针对基于稀疏性加速器（sparsity-based accelerators）的深度神经网络（DNNs）的能量后门攻击（energy backdoor attacks）。尽管已有研究探讨了推理时间的能量攻击，但能量后门攻击尚未被深入研究。论文提出了一种创新的能量后门攻击方法，分为两个阶段：后门注入（backdoor injection）和后门隐蔽性（backdoor stealthiness）。通过在ResNet-18和MobileNet-V2模型上使用CIFAR-10和Tiny ImageNet数据集进行实验，证明了该攻击在触发样本上显著增加能量消耗的同时，保持模型对正常输入的性能。这一研究揭示了DNNs在能量后门攻击下的脆弱性。

链接: https://arxiv.org/abs/2501.08152
作者: Hanene F. Z. Brachemi Meftah,Wassim Hamidouche,Sid Ahmed Fezza,Olivier Déforges,Kassem Kallas
机构: 1Univ. Rennes, INSA Rennes, CNRS, IETR, UMR 6164, Rennes, France (雷恩大学, 雷恩国立应用科学学院, 法国国家科学研究中心, 电子与电信研究所, 联合研究单位6164, 法国雷恩); 2KU 6G Research Center, Department of Computer and Information Engineering, Khalifa University, Abu Dhabi, UAE (哈利法大学计算机与信息工程系, 6G研究中心, 阿联酋阿布扎比); 3National Higher School of Telecommunications and ICT, Oran, Algeria (国家高等电信与信息通信技术学院, 阿尔及利亚奥兰); 4National Institute of Health and Medical Research, LaTIM, Brest, France (法国国家健康与医学研究院, 拉蒂姆研究所, 法国布雷斯特)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rise of deep learning (DL) has increased computing complexity and energy use, prompting the adoption of application specific integrated circuits (ASICs) for energy-efficient edge and mobile deployment. However, recent studies have demonstrated the vulnerability of these accelerators to energy attacks. Despite the development of various inference time energy attacks in prior research, backdoor energy attacks remain unexplored. In this paper, we design an innovative energy backdoor attack against deep neural networks (DNNs) operating on sparsity-based accelerators. Our attack is carried out in two distinct phases: backdoor injection and backdoor stealthiness. Experimental results using ResNet-18 and MobileNet-V2 models trained on CIFAR-10 and Tiny ImageNet datasets show the effectiveness of our proposed attack in increasing energy consumption on trigger samples while preserving the model’s performance for clean/regular inputs. This demonstrates the vulnerability of DNNs to energy backdoor attacks. The source code of our attack is available at: this https URL.
zh

[CV-28] Bootstrapping Corner Cases: High-Resolution Inpainting for Safety Critical Detect and Avoid for Automated Flying

【速读】：该论文试图解决在无人机自动飞行中，用于安全关键功能“检测与避让”（Detect and Avoid）的目标检测问题。由于目标检测本身属于极端情况（corner case），生成高质量且大规模的数据集是一个不适定问题（ill-posed problem），现有模型往往因原始数据中地面实况（ground truth）有限而导致检测率低下且不可靠。论文通过使用图像修复（inpainting）方法来自举（bootstrap）数据集，使其明确包含原始数据中的极端情况，从而克服这一问题。解决方案的关键在于利用生成模型和图像修复技术，从小规模标注数据集中生成高分辨率数据集，并通过独立的目标检测器验证其有效性。

链接: https://arxiv.org/abs/2501.08142
作者: Jonathan Lyhs,Lars Hinneburg,Michael Fischer,Florian Ölsner,Stefan Milz,Jeremy Tschirner,Patrick Mäder
机构: Spleenlab GmbH; Ilmenau University of Technology (伊尔梅瑙理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern machine learning techniques have shown tremendous potential, especially for object detection on camera images. For this reason, they are also used to enable safety-critical automated processes such as autonomous drone flights. We present a study on object detection for Detect and Avoid, a safety critical function for drones that detects air traffic during automated flights for safety reasons. An ill-posed problem is the generation of good and especially large data sets, since detection itself is the corner case. Most models suffer from limited ground truth in raw data, \eg recorded air traffic or frontal flight with a small aircraft. It often leads to poor and critical detection rates. We overcome this problem by using inpainting methods to bootstrap the dataset such that it explicitly contains the corner cases of the raw data. We provide an overview of inpainting methods and generative models and present an example pipeline given a small annotated dataset. We validate our method by generating a high-resolution dataset, which we make publicly available and present it to an independent object detector that was fully trained on real data.
zh

[CV-29] Audio-visual Deepfake Detection With Local Temporal Inconsistencies ICASSP2025

【速读】：该论文旨在解决音频-视觉深度伪造（audio-visual deepfake）检测中的细粒度时间不一致性问题。为了解决这一问题，论文提出了两种关键策略：架构设计和数据合成。从架构设计角度，论文引入了一种时间距离图（temporal distance map）并结合注意力机制（attention mechanism），以捕捉音频和视觉模态之间的时间不一致性，同时减少无关时间子序列的影响。此外，论文还探索了新的伪伪造生成技术（pseudo-fake generation techniques），用于合成局部不一致性。通过在DFDC和FakeAVCeleb数据集上的评估，该方法在检测音频-视觉深度伪造方面表现出显著的有效性。

链接: https://arxiv.org/abs/2501.08137
作者: Marcella Astrid,Enjie Ghorbel,Djamila Aouada
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted in ICASSP 2025

点击查看摘要

Abstract:This paper proposes an audio-visual deepfake detection approach that aims to capture fine-grained temporal inconsistencies between audio and visual modalities. To achieve this, both architectural and data synthesis strategies are introduced. From an architectural perspective, a temporal distance map, coupled with an attention mechanism, is designed to capture these inconsistencies while minimizing the impact of irrelevant temporal subsequences. Moreover, we explore novel pseudo-fake generation techniques to synthesize local inconsistencies. Our approach is evaluated against state-of-the-art methods using the DFDC and FakeAVCeleb datasets, demonstrating its effectiveness in detecting audio-visual deepfakes.
zh

[CV-30] SAR Strikes Back: A New Hope for RSVQA

【速读】：该论文试图解决遥感视觉问答（Remote Sensing Visual Question Answering, RSVQA）任务中如何有效利用合成孔径雷达（Synthetic Aperture Radar, SAR）图像的问题。SAR图像能够捕捉场景的电磁信息，并且较少受到云层等大气条件的影响，因此在遥感图像分析中具有独特优势。然而，现有的RSVQA方法主要基于光学图像，尚未有方法能够从SAR图像中提取信息并回答问题。

论文的关键解决方案包括两个方面：首先，提出了两种不同的模型来引入SAR模态。第一种是端到端的方法，通过增加一个额外的SAR编码器来处理SAR图像；第二种是基于两阶段的框架，首先从SAR和可选的光学数据中提取相关信息，然后将这些信息转换为自然语言，供第二阶段的语言模型生成答案。研究发现，第二种方法在仅使用SAR图像时能够取得较好的效果。其次，论文还尝试了多种融合方法，将SAR和光学图像结合使用，发现决策级融合在提出的数据集上表现最佳。研究结果表明，SAR数据在与光学模态融合时，特别是在涉及特定土地覆盖类别（如水域）的问题上，能够提供额外的信息。

链接: https://arxiv.org/abs/2501.08131
作者: Lucrezia Tosato,Flora Weissgerber,Laurent Wendling,Sylvain Lobry
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 6 figures

点击查看摘要

Abstract:Remote sensing visual question answering (RSVQA) is a task that automatically extracts information from satellite images and processes a question to predict the answer from the images in textual form, helping with the interpretation of the image. While different methods have been proposed to extract information from optical images with different spectral bands and resolutions, no method has been proposed to answer questions from Synthetic Aperture Radar (SAR) images. SAR images capture electromagnetic information from the scene, and are less affected by atmospheric conditions, such as clouds. In this work, our objective is to introduce SAR in the RSVQA task, finding the best way to use this modality. In our research, we carry out a study on different pipelines for the task of RSVQA taking into account information from both SAR and optical data. To this purpose, we also present a dataset that allows for the introduction of SAR images in the RSVQA framework. We propose two different models to include the SAR modality. The first one is an end-to-end method in which we add an additional encoder for the SAR modality. In the second approach, we build on a two-stage framework. First, relevant information is extracted from SAR and, optionally, optical data. This information is then translated into natural language to be used in the second step which only relies on a language model to provide the answer. We find that the second pipeline allows us to obtain good results with SAR images alone. We then try various types of fusion methods to use SAR and optical images together, finding that a fusion at the decision level achieves the best results on the proposed dataset. We show that SAR data offers additional information when fused with the optical modality, particularly for questions related to specific land cover classes, such as water areas.
zh

[CV-31] Revisiting Birds Eye View Perception Models with Frozen Foundation Models: DINOv2 and Metric3Dv2

【速读】：该论文试图解决鸟瞰图（Birds Eye View, BEV）感知模型在训练数据有限的情况下如何有效提升性能的问题。传统数据集虽然提供了丰富的驾驶场景数据，但在某些情况下数据量仍然不足。论文提出通过集成大型基础模型（如DINOv2和Metric3Dv2）来减少对训练数据的依赖，并超越现有模型的性能。解决方案的关键在于：1）在Lift-Splat-Shoot架构中，使用冻结的DINOv2进行特征提取，并结合Metric3Dv2进行深度估计，从而在仅使用一半训练数据和迭代次数的情况下，显著提升了7.4 IoU的性能；2）在Simple-BEV架构中，创新性地将Metric3Dv2的深度信息作为伪激光雷达（PseudoLiDAR）点云替代传统激光雷达，实现了比仅使用相机的模型高出3 IoU的性能提升。

链接: https://arxiv.org/abs/2501.08118
作者: Seamie Hayes,Ganesh Sistu,Ciarán Eising
机构: Dept. of Electronic and Computer Engineering, University of Limerick(利默里克大学电子与计算机工程系); SFI CRT Foundations in Data Science, University of Limerick(利默里克大学数据科学基础SFI CRT); Data Driven Computer Engineering (D²iCE) Research Centre, University of Limerick(利默里克大学数据驱动计算机工程研究中心); Valeo Vision Systems(法雷奥视觉系统)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the Electronic Imaging - Autonomous Vehicles and Machines Connference 2025

点击查看摘要

Abstract:Birds Eye View perception models require extensive data to perform and generalize effectively. While traditional datasets often provide abundant driving scenes from diverse locations, this is not always the case. It is crucial to maximize the utility of the available training data. With the advent of large foundation models such as DINOv2 and Metric3Dv2, a pertinent question arises: can these models be integrated into existing model architectures to not only reduce the required training data but surpass the performance of current models? We choose two model architectures in the vehicle segmentation domain to alter: Lift-Splat-Shoot, and Simple-BEV. For Lift-Splat-Shoot, we explore the implementation of frozen DINOv2 for feature extraction and Metric3Dv2 for depth estimation, where we greatly exceed the baseline results by 7.4 IoU while utilizing only half the training data and iterations. Furthermore, we introduce an innovative application of Metric3Dv2’s depth information as a PseudoLiDAR point cloud incorporated into the Simple-BEV architecture, replacing traditional LiDAR. This integration results in a +3 IoU improvement compared to the Camera-only model.
zh

[CV-32] RoHan: Robust Hand Detection in Operation Room

【速读】：该论文旨在解决在手术室（OR）环境中进行手部检测（hand detection）的挑战，特别是在手套手部实例有限、手术室环境复杂（如不同的记录条件、多样的手套颜色和常见的遮挡）的情况下，现有的手部检测模型难以有效适应。为了解决这些问题，论文提出了“RoHan”方法，其关键解决方案包括两个主要阶段：首先，通过“人工手套”（Artificial Gloves）技术对公开的手部数据集进行数据增强，生成戴手套的手部合成图像；其次，采用半监督域适应（semi-supervised domain adaptation）技术，通过迭代预测优化和高效帧过滤，提升模型在真实手术室环境中的检测性能。该方法显著减少了对大量标注和模型训练的需求，为手部检测技术在医疗环境中的实际应用铺平了道路。

链接: https://arxiv.org/abs/2501.08115
作者: Roi Papo,Sapir Gershov,Tom Friedman,Itay Or,Gil Bolotin,Shlomi Laufer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages

点击查看摘要

Abstract:Hand-specific localization has garnered significant interest within the computer vision community. Although there are numerous datasets with hand annotations from various angles and settings, domain transfer techniques frequently struggle in surgical environments. This is mainly due to the limited availability of gloved hand instances and the unique challenges of operating rooms (ORs). Thus, hand-detection models tailored to OR settings require extensive training and expensive annotation processes. To overcome these challenges, we present “RoHan” - a novel approach for robust hand detection in the OR, leveraging advanced semi-supervised domain adaptation techniques to tackle the challenges of varying recording conditions, diverse glove colors, and occlusions common in surgical settings. Our methodology encompasses two main stages: (1) data augmentation strategy that utilizes “Artificial Gloves,” a method for augmenting publicly available hand datasets with synthetic images of hands-wearing gloves; (2) semi-supervised domain adaptation pipeline that improves detection performance in real-world OR settings through iterative prediction refinement and efficient frame filtering. We evaluate our method using two datasets: simulated enterotomy repair and saphenous vein graft harvesting. “RoHan” substantially reduces the need for extensive labeling and model training, paving the way for the practical implementation of hand detection technologies in medical settings.
zh

[CV-33] Change Captioning in Remote Sensing: Evolution to SAT-Cap – A Single-Stage Transformer Approach

【速读】：该论文旨在解决多时相遥感数据变化描述（change captioning）中的两个关键问题：一是由于多阶段融合策略导致的高计算需求，二是由于从单幅图像中提取的语义信息有限，导致对象描述不够详细。为解决这些问题，论文提出了基于Transformer模型的SAT-Cap方法，采用单阶段特征融合策略。其核心创新点包括：1) 空间-通道注意力编码器（Spatial-Channel Attention Encoder），通过联合建模空间和通道信息，显著提升了从多时相遥感图像中提取语义信息的能力；2) 差异引导融合模块（Difference-Guided Fusion module），使用简单的余弦相似度进行信息融合，降低了模型架构的复杂性。实验结果表明，SAT-Cap在LEVIR-CC和DUBAI-CC数据集上的CIDEr得分分别为140.23%和97.74%，超越了当前最先进的方法。

链接: https://arxiv.org/abs/2501.08114
作者: Yuduo Wang,Weikang Yu,Pedram Ghamisi
机构: Helmholtz-Zentrum Dresden-Rossendorf (HZDR)(亥姆霍兹德累斯顿罗森多夫研究中心); Humboldt-Universität zu Berlin(柏林洪堡大学); Technical University of Munich(慕尼黑工业大学); Lancaster University(兰卡斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth’s dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multistage fusion strategy, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To solve these challenges, we propose SAT-Cap based on the transformers model with a single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Compared to typical models that require multi-stage fusion in transformer encoder and fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in Spatial-Channel Attention Encoder, our approach significantly enhances the model’s ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, achieving CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.
zh

[CV-34] EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision

【速读】：该论文试图解决在遥感数据上进行自监督学习（self-supervised learning）的挑战，以提升深度学习在地球监测任务中的应用效果。解决方案的关键在于提出了一个名为EarthView的综合数据集，该数据集涵盖了15万亿像素的全球遥感数据，结合了来自多种来源的影像，包括NEON、Sentinel以及Satellogic提供的1米空间分辨率数据。此外，论文还引入了EarthMAE，一种专门为处理遥感数据设计的掩码自编码器（Masked Autoencoder），该模型通过自监督学习有效处理多种数据模态，如高光谱、多光谱、地形数据、分割图和时序结构。通过预训练Satellogic数据，EarthMAE在下游任务中表现出性能提升，展示了这一创新组合在深度学习地球监测领域的进展。

链接: https://arxiv.org/abs/2501.08111
作者: Diego Velazquez,Pau Rodriguez López,Sergio Alonso,Josep M. Gonfaus,Jordi Gonzalez,Gerardo Richarte,Javier Marin,Yoshua Bengio,Alexandre Lacoste
机构: Computer Vision Center; Satellogic; Mila, Université de Montréal; ServiceNow Research; Apple Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2nd Workshop on Computer Vision for Earth Observation (CV4EO) Applications

点击查看摘要

Abstract:This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Our dataset provides a wide spectrum of image data with varying resolutions, harnessed from different sensors and organized coherently into an accessible HuggingFace dataset in parquet format. This data spans five years, from 2017 to 2022. Accompanying the dataset, we introduce EarthMAE, a tailored Masked Autoencoder, developed to tackle the distinct challenges of remote sensing data. Trained in a self-supervised fashion, EarthMAE effectively processes different data modalities such as hyperspectral, multispectral, topographical data, segmentation maps, and temporal structure. This model helps us show that pre-training on Satellogic data improves performance on downstream tasks. While there is still a gap to fill in MAE for heterogeneous data, we regard this innovative combination of an expansive, diverse dataset and a versatile model adapted for self-supervised learning as a stride forward in deep learning for Earth monitoring.
zh

[CV-35] Guiding the classification of hepatocellular carcinoma on 3D CT-scans using deep and handcrafted radiological features

【速读】：该论文试图解决肝细胞癌（Hepatocellular Carcinoma, HCC）在CT图像上的自动诊断问题，以减少放射科医生在解读CT扫描时的个体差异（inter-variability）。当前，HCC的诊断金标准是肝活检（liver biopsy），但在临床实践中，放射科医生通常根据LI-RADS（Liver Imaging Reporting and Data System）标准通过视觉解读CT图像进行诊断。然而，标准的深度学习方法在挑战性数据库上难以准确预测HCC。为此，论文提出了一种基于LI-RADS系统的两步法自动诊断方法，显著提升了预测性能，AUC（Area Under Curve）相较于不同架构的深度学习基线提高了6到18个百分点。该方法在临床验证中表现优异，结果优于非专家放射科医生，并与专家水平相当。解决方案的关键在于借鉴LI-RADS的标准化流程，通过两步法优化深度学习模型的性能，从而实现对HCC的准确预测。

链接: https://arxiv.org/abs/2501.08097
作者: E. Sarfati,A. Bône,M-M. Rohé,C. Aubé,M. Ronot,P. Gori,I. Bloch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE ISBI 2025

点击查看摘要

Abstract:Hepatocellular carcinoma is the most spread primary liver cancer across the world ( \sim 80% of the liver tumors). The gold standard for HCC diagnosis is liver biopsy. However, in the clinical routine, expert radiologists provide a visual diagnosis by interpreting hepatic CT-scans according to a standardized protocol, the LI-RADS, which uses five radiological criteria with an associated decision tree. In this paper, we propose an automatic approach to predict histology-proven HCC from CT images in order to reduce radiologists’ inter-variability. We first show that standard deep learning methods fail to accurately predict HCC from CT-scans on a challenging database, and propose a two-step approach inspired by the LI-RADS system to improve the performance. We achieve improvements from 6 to 18 points of AUC with respect to deep learning baselines trained with different architectures. We also provide clinical validation of our method, achieving results that outperform non-expert radiologists and are on par with expert ones.
zh

[CV-36] Agent Pose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation

【速读】：该论文试图解决在人体姿态估计（human pose estimation）中，由于教师模型（teacher model）和学生模型（student model）之间的容量差距（capacity gap）导致的性能下降问题。现有的姿态蒸馏（pose distillation）方法主要关注教师知识的传递，但往往忽略了容量差距带来的负面影响。为解决这一问题，论文提出了AgentPose，一种新颖的姿态蒸馏方法，通过引入特征代理（feature agent）来建模教师特征（teacher features）的分布，并逐步将学生特征（student features）的分布与教师特征的分布对齐，从而有效克服容量差距，提升知识传递的效果。该方法在COCO数据集上的实验验证了其在高容量差距场景下的有效性。

链接: https://arxiv.org/abs/2501.08088
作者: Feng Zhang,Jinwei Liu,Xiatian Zhu,Lei Chen
机构: Nanjing University of Posts and Telecommunications, Nanjing, China(南京邮电大学); Surrey Institute for People-Centred Artificial Intelligence, University of Surrey Guildford, United Kingdom(萨里大学以人为本人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 1 figures

点击查看摘要

Abstract:Pose distillation is widely adopted to reduce model size in human pose estimation. However, existing methods primarily emphasize the transfer of teacher knowledge while often neglecting the performance degradation resulted from the curse of capacity gap between teacher and student. To address this issue, we propose AgentPose, a novel pose distillation method that integrates a feature agent to model the distribution of teacher features and progressively aligns the distribution of student features with that of the teacher feature, effectively overcoming the capacity gap and enhancing the ability of knowledge transfer. Our comprehensive experiments conducted on the COCO dataset substantiate the effectiveness of our method in knowledge transfer, particularly in scenarios with a high capacity gap.
zh

[CV-37] Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving

【速读】：该论文试图解决深度神经网络（DNNs）在复杂开放世界领域（如自动驾驶）中面临的分布偏移问题，特别是针对训练数据分布之外（OOD）场景的识别。由于无法保证对未知新物体（语义偏移）或光照条件（协变量偏移）等情况的绝对鲁棒性，因此需要可靠的运行时监控机制来检测这些OOD场景。现有的OOD分类方法在复杂领域如自动驾驶中未经充分测试，且通常只能检测特定类型的偏移，甚至需要OOD样本的监督。为解决这一问题，论文提出了一种基于无监督和模型无关的框架，通过建立训练数据特征分布的完整模型，并利用其在新数据点的密度作为分布内（ID）评分，从而统一检测各种类型的偏移。具体实现上，论文结合了视觉基础模型（VFM）作为特征提取器，并采用四种不同的密度建模技术。实验表明，VFM特征编码在性能上优于特定偏移的OOD监控方法，且复杂架构优于更大的潜在空间维度。该方法尽管是模型无关的，但能有效识别下游任务中高风险的样本，表明VFM在复杂视觉任务中实现模型无关、无监督且可靠的安全监控具有潜力。

链接: https://arxiv.org/abs/2501.08083
作者: Nert Keser,Halil Ibrahim Orhan,Niki Amini-Naieni,Gesina Schwalbe,Alois Knoll,Matthias Rottmann
机构: Continental AG; Technical University of Munich (慕尼黑工业大学); University of Lübeck (吕贝克大学); University of Oxford (牛津大学); University of Wuppertal (伍珀塔尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Absolute robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised, and model-agnostic method that unifies detection of all kinds of shifts: Find a full model of the training data’s feature distribution, to then use its density at new points as in-distribution (ID) score. To implement this, we propose to combine the newly available Vision Foundation Models (VFM) as feature extractors with one of four alternative density modeling techniques. In an extensive benchmark of 4 VFMs against 20 baselines, we show the superior performance of VFM feature encodings compared to shift-specific OOD monitors. Additionally, we find that sophisticated architectures outperform larger latent space dimensionality; and our method identifies samples with higher risk of errors on downstream tasks, despite being model-agnostic. This suggests that VFMs are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
zh

[CV-38] Evaluating Human Perception of Novel View Synthesis: Subjective Quality Assessment of Gaussian Splatting and NeRF in Dynamic Scenes

【速读】：该论文旨在解决新视角合成（Novel View Synthesis, NVS）技术中的质量评估问题，特别是针对高斯泼溅（Gaussian Splatting, GS）和神经辐射场（Neural Radiance Fields, NeRF）这两种前沿技术。尽管已有研究探索了NVS技术的主观质量评估，但在方法选择、场景覆盖和评估方法等方面仍存在挑战。为此，论文通过两项主观实验，重点评估了动态和真实场景中基于GS和NeRF的NVS方法的质量。实验涵盖了360°、正面和单视角视频，并提供了更丰富和大量的真实场景数据。此外，论文首次探讨了NVS方法在动态场景中移动物体的影响。通过这两类主观实验，研究从人类感知角度全面理解了不同视角路径的影响，并为未来开发全参考和无参考质量指标奠定了基础。同时，论文还建立了多种最先进客观指标的基准测试，揭示了现有方法在准确捕捉主观质量方面的不足，为NVS方法的进一步改进提供了重要见解。

链接: https://arxiv.org/abs/2501.08072
作者: Yuhang Zhang,Joshua Maraval,Zhengyu Zhang,Nicolas Ramin,Shishun Tian,Lu Zhang
机构: Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University (广东智能信息处理重点实验室，深圳大学电子与信息工程学院); IRT b<>com (IRT b<>com); School of Electronics and Communication Engineering, Guangzhou University (广州大学电子与通信工程学院); Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164 (雷恩大学, INSA雷恩, 法国国家科学研究中心, IETR - UMR 6164)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Gaussian Splatting (GS) and Neural Radiance Fields (NeRF) are two groundbreaking technologies that have revolutionized the field of Novel View Synthesis (NVS), enabling immersive photorealistic rendering and user experiences by synthesizing multiple viewpoints from a set of images of sparse views. The potential applications of NVS, such as high-quality virtual and augmented reality, detailed 3D modeling, and realistic medical organ imaging, underscore the importance of quality assessment of NVS methods from the perspective of human perception. Although some previous studies have explored subjective quality assessments for NVS technology, they still face several challenges, especially in NVS methods selection, scenario coverage, and evaluation methodology. To address these challenges, we conducted two subjective experiments for the quality assessment of NVS technologies containing both GS-based and NeRF-based methods, focusing on dynamic and real-world scenes. This study covers 360°, front-facing, and single-viewpoint videos while providing a richer and greater number of real scenes. Meanwhile, it’s the first time to explore the impact of NVS methods in dynamic scenes with moving objects. The two types of subjective experiments help to fully comprehend the influences of different viewing paths from a human perception perspective and pave the way for future development of full-reference and no-reference quality metrics. In addition, we established a comprehensive benchmark of various state-of-the-art objective metrics on the proposed database, highlighting that existing methods still struggle to accurately capture subjective quality. The results give us some insights into the limitations of existing NVS methods and may promote the development of new NVS methods.
zh

[CV-39] Skeleton and Font Generation Network for Zero-shot Chinese Character Generation

【速读】：该论文试图解决自动生成中文字体（Automatic Font Generation）中的挑战，特别是由于中文字符数量庞大且结构复杂，导致在生成与训练样本相似但不同的字符时，现有方法容易产生结构偏差，进而导致字符的细微变化被纠正或忽略。为解决这一问题，论文提出了一种新颖的骨架与字体生成网络（Skeleton and Font Generation Network, SFGN），其核心包括骨架生成器（skeleton builder）和字体生成器（font generator）。骨架生成器通过低资源文本输入合成内容特征，使字体生成不再依赖于内容图像输入；而字体生成器则在部首级别对齐内容和风格特征，这一视角在字体生成领域是全新的。此外，论文还通过实验验证了生成错别字在中文纠错任务中的教学价值，进一步证明了该方法的有效性。

链接: https://arxiv.org/abs/2501.08062
作者: Mobai Xue,Jun Du,Zhenrong Zhang,Jiefeng Ma,Qikai Chang,Pengfei Hu,Jianshu Zhang,Yu Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 10 figures

点击查看摘要

Abstract:Automatic font generation remains a challenging research issue, primarily due to the vast number of Chinese characters, each with unique and intricate structures. Our investigation of previous studies reveals inherent bias capable of causing structural changes in characters. Specifically, when generating a Chinese character similar to, but different from, those in the training samples, the bias is prone to either correcting or ignoring these subtle variations. To address this concern, we propose a novel Skeleton and Font Generation Network (SFGN) to achieve a more robust Chinese character font generation. Our approach includes a skeleton builder and font generator. The skeleton builder synthesizes content features using low-resource text input, enabling our technique to realize font generation independently of content image inputs. Unlike previous font generation methods that treat font style as a global embedding, we introduce a font generator to align content and style features on the radical level, which is a brand-new perspective for font generation. Except for common characters, we also conduct experiments on misspelled characters, a substantial portion of which slightly differs from the common ones. Our approach visually demonstrates the efficacy of generated images and outperforms current state-of-the-art font generation methods. Moreover, we believe that misspelled character generation have significant pedagogical implications and verify such supposition through experiments. We used generated misspelled characters as data augmentation in Chinese character error correction tasks, simulating the scenario where students learn handwritten Chinese characters with the help of misspelled characters. The significantly improved performance of error correction tasks demonstrates the effectiveness of our proposed approach and the value of misspelled character generation.
zh

[CV-40] Self-Attentive Spatio-Temporal Calibration for Precise Intermediate Layer Matching in ANN-to-SNN Distillation

【速读】：该论文试图解决脉冲神经网络（SNNs）在低功耗计算中表现出的精度不足问题，尤其是在与人工神经网络（ANNs）相比时。现有的ANN-to-SNN知识蒸馏方法要么仅关注标签信息，忽略了中间层的特征，要么采用逐层方法，忽视了空间和时间上的语义不一致性，导致性能下降。为解决这些问题，论文提出了一种名为自注意力时空校准（SASTC）的新方法。SASTC利用自注意力机制在空间和时间上识别ANN和SNN之间的语义对齐层对，从而自主传递相关语义信息。实验结果表明，SASTC在多个数据集上显著优于现有方法，首次在CIFAR-10和CIFAR-100上实现了SNNs超越ANNs的精度，展示了SNNs在低功耗计算中的巨大潜力。

链接: https://arxiv.org/abs/2501.08049
作者: Di Hong,Yueming Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) are promising for low-power computation due to their event-driven mechanism but often suffer from lower accuracy compared to Artificial Neural Networks (ANNs). ANN-to-SNN knowledge distillation can improve SNN performance, but previous methods either focus solely on label information, missing valuable intermediate layer features, or use a layer-wise approach that neglects spatial and temporal semantic inconsistencies, leading to performance this http URL address these limitations, we propose a novel method called self-attentive spatio-temporal calibration (SASTC). SASTC uses self-attention to identify semantically aligned layer pairs between ANN and SNN, both spatially and temporally. This enables the autonomous transfer of relevant semantic information. Extensive experiments show that SASTC outperforms existing methods, effectively solving the mismatching problem. Superior accuracy results include 95.12% on CIFAR-10, 79.40% on CIFAR-100 with 2 time steps, and 68.69% on ImageNet with 4 time steps for static datasets, and 97.92% on DVS-Gesture and 83.60% on DVS-CIFAR10 for neuromorphic datasets. This marks the first time SNNs have outperformed ANNs on both CIFAR-10 and CIFAR-100, shedding the new light on the potential applications of SNNs.
zh

[CV-41] Exploring visual language models as a powerful tool in the diagnosis of Ewing Sarcoma

【速读】：该论文旨在解决尤文肉瘤（Ewing’s sarcoma, ES）在数字化组织微阵列中的自动化诊断问题，特别是如何从形态学相似的其他软组织或骨肉瘤中准确区分ES。研究的关键解决方案在于比较了两种预训练策略：基于视觉-语言监督（Vision-language supervision, VLS）的方法和完全监督的ImageNet预训练方法。研究结果表明，使用VLS结合领域内数据集显著提高了诊断准确性，同时大幅减少了可训练参数数量和计算成本。这一方法不仅提升了分类预测的准确性，还为病理图像的自动化分析提供了更高效的解决方案。

链接: https://arxiv.org/abs/2501.08042
作者: Alvaro Pastor-Naranjo,Pablo Meseguer,Rocío del Amor,Jose Antonio Lopez-Guerrero,Samuel Navarro,Katia Scotlandi,Antonio Llombart-Bosch,Isidro Machado,Valery Naranjo
机构: 1: Universidad de Castilla-La Mancha (卡斯蒂利亚-拉曼查大学); 2: Instituto de Investigación Sanitaria de Castilla-La Mancha (卡斯蒂利亚-拉曼查卫生研究所); 3: Hospital General Universitario de Albacete (阿尔巴塞特大学总医院); 4: Instituto de Investigación Sanitaria La Fe (拉费卫生研究所); 5: Universidad de Valencia (瓦伦西亚大学); 6: Istituto Ortopedico Rizzoli (里佐利骨科研究所); 7: CIBERONC (西班牙肿瘤研究网络); 8: Instituto de Investigación Sanitaria INCLIVA (INCLIVA卫生研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, 2 tables. Oral presentation at KES-InMed 2024 held in Madeira, Portugal

点击查看摘要

Abstract:Ewing’s sarcoma (ES), characterized by a high density of small round blue cells without structural organization, presents a significant health concern, particularly among adolescents aged 10 to 19. Artificial intelligence-based systems for automated analysis of histopathological images are promising to contribute to an accurate diagnosis of ES. In this context, this study explores the feature extraction ability of different pre-training strategies for distinguishing ES from other soft tissue or bone sarcomas with similar morphology in digitized tissue microarrays for the first time, as far as we know. Vision-language supervision (VLS) is compared to fully-supervised ImageNet pre-training within a multiple instance learning paradigm. Our findings indicate a substantial improvement in diagnostic accuracy with the adaption of VLS using an in-domain dataset. Notably, these models not only enhance the accuracy of predicted classes but also drastically reduce the number of trainable parameters and computational costs.
zh

[CV-42] Robust Low-Light Human Pose Estimation through Illumination-Texture Modulation

【速读】：该论文试图解决在极低光照条件下，由于低可见性和高ISO噪声导致的图像中关键视觉细节模糊，进而影响人体姿态估计（human pose estimation）准确性的问题。现有方法依赖于像素级增强，往往会损害图像的语义信息，且无法有效处理极端低光条件下的特征学习。论文提出了一种基于频率的框架，采用“分而治之”的原则，通过动态光照校正（dynamic illumination correction）处理低频分量，低秩去噪（low-rank denoising）处理高频分量，从而有针对性地增强与任务相关的语义和纹理信息。这种针对性的增强方法显著提高了姿态估计的性能，并在多种挑战性低光场景中优于现有方法。

链接: https://arxiv.org/abs/2501.08038
作者: Feng Zhang,Ze Li,Xiatian Zhu,Lei Chen
机构: Nanjing University of Posts and Telecommunications, Nanjing, China(南京邮电大学); Surrey Institute for People-Centred Artificial Intelligence, University of Surrey Guildford, United Kingdom(萨里大学以人为本人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures, conference

点击查看摘要

Abstract:As critical visual details become obscured, the low visibility and high ISO noise in extremely low-light images pose a significant challenge to human pose estimation. Current methods fail to provide high-quality representations due to reliance on pixel-level enhancements that compromise semantics and the inability to effectively handle extreme low-light conditions for robust feature learning. In this work, we propose a frequency-based framework for low-light human pose estimation, rooted in the “divide-and-conquer” principle. Instead of uniformly enhancing the entire image, our method focuses on task-relevant information. By applying dynamic illumination correction to the low-frequency components and low-rank denoising to the high-frequency components, we effectively enhance both the semantic and texture information essential for accurate pose estimation. As a result, this targeted enhancement method results in robust, high-quality representations, significantly improving pose estimation performance. Extensive experiments demonstrating its superiority over state-of-the-art methods in various challenging low-light scenarios.
zh

[CV-43] DisCoPatch: Batch Statistics Are All You Need For OOD Detection But Only If You Can Trust Them

【速读】：该论文旨在解决机器学习中的协变量偏移（covariate shift）问题，即数据分布中的细微变化可能导致模型性能下降。论文提出了一种名为DisCoPatch的无监督对抗变分自编码器（Adversarial Variational Autoencoder, VAE）框架，通过利用批量归一化（Batch Normalization, BN）在对抗判别器中形成的独特批量统计特性来检测这些细微的分布变化。具体而言，DisCoPatch在推理过程中使用来自同一图像的图像块（patches）组成批次，确保数据分布的一致性，从而使模型能够依赖批量统计特性。该框架利用VAE生成的次优输出（生成和重建样本）作为负样本来训练判别器，从而增强其区分分布内样本和协变量偏移的能力。通过收紧这一边界，DisCoPatch在公开的OOD检测基准测试中取得了最先进的性能，特别是在ImageNet-1K(-C)数据集上达到了95.5%的AUROC，并在Near-OOD基准测试中超越了所有现有方法。此外，该模型具有25MB的紧凑模型大小，显著降低了延迟，为实际应用中的OOD检测提供了高效且实用的解决方案。

链接: https://arxiv.org/abs/2501.08005
作者: Francisco Caetano,Christiaan Viviers,Luis A. Zavala-Mondragón,Peter H. N. de With,Fons van der Sommen
机构: Eindhoven University of Technology, The Netherlands (埃因霍温理工大学, 荷兰)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE’s suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C) but also outperforms all prior methods on public Near-OOD (95.0%) benchmarks. With a compact model size of 25MB, it achieves high OOD detection performance at notably lower latency than existing methods, making it an efficient and practical solution for real-world OOD detection applications. The code will be made publicly available
zh

[CV-44] Maximizing Uncertainty for Federated learning via Bayesian Optimisation-based Model Poisoning

【速读】：该论文试图解决在联邦学习（Federated Learning, FL）中，恶意用户通过系统性地创建恶意模型参数来破坏模型的预测和生成能力，从而导致模型输出不确定性增加的问题。论文提出了一种名为Delphi的新型模型投毒攻击方法，旨在通过最大化全局模型输出的不确定性来展示恶意行为。解决方案的关键在于利用本地模型第一隐藏层的模型参数与不确定性之间的关系，采用贝叶斯优化（Bayesian Optimisation, Delphi-BO）和最小二乘信任域（Least Squares Trust Region, Delphi-LSTR）两种优化方法，寻找最优的投毒模型参数。通过KL散度（KL Divergence）量化不确定性，最小化预测概率分布与模型输出的不确定分布之间的距离，并建立了攻击有效性的数学证明。数值结果表明，Delphi-BO比Delphi-LSTR诱导了更高的不确定性，突显了联邦学习系统对模型投毒攻击的脆弱性。

链接: https://arxiv.org/abs/2501.08002
作者: Marios Aristodemou,Xiaolan Liu,Yuan Wang,Konstantinos G. Kyriakopoulos,Sangarapillai Lambotharan,Qingsong Wei
机构: Wolfson School of Mechanical, Electrical and Manufacturing Engineering, Loughborough University (拉夫堡大学); Smart Internet Lab, University of Bristol (布里斯托大学); Institute for Digital Technologies, Loughborough University London (拉夫堡大学伦敦校区); Department of Computing and Intelligence, Institute of High Performance Computing, A*STAR (新加坡科技研究局高性能计算研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages

点击查看摘要

Abstract:As we transition from Narrow Artificial Intelligence towards Artificial Super Intelligence, users are increasingly concerned about their privacy and the trustworthiness of machine learning (ML) technology. A common denominator for the metrics of trustworthiness is the quantification of uncertainty inherent in DL algorithms, and specifically in the model parameters, input data, and model predictions. One of the common approaches to address privacy-related issues in DL is to adopt distributed learning such as federated learning (FL), where private raw data is not shared among users. Despite the privacy-preserving mechanisms in FL, it still faces challenges in trustworthiness. Specifically, the malicious users, during training, can systematically create malicious model parameters to compromise the models predictive and generative capabilities, resulting in high uncertainty about their reliability. To demonstrate malicious behaviour, we propose a novel model poisoning attack method named Delphi which aims to maximise the uncertainty of the global model output. We achieve this by taking advantage of the relationship between the uncertainty and the model parameters of the first hidden layer of the local model. Delphi employs two types of optimisation , Bayesian Optimisation and Least Squares Trust Region, to search for the optimal poisoned model parameters, named as Delphi-BO and Delphi-LSTR. We quantify the uncertainty using the KL Divergence to minimise the distance of the predictive probability distribution towards an uncertain distribution of model output. Furthermore, we establish a mathematical proof for the attack effectiveness demonstrated in FL. Numerical results demonstrate that Delphi-BO induces a higher amount of uncertainty than Delphi-LSTR highlighting vulnerability of FL systems to model poisoning attacks.
zh

[CV-45] Combining imaging and shape features for prediction tasks of Alzheimers disease classification and brain age regression

【速读】：该论文旨在解决脑龄预测（brain age prediction）和阿尔茨海默病分类（Alzheimer’s disease classification）这两个临床相关任务。为了解决这些问题，论文提出了一种结合MRI成像特征和形状特征的模型。该模型的关键在于融合了从ResNet提取的图像嵌入（image embeddings）和从定制图神经网络（graph neural network）提取的形状嵌入（shape embeddings）。形状嵌入来源于15个脑结构的表面网格（surface meshes），能够捕捉详细的几何信息。通过将这些形状特征与T1加权图像的外观特征相结合，论文在两项任务中均观察到预测性能的提升，尤其在分类任务中表现显著。模型在CamCAN、IXI和OASIS3等公开数据集上进行了评估，验证了融合成像和形状特征在脑分析中的有效性。

链接: https://arxiv.org/abs/2501.07994
作者: Nairouz Shehata,Carolina Piçarra,Ben Glocker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We investigate combining imaging and shape features extracted from MRI for the clinically relevant tasks of brain age prediction and Alzheimer’s disease classification. Our proposed model fuses ResNet-extracted image embeddings with shape embeddings from a bespoke graph neural network. The shape embeddings are derived from surface meshes of 15 brain structures, capturing detailed geometric information. Combined with the appearance features from T1-weighted images, we observe improvements in the prediction performance on both tasks, with substantial gains for classification. We evaluate the model using public datasets, including CamCAN, IXI, and OASIS3, demonstrating the effectiveness of fusing imaging and shape features for brain analysis.
zh

[CV-46] GAC-Net_Geometric and attention-based Network for Depth Completion

【速读】：该论文试图解决自动驾驶领域中深度补全（depth completion）任务中的关键问题，即如何将稀疏的LiDAR深度测量数据通过图像引导转化为高质量的密集深度图（dense depth maps）。现有方法通常将深度图视为彩色图像的附加通道，或直接在稀疏数据上进行卷积操作，未能充分利用深度图中的3D几何信息，尤其在复杂边界和稀疏区域表现有限。为解决这些问题，论文提出了一种结合通道注意力机制（channel attention mechanism）和3D全局特征感知（3D global feature perception）的深度补全网络（CGA-Net）。其关键解决方案包括：1）利用PointNet++从稀疏深度图中提取全局3D几何特征，增强低线数LiDAR数据的场景感知能力；2）设计基于通道注意力的多模态特征融合模块，有效整合稀疏深度、RGB图像和3D几何特征；3）结合残差学习和CSPN++优化深度细化阶段，进一步提升边缘区域和复杂场景的补全质量。实验结果表明，CGA-Net在KITTI深度补全数据集上显著提升了密集深度图的预测精度，达到了新的最优性能（SOTA），并在稀疏和复杂场景中表现出强鲁棒性。

链接: https://arxiv.org/abs/2501.07988
作者: Kuang Zhu,Xingli Gan,Min Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13pages,4 figures, 2 tables

点击查看摘要

Abstract:Depth completion is a key task in autonomous driving, aiming to complete sparse LiDAR depth measurements into high-quality dense depth maps through image guidance. However, existing methods usually treat depth maps as an additional channel of color images, or directly perform convolution on sparse data, failing to fully exploit the 3D geometric information in depth maps, especially with limited performance in complex boundaries and sparse areas. To address these issues, this paper proposes a depth completion network combining channel attention mechanism and 3D global feature perception (CGA-Net). The main innovations include: 1) Utilizing PointNet++ to extract global 3D geometric features from sparse depth maps, enhancing the scene perception ability of low-line LiDAR data; 2) Designing a channel-attention-based multimodal feature fusion module to efficiently integrate sparse depth, RGB images, and 3D geometric features; 3) Combining residual learning with CSPN++ to optimize the depth refinement stage, further improving the completion quality in edge areas and complex scenes. Experiments on the KITTI depth completion dataset show that CGA-Net can significantly improve the prediction accuracy of dense depth maps, achieving a new state-of-the-art (SOTA), and demonstrating strong robustness to sparse and complex scenes.
zh

[CV-47] hreshold Attention Network for Semantic Segmentation of Remote Sensing Images

【速读】：该论文旨在解决遥感图像语义分割（Semantic Segmentation）中的两个关键问题：一是自注意力机制（Self-Attention, SA）在捕捉长距离像素依赖关系时带来的计算复杂度指数级增长问题；二是自注意力机制引入的冗余信息对特征表示的负面影响。为了解决这些问题，论文提出了一种新颖的阈值注意力机制（Threshold Attention Mechanism, TAM），该机制通过减少计算量并更好地建模特征图中不同区域之间的相关性，从而提高了分割效果。基于TAM，论文进一步提出了阈值注意力网络（Threshold Attention Network, TANet），该网络包括用于浅层特征全局增强的注意力特征增强模块（Attentional Feature Enhancement Module, AFEM）和用于深层特征多尺度信息获取的阈值注意力金字塔池化模块（Threshold Attention Pyramid Pooling Module, TAPP）。通过在ISPRS Vaihingen和Potsdam数据集上的广泛实验，验证了TANet的有效性和优越性。

链接: https://arxiv.org/abs/2501.07984
作者: Wei Long,Yongjun Zhang,Zhongwei Cui,Yujie Xu,Xuexue Zhang
机构: State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University(贵州大学); School of Mathematics and Big Data, Guizhou Education University(贵州师范学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic segmentation of remote sensing images is essential for various applications, including vegetation monitoring, disaster management, and urban planning. Previous studies have demonstrated that the self-attention mechanism (SA) is an effective approach for designing segmentation networks that can capture long-range pixel dependencies. SA enables the network to model the global dependencies between the input features, resulting in improved segmentation outcomes. However, the high density of attentional feature maps used in this mechanism causes exponential increases in computational complexity. Additionally, it introduces redundant information that negatively impacts the feature representation. Inspired by traditional threshold segmentation algorithms, we propose a novel threshold attention mechanism (TAM). This mechanism significantly reduces computational effort while also better modeling the correlation between different regions of the feature map. Based on TAM, we present a threshold attention network (TANet) for semantic segmentation. TANet consists of an attentional feature enhancement module (AFEM) for global feature enhancement of shallow features and a threshold attention pyramid pooling module (TAPP) for acquiring feature information at different scales for deep features. We have conducted extensive experiments on the ISPRS Vaihingen and Potsdam datasets. The results demonstrate the validity and superiority of our proposed TANet compared to the most state-of-the-art models.
zh

[CV-48] V-Trans4Style: Visual Transition Recommendation for Video Production Style Adaptation

【速读】：该论文旨在解决动态视频内容编辑中的风格适应问题，特别是如何将视频适配到不同的制作风格（如纪录片、戏剧、电影或特定YouTube频道的视频制作技术）。解决方案的关键在于提出了V-Trans4Style算法，该算法采用了一种自下而上的方法，通过基于Transformer的编码器-解码器网络（transformer-based encoder-decoder network）来学习推荐时间一致且视觉上无缝的视觉过渡序列。此外，算法引入了一个风格条件模块（style conditioning module），通过激活最大化（activation maximization）迭代调整从解码器获得的视觉过渡，从而更好地捕捉目标视频制作风格的特征。实验结果表明，该算法在AutoTransition++数据集上显著优于现有的过渡推荐方法，并在相似性度量上平均提升了12%的风格捕捉效果。

链接: https://arxiv.org/abs/2501.07983
作者: Pooja Guhan,Tsung-Wei Huang,Guan-Ming Su,Subhadra Gopalakrishnan,Dinesh Manocha
机构: University of Maryland, College Park MD 20740, USA (马里兰大学帕克分校); Dolby Laboratories, Sunnyvale CA 94085, USA (杜比实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce V-Trans4Style, an innovative algorithm tailored for dynamic video content editing needs. It is designed to adapt videos to different production styles like documentaries, dramas, feature films, or a specific YouTube channel’s video-making technique. Our algorithm recommends optimal visual transitions to help achieve this flexibility using a more bottom-up approach. We first employ a transformer-based encoder-decoder network to learn recommending temporally consistent and visually seamless sequences of visual transitions using only the input videos. We then introduce a style conditioning module that leverages this model to iteratively adjust the visual transitions obtained from the decoder through activation maximization. We demonstrate the efficacy of our method through experiments conducted on our newly introduced AutoTransition++ dataset. It is a 6k video version of AutoTransition Dataset that additionally categorizes its videos into different production style categories. Our encoder-decoder model outperforms the state-of-the-art transition recommendation method, achieving improvements of 10% to 80% in Recall@K and mean rank values over baseline. Our style conditioning module results in visual transitions that improve the capture of the desired video production style characteristics by an average of around 12% in comparison to other methods when measured with similarity metrics. We hope that our work serves as a foundation for exploring and understanding video production styles further.
zh

[CV-49] Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness

【速读】：该论文旨在解决视频多模态大语言模型（MLLMs）在描述视频中面部表情时面临的两大挑战：一是缺乏足够的数据集和基准测试，二是视频MLLMs的视觉标记容量有限。为解决这些问题，论文提出了一个专门用于动态面部表情描述的指令跟随数据集，包含5,033个高质量视频片段和超过700,000个手动标注的标记，旨在提升视频MLLMs捕捉细微面部表情的能力。此外，论文提出了FaceTrack-MM模型，该模型通过有限的标记数量编码主要角色的面部信息，能够在复杂的多人场景中有效跟踪面部并聚焦于主要角色的表情。论文还引入了一种结合事件提取、关系分类和最长公共子序列（LCS）算法的新评估指标，用于评估生成文本的内容一致性和时间序列一致性，并提出了FEC-Bench基准测试，用于评估现有视频MLLMs在此特定任务中的表现。

链接: https://arxiv.org/abs/2501.07978
作者: Jiaxing Zhao,Boyuan Sun,Xiang Chen,Xihan Wei
机构: Tongyi Group, Alibaba(阿里巴巴通义集团); VCIP, CS, Nankai University(南开大学计算机科学与技术学院视觉计算与图像处理实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression caption. The dataset comprises 5,033 high-quality video clips annotated manually, containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character’s face. This model demonstrates superior performance in tracking faces and focusing on the facial expressions of the main characters, even in intricate multi-person scenarios. Additionally, we introduce a novel evaluation metric combining event extraction, relation classification, and the longest common subsequence (LCS) algorithm to assess the content consistency and temporal sequence consistency of generated text. Moreover, we present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task. All data and source code will be made publicly available.
zh

[CV-50] Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models AAAI2025

【速读】：该论文旨在解决视频片段检索（Video Moment Retrieval, VMR）任务中存在的两个主要问题：一是现有基于多模态大语言模型（Multimodal Large Language Models, MLLMs）的方法过度依赖高质量数据集和耗时的微调过程；二是在零样本（zero-shot）设置下，现有方法忽略了查询中固有的语言偏差，导致错误的定位。为解决这些问题，论文提出了Moment-GPT，一种无需微调的零样本VMR管道。其关键解决方案包括：首先使用LLaMA-3对查询进行校正和重述，以减少语言偏差；其次设计了一个结合MiniGPT-v2的自适应候选片段生成器；最后利用VideoChatGPT和片段评分器选择最合适的片段。该方法在多个公开数据集上显著优于现有的基于MLLM和零样本的模型。

链接: https://arxiv.org/abs/2501.07972
作者: Yifang Xu,Yunzhuo Sun,Benxiang Zhai,Ming Li,Wenxin Liang,Yang Li,Sidan Du
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:The target of video moment retrieval (VMR) is predicting temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) overly rely on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle the aforementioned challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply VideoChatGPT and span scorer to select the most appropriate spans. Our proposed method substantially outperforms the state-ofthe-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.
zh

[CV-51] SkipClick: Combining Quick Responses and Low-Level Features for Interactive Segmentation in Winter Sports Contexts

【速读】：该论文旨在解决在冬季运动场景下的交互式分割（interactive segmentation）问题，即通过用户提供的点击提示（click prompts）来预测高质量的分割掩码（segmentation masks）。论文首先提出了一种基线架构，该架构专门设计为在每次点击后快速响应。随后，作者提出并描述了一系列架构改进，这些改进显著提升了在WSESeg数据集上分割冬季运动装备的性能。关键解决方案包括优化网络架构以提高响应速度和分割精度，特别是在NoC@85（Number of Clicks at 85% IoU）指标上，该模型显著优于SAM和HQ-SAM。此外，该模型在HQSeg-44k数据集上也取得了最先进的结果，并在包含滑雪者掩码的新数据集上进行了测试，进一步验证了其泛化能力。

链接: https://arxiv.org/abs/2501.07960
作者: Robin Schön,Julian Lorenz,Daniel Kienzle,Rainer Lienhart
机构: University of Augsburg(奥格斯堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 figures, 6 tables, 12 pages

点击查看摘要

Abstract:In this paper, we present a novel architecture for interactive segmentation in winter sports contexts. The field of interactive segmentation deals with the prediction of high-quality segmentation masks by informing the network about the objects position with the help of user guidance. In our case the guidance consists of click prompts. For this task, we first present a baseline architecture which is specifically geared towards quickly responding after each click. Afterwards, we motivate and describe a number of architectural modifications which improve the performance when tasked with segmenting winter sports equipment on the WSESeg dataset. With regards to the average NoC@85 metric on the WSESeg classes, we outperform SAM and HQ-SAM by 2.336 and 7.946 clicks, respectively. When applied to the HQSeg-44k dataset, our system delivers state-of-the-art results with a NoC@90 of 6.00 and NoC@95 of 9.89. In addition to that, we test our model on a novel dataset containing masks for humans during skiing.
zh

[CV-52] AI Guide Dog: Egocentric Path Prediction on Smartphone

【速读】：该论文旨在解决视觉障碍者在室内外环境中的实时导航问题，特别是如何在不同场景下实现安全、高效的导航辅助。论文提出的解决方案关键是一种轻量级的、基于视觉的自我中心导航辅助系统——AI Guide Dog (AIGD)。该系统通过多标签分类方法预测方向指令，确保在多样化环境中的安全通行。此外，AIGD创新性地结合了GPS信号和高层方向信息，实现了目标导向的户外导航，同时通过处理不确定的多路径预测，支持无目标导向的室内导航。该系统的通用性使其成为首个能够同时处理目标导向和探索性导航场景的导航辅助系统，为盲人导航领域设立了新的技术标杆。

链接: https://arxiv.org/abs/2501.07957
作者: Aishwarya Jadhav,Jeffery Cao,Abhishree Shetty,Urvashi Priyam Kumar,Aditi Sharma,Ben Sukboontip,Jayant Sravan Tamarapalli,Jingyi Zhang,Anirudh Koul
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces AI Guide Dog (AIGD), a lightweight egocentric navigation assistance system for visually impaired individuals, designed for real-time deployment on smartphones. AIGD addresses key challenges in blind navigation by employing a vision-only, multi-label classification approach to predict directional commands, ensuring safe traversal across diverse environments. We propose a novel technique to enable goal-based outdoor navigation by integrating GPS signals and high-level directions, while also addressing uncertain multi-path predictions for destination-free indoor navigation. Our generalized model is the first navigation assistance system to handle both goal-oriented and exploratory navigation scenarios across indoor and outdoor settings, establishing a new state-of-the-art in blind navigation. We present methods, datasets, evaluations, and deployment insights to encourage further innovations in assistive navigation systems.
zh

[CV-53] Robust Hyperspectral Image Panshapring via Sparse Spatial-Spectral Representation

【速读】：该论文旨在解决高分辨率高光谱成像（hyperspectral imaging）在硬件限制下面临的获取难题。具体而言，论文提出了一种名为S^3RNet的新框架，用于高光谱图像的全色锐化（pansharpening），通过将低分辨率高光谱图像（LRHSI）与高分辨率多光谱图像（HRMSI）结合，利用稀疏空间-光谱表示（sparse spatial-spectral representation）来提高图像质量。解决方案的关键在于其核心组件——多分支融合网络（Multi-Branch Fusion Network, MBFN），该网络通过并行分支捕捉不同空间和光谱尺度的互补特征。此外，空间-光谱注意力权重块（Spatial-Spectral Attention Weight Block, SSAWB）动态调整特征权重，以保持稀疏表示并抑制噪声和冗余。为了增强特征传播，论文还引入了密集特征聚合块（Dense Feature Aggregation Block, DFAB），通过密集连接模式高效聚合输入特征。这一集成设计使S^3RNet能够在保持计算效率的同时，选择性地强调不同尺度中最具信息量的特征。实验结果表明，S^3RNet在多个评估指标上达到了最先进的性能，尤其在具有挑战性的噪声条件下仍能保持高质量的重建效果。

链接: https://arxiv.org/abs/2501.07953
作者: Chia-Ming Lee,Yu-Fan Lin,Li-Wei Kang,Chih-Chung Hsu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Submitted to IGARSS 2025

点击查看摘要

Abstract:High-resolution hyperspectral imaging plays a crucial role in various remote sensing applications, yet its acquisition often faces fundamental limitations due to hardware constraints. This paper introduces S ^3 RNet, a novel framework for hyperspectral image pansharpening that effectively combines low-resolution hyperspectral images (LRHSI) with high-resolution multispectral images (HRMSI) through sparse spatial-spectral representation. The core of S ^3 RNet is the Multi-Branch Fusion Network (MBFN), which employs parallel branches to capture complementary features at different spatial and spectral scales. Unlike traditional approaches that treat all features equally, our Spatial-Spectral Attention Weight Block (SSAWB) dynamically adjusts feature weights to maintain sparse representation while suppressing noise and redundancy. To enhance feature propagation, we incorporate the Dense Feature Aggregation Block (DFAB), which efficiently aggregates inputted features through dense connectivity patterns. This integrated design enables S ^3 RNet to selectively emphasize the most informative features from differnt scale while maintaining computational efficiency. Comprehensive experiments demonstrate that S ^3 RNet achieves state-of-the-art performance across multiple evaluation metrics, showing particular strength in maintaining high reconstruction quality even under challenging noise conditions. The code will be made publicly available.
zh

[CV-54] VENOM: Text-driven Unrestricted Adversarial Example Generation with Diffusion Models

【速读】：该论文旨在解决现有基于扩散模型（diffusion models）的无限制对抗样本（Unrestricted Adversarial Examples, UAEs）生成方法在直接从随机噪声生成自然对抗样本（Natural Adversarial Examples, NAEs）时面临的挑战，如生成结果不可控或失真。现有方法通常依赖于参考图像，难以生成高质量的对抗样本。为此，论文提出了VENOM框架，首次通过扩散模型实现文本驱动的高质量无限制对抗样本生成。VENOM的关键创新在于将图像内容生成与对抗合成统一到一个反向扩散过程中，从而在不牺牲攻击成功率（Attack Success Rate, ASR）的情况下生成高保真对抗样本。此外，VENOM引入了一种带动量的自适应对抗引导策略，确保生成的对抗样本与自然图像的分布对齐，进一步提升了生成结果的稳定性和质量。实验表明，VENOM在ASR和图像质量上均优于现有方法，为对抗样本生成和模型防御研究提供了重要进展。

链接: https://arxiv.org/abs/2501.07922
作者: Hui Kuurila-Zhang,Haoyu Chen,Guoying Zhao
机构: University of Oulu, CMVS (奥卢大学, CMVS)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial attacks have proven effective in deceiving machine learning models by subtly altering input images, motivating extensive research in recent years. Traditional methods constrain perturbations within l_p -norm bounds, but advancements in Unrestricted Adversarial Examples (UAEs) allow for more complex, generative-model-based manipulations. Diffusion models now lead UAE generation due to superior stability and image quality over GANs. However, existing diffusion-based UAE methods are limited to using reference images and face challenges in generating Natural Adversarial Examples (NAEs) directly from random noise, often producing uncontrolled or distorted outputs. In this work, we introduce VENOM, the first text-driven framework for high-quality unrestricted adversarial examples generation through diffusion models. VENOM unifies image content generation and adversarial synthesis into a single reverse diffusion process, enabling high-fidelity adversarial examples without sacrificing attack success rate (ASR). To stabilize this process, we incorporate an adaptive adversarial guidance strategy with momentum, ensuring that the generated adversarial examples x^* align with the distribution p(x) of natural images. Extensive experiments demonstrate that VENOM achieves superior ASR and image quality compared to prior methods, marking a significant advancement in adversarial example generation and providing insights into model vulnerabilities for improved defense development.
zh

[CV-55] Cloud Removal With PolSAR-Optical Data Fusion Using A Two-Flow Residual Network

【速读】：该论文旨在解决光学遥感图像（optical remote sensing images）因云层覆盖而难以获取完整图像的问题。为此，作者提出了一种基于极化合成孔径雷达（Polarimetric Synthetic Aperture Radar, PolSAR）与光学数据融合的云去除算法（PODF-CR），以实现缺失光学图像的重建。该算法的关键包括：1）编码模块中的两个并行分支，分别提取PolSAR图像和光学图像的特征，并通过动态滤波器（dynamic filters）去除PolSAR图像中的斑点噪声（speckle noise）；2）基于跨跳跃连接（cross-skip connections）的融合块，促进多模态数据信息的交互；3）通过注意力机制（attention mechanism）优化融合特征，为后续解码提供更好的条件；4）解码模块中引入多尺度卷积（multi-scale convolution）以获取多尺度信息。此外，作者使用了一个包含后向散射系数特征图像（backscatter coefficient feature images）和极化特征图像（polarization feature images）的数据集OPT-BCFSAR-PFSAR，以更好地利用散射信息和极化特性辅助光学图像恢复。实验结果表明，该方法在定性和定量评估中均优于现有方法。

链接: https://arxiv.org/abs/2501.07901
作者: Yuxi Wang,Wenjuan Zhang,Bing Zhang
机构: Key Laboratory of Digital Earth Science, Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院数字地球科学重点实验室); International Research Center of Big Data for Sustainable Development Goals (可持续发展大数据国际研究中心); College of Resources and Environment, University of Chinese Academy of Sciences (中国科学院大学资源与环境学院); Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Optical remote sensing images play a crucial role in the observation of the Earth’s surface. However, obtaining complete optical remote sensing images is challenging due to cloud cover. Reconstructing cloud-free optical images has become a major task in recent years. This paper presents a two-flow Polarimetric Synthetic Aperture Radar (PolSAR)-Optical data fusion cloud removal algorithm (PODF-CR), which achieves the reconstruction of missing optical images. PODF-CR consists of an encoding module and a decoding module. The encoding module includes two parallel branches that extract PolSAR image features and optical image features. To address speckle noise in PolSAR images, we introduce dynamic filters in the PolSAR branch for image denoising. To better facilitate the fusion between multimodal optical images and PolSAR images, we propose fusion blocks based on cross-skip connections to enable interaction of multimodal data information. The obtained fusion features are refined through an attention mechanism to provide better conditions for the subsequent decoding of the fused images. In the decoding module, multi-scale convolution is introduced to obtain multi-scale information. Additionally, to better utilize comprehensive scattering information and polarization characteristics to assist in the restoration of optical images, we use a dataset for cloud restoration called OPT-BCFSAR-PFSAR, which includes backscatter coefficient feature images and polarization feature images obtained from PoLSAR data and optical images. Experimental results demonstrate that this method outperforms existing methods in both qualitative and quantitative evaluations.
zh

[CV-56] Demographic Variability in Face Image Quality Measures

【速读】：该论文旨在解决面部图像质量评估（FIQA）算法在人口统计学变量（如年龄、性别和肤色）上的潜在偏差问题。随着FIQA算法被广泛应用于在线身份管理系统中，确保这些算法在不同人口群体中的公平性和一致性变得尤为重要。论文的核心解决方案是通过评估ISO/IEC 29794-5国际标准中包含的所有面部图像质量度量，分析其在年龄、性别和肤色三个变量上的表现。研究结果表明，大多数质量度量在不同人口群体中没有明显的偏差，仅有两个质量度量在肤色变量上表现出显著的差异。这一发现为未来制定缓解措施提供了重要依据，以确保FIQA算法的公平性和可靠性。

链接: https://arxiv.org/abs/2501.07898
作者: Wassim Kabbani,Kiran Raja,Raghavendra Ramachandra,Christoph Busch
机构: IIK, Info. Sec. and Comm. Technology, Gjovik, Norway (信息安全和通信技术研究所, 挪威耶维克); Department of Computer Science, Gjovik, Norway (计算机科学系, 挪威耶维克)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face image quality assessment (FIQA) algorithms are being integrated into online identity management applications. These applications allow users to upload a face image as part of their document issuance process, where the image is then run through a quality assessment process to make sure it meets the quality and compliance requirements. Concerns about demographic bias have been raised about biometric systems, given the societal implications this may cause. It is therefore important that demographic variability in FIQA algorithms is assessed such that mitigation measures can be created. In this work, we study the demographic variability of all face image quality measures included in the ISO/IEC 29794-5 international standard across three demographic variables: age, gender, and skin tone. The results are rather promising and show no clear bias toward any specific demographic group for most measures. Only two quality measures are found to have considerable variations in their outcomes for different groups on the skin tone variable.
zh

[CV-57] arsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

【速读】：该论文旨在解决生成详细且准确的视频描述以及提升通用视频理解能力的问题。为此，作者提出了Tarsier2，一种先进的大规模视觉-语言模型（Large Vision-Language Model, LVLM）。解决方案的关键在于三个主要升级：(1) 将预训练数据从1100万（11M）视频-文本对扩展到4000万（40M），显著增加了数据的规模和多样性；(2) 在监督微调过程中进行细粒度的时间对齐，以提升模型对视频时序信息的理解；(3) 使用基于模型的采样方法自动构建偏好数据，并应用直接偏好优化（Direct Preference Optimization, DPO）进行模型优化。通过这些改进，Tarsier2-7B在多个基准测试中超越了现有的领先模型，如GPT-4o和Gemini 1.5 Pro，展示了其在视频描述生成和通用视频理解任务中的卓越性能。

链接: https://arxiv.org/abs/2501.07888
作者: Liping Yuan,Jiawei Wang,Haomiao Sun,Yuchen Zhang,Yuan Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.
zh

[CV-58] Mitigating Algorithmic Bias in Multiclass CNN Classifications Using Causal Modeling

【速读】：该论文旨在解决多类分类问题中的算法偏见（algorithmic bias）问题，特别是在性别分类中的偏见。研究发现，在使用卷积神经网络（CNN）进行情感分类时，存在性别偏见：女性更容易被分类为“高兴”或“悲伤”，而男性则更容易被分类为“中性”。为了解决这一问题，研究采用了因果建模（causal modeling）和一对其余（one-vs-all, OvA）技术。具体而言，研究为每个情感类别构建了因果模型，以调整CNN模型的预测类别概率，并通过选择最高概率的类别来聚合调整后的概率。最终，经过去偏处理的分类结果在所有类别中表现出更高的性别公平性，且对整体准确性的影响微乎其微，甚至略有提升。该研究表明，算法公平性和准确性并非必然的权衡关系。

链接: https://arxiv.org/abs/2501.07885
作者: Min Sik Byun,Wendy Wan Yee Hui,Wai Kwong Lau
机构: Singapore Institute of Technology(新加坡理工学院); University of Western Australia(西澳大利亚大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages; 6 figures

点击查看摘要

Abstract:This study describes a procedure for applying causal modeling to detect and mitigate algorithmic bias in a multiclass classification problem. The dataset was derived from the FairFace dataset, supplemented with emotional labels generated by the DeepFace pre-trained model. A custom Convolutional Neural Network (CNN) was developed, consisting of four convolutional blocks, followed by fully connected layers and dropout layers to mitigate overfitting. Gender bias was identified in the CNN model’s classifications: Females were more likely to be classified as “happy” or “sad,” while males were more likely to be classified as “neutral.” To address this, the one-vs-all (OvA) technique was applied. A causal model was constructed for each emotion class to adjust the CNN model’s predicted class probabilities. The adjusted probabilities for the various classes were then aggregated by selecting the class with the highest probability. The resulting debiased classifications demonstrated enhanced gender fairness across all classes, with negligible impact–or even a slight improvement–on overall accuracy. This study highlights that algorithmic fairness and accuracy are not necessarily trade-offs. All data and code for this study are publicly available for download.
zh

[CV-59] Make-A-Character 2: Animatable 3D Character Generation From a Single Image

【速读】：该论文旨在解决从单张肖像照片生成高质量3D角色（3D character）的挑战，特别是在游戏开发和数字人应用中的需求。解决方案的关键在于Make-A-Character 2系统的多项改进：首先，使用IC-Light方法校正输入照片中的非理想光照条件，并通过基于神经网络的色彩校正技术使照片中的肤色与游戏引擎渲染结果保持一致。其次，采用层次化表示网络（Hierarchical Representation Network）捕捉高频面部结构，并结合自适应骨骼校准技术，以实现精确且富有表现力的面部动画。此外，系统利用Transformer架构生成伴随语音的面部和手势动作，使生成的3D角色能够实时进行对话。这些技术的集成显著提升了图像到3D角色生成的效率，整个过程耗时不到2分钟，并已应用于对话式AI虚拟人产品中。

链接: https://arxiv.org/abs/2501.07870
作者: Lin Liu,Yutong Wang,Jiahao Chen,Jianfang Li,Tangli Xue,Longlong Li,Jianqiang Ren,Liefeng Bo
机构: Tongyi Lab, Alibaba Group(通义实验室, 阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:This report introduces Make-A-Character 2, an advanced system for generating high-quality 3D characters from single portrait photographs, ideal for game development and digital human applications. Make-A-Character 2 builds upon its predecessor by incorporating several significant improvements for image-based head generation. We utilize the IC-Light method to correct non-ideal illumination in input photos and apply neural network-based color correction to harmonize skin tones between the photos and game engine renders. We also employ the Hierarchical Representation Network to capture high-frequency facial structures and conduct adaptive skeleton calibration for accurate and expressive facial animations. The entire image-to-3D-character generation process takes less than 2 minutes. Furthermore, we leverage transformer architecture to generate co-speech facial and gesture actions, enabling real-time conversation with the generated character. These technologies have been integrated into our conversational AI avatar products.
zh

[CV-60] deepTerra – AI Land Classification Made Easy

【速读】：该论文旨在解决利用机器学习和卫星影像进行地表特征分类的复杂性问题。解决方案的关键在于开发了一个名为deepTerra的综合平台，该平台集成了数据收集、图像增强、训练、测试和预测等多个模块，从而简化了图像分类任务的整个工作流程。通过这一平台，研究人员能够更高效地进行地表特征的分类和分析，推动了相关研究领域的发展。

链接: https://arxiv.org/abs/2501.07859
作者: Andrew Keith Wilkinson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:deepTerra is a comprehensive platform designed to facilitate the classification of land surface features using machine learning and satellite imagery. The platform includes modules for data collection, image augmentation, training, testing, and prediction, streamlining the entire workflow for image classification tasks. This paper presents a detailed overview of the capabilities of deepTerra, shows how it has been applied to various research areas, and discusses the future directions it might take.
zh

[CV-61] State-of-the-Art Transformer Models for Image Super-Resolution: Techniques Challenges and Applications

【速读】：该论文旨在解决图像超分辨率（Image Super-Resolution, SR）中的关键问题，即如何从低分辨率图像中恢复出高分辨率图像，并提升细节和视觉质量。传统方法如基于卷积神经网络（CNN）和生成对抗网络（GAN）的方法存在感受野有限、全局上下文捕捉能力不足以及高频细节恢复困难等局限性。论文提出的解决方案之关键在于利用基于Transformer的方法，这些方法通过结合Transformer与传统网络架构，能够更好地平衡全局和局部上下文信息，从而实现更高质量的重建效果。通过分析最新的Transformer-based SR模型及其创新技术和架构，论文揭示了未来研究的潜在方向，并为深度学习领域的研究者提供了结构化的研究路线图。

链接: https://arxiv.org/abs/2501.07855
作者: Debasish Dutta,Deepjyoti Chetia,Neeharika Sonowal,Sanjib Kr Kalita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 8 pages

点击查看摘要

Abstract:Image Super-Resolution (SR) aims to recover a high-resolution image from its low-resolution counterpart, which has been affected by a specific degradation process. This is achieved by enhancing detail and visual quality. Recent advancements in transformer-based methods have remolded image super-resolution by enabling high-quality reconstructions surpassing previous deep-learning approaches like CNN and GAN-based. This effectively addresses the limitations of previous methods, such as limited receptive fields, poor global context capture, and challenges in high-frequency detail recovery. Additionally, the paper reviews recent trends and advancements in transformer-based SR models, exploring various innovative techniques and architectures that combine transformers with traditional networks to balance global and local contexts. These neoteric methods are critically analyzed, revealing promising yet unexplored gaps and potential directions for future research. Several visualizations of models and techniques are included to foster a holistic understanding of recent trends. This work seeks to offer a structured roadmap for researchers at the forefront of deep learning, specifically exploring the impact of transformers on super-resolution techniques.
zh

[CV-62] 3UR-LLM : An End-to-End Multimodal Large Language Model for 3D Scene Understanding

【速读】：该论文旨在解决多模态大语言模型（MLLMs）在从2D任务过渡到3D场景理解时面临的挑战，特别是在空间位置、相互关系及因果逻辑的识别方面。主要问题包括：1）3D场景数据的高标注成本限制了数据规模的扩展；2）缺乏直接有效的方法来感知3D信息，导致训练时间延长并增加了框架的复杂性。为解决这些问题，作者提出了一种基于开源2D MLLMs和大语言模型（LLMs）的流水线，用于生成高质量的3D-文本对，并构建了3DS-160K数据集以增强预训练过程。在此基础上，作者提出了3UR-LLM模型，这是一种端到端的3D MLLM，能够精确解释3D场景，并在处理物理世界复杂性方面表现出色。3UR-LLM直接接收3D点云作为输入，并将3D特征与文本指令融合投影为可管理的token集。为减轻混合token带来的计算负担，作者设计了一个3D压缩模块，用于协同压缩3D空间线索和文本叙述。3UR-LLM在ScanQA等任务上表现优异，超越了现有最佳模型（SOTAs），例如在CIDEr指标上提升了7.1%，同时减少了训练资源的使用。

链接: https://arxiv.org/abs/2501.07819
作者: Haomiao Xiong,Yunzhi Zhuge,Jiawen Zhu,Lu Zhang,Huchuan Lu
机构: School of Information and Communication Engineering, Dalian University of Technology, Dalian 116081, China (大连理工大学信息与通信工程学院); School of Future Technology, Dalian University of Technology, Dalian 116081, China (大连理工大学未来技术学院); School of Artificial Intelligence, Dalian University of Technology, Dalian 116081, China (大连理工大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Transactions on Multimedia (TMM)

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) exhibit impressive capabilities in 2D tasks, yet encounter challenges in discerning the spatial positions, interrelations, and causal logic in scenes when transitioning from 2D to 3D representations. We find that the limitations mainly lie in: i) the high annotation cost restricting the scale-up of volumes of 3D scene data, and ii) the lack of a straightforward and effective way to perceive 3D information which results in prolonged training durations and complicates the streamlined framework. To this end, we develop pipeline based on open-source 2D MLLMs and LLMs to generate high-quality 3D-text pairs and construct 3DS-160K , to enhance the pre-training process. Leveraging this high-quality pre-training data, we introduce the 3UR-LLM model, an end-to-end 3D MLLM designed for precise interpretation of 3D scenes, showcasing exceptional capability in navigating the complexities of the physical world. 3UR-LLM directly receives 3D point cloud as input and project 3D features fused with text instructions into a manageable set of tokens. Considering the computation burden derived from these hybrid tokens, we design a 3D compressor module to cohesively compress the 3D spatial cues and textual narrative. 3UR-LLM achieves promising performance with respect to the previous SOTAs, for instance, 3UR-LLM exceeds its counterparts by 7.1% CIDEr on ScanQA, while utilizing fewer training resources. The code and model weights for 3UR-LLM and the 3DS-160K benchmark are available at 3UR-LLM.
zh

[CV-63] AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation

【速读】：该论文旨在解决音视频分割（Audio-Visual Segmentation, AVS）任务中的关键挑战，即如何在视频流中准确定位和分割发出声音的物体。现有基于Transformer的方法在处理长程依赖关系时，由于二次方的计算成本，在复杂场景中表现受限。为解决这一问题，论文提出了AVS-Mamba，一种选择性状态空间模型（Selective State Space Model），通过线性复杂度实现复杂的多模态理解。解决方案的关键在于两个核心组件：Temporal Mamba Block用于序列视频处理，Vision-to-Audio Fusion Block用于高级的音视频融合。此外，论文还提出了多尺度时间编码器（Multi-scale Temporal Encoder）和模态聚合解码器（Modality Aggregation Decoder），分别用于增强跨尺度的视觉特征学习和多模态特征融合。最后，通过上下文集成金字塔（Contextual Integration Pyramid）实现音视频的时空上下文协作。这些创新使得该方法在AVSBench-object和AVSBench-semantic数据集上达到了新的最优性能。

链接: https://arxiv.org/abs/2501.07810
作者: Sitong Gong,Yunzhi Zhuge,Lu Zhang,Yifan Wang,Pingping Zhang,Lijun Wang,Huchuan Lu
机构: School of Information and Communication Engineering, Dalian University of Technology, Dalian 116081, China (大连理工大学信息与通信工程学院); School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116081, China (大连理工大学创新创业学院); School of Future Technology and the School of Artificial Intelligence, Dalian University of Technology, Dalian 116081, China (大连理工大学未来技术学院与人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Transactions on Multimedia (TMM)

点击查看摘要

Abstract:The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, their handling of long-range dependencies struggles due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: Temporal Mamba Block for sequential video processing and Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on this, we develop the Multi-scale Temporal Encoder, aimed at enhancing the learning of visual features across scales, facilitating the perception of intra- and inter-frame information. To perform multi-modal fusion, we propose the Modality Aggregation Decoder, leveraging the Vision-to-Audio Fusion Block to integrate visual features into audio features across both frame and temporal levels. Further, we adopt the Contextual Integration Pyramid to perform audio-to-vision spatial-temporal context collaboration. Through these innovative contributions, our approach achieves new state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets. Our source code and model weights are available at AVS-Mamba.
zh

[CV-64] A Low-cost and Ultra-lightweight Binary Neural Network for Traffic Signal Recognition

【速读】：该论文试图解决在资源受限平台（如车辆平台和可穿戴AIOT设备）上部署深度学习模型时面临的挑战，包括模型资源占用大、结构复杂和高功耗等问题。为解决这些问题，作者提出了一种超轻量级的二值神经网络（BNN）模型，专为硬件部署设计。该模型在推理阶段仅依赖逻辑运算和低比特宽度的定点加减运算，显著简化了处理单元（PE）的设计复杂度。通过在德国交通标志识别基准（GTSRB）、中国交通标志（CTS）和比利时交通标志（BTS）数据集上的验证，该模型在GTSRB数据集上达到了97.64%的识别准确率，且与全精度模型相比，准确率损失控制在1%以内，参数存储开销仅为全精度模型的10%。这一解决方案的关键在于通过二值化和低比特运算大幅降低了模型的复杂性和资源需求，同时保持了较高的识别性能，展示了BNN在自动驾驶相关计算机视觉任务硬件部署中的巨大潜力。

链接: https://arxiv.org/abs/2501.07808
作者: Mingke Xiao,Yue Su,Liang Yu,Guanglong Qu,Yutong Jia,Yukuan Chang,Xu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The deployment of neural networks in vehicle platforms and wearable Artificial Intelligence-of-Things (AIOT) scenarios has become a research area that has attracted much attention. With the continuous evolution of deep learning technology, many image classification models are committed to improving recognition accuracy, but this is often accompanied by problems such as large model resource usage, complex structure, and high power consumption, which makes it challenging to deploy on resource-constrained platforms. Herein, we propose an ultra-lightweight binary neural network (BNN) model designed for hardware deployment, and conduct image classification research based on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. In addition, we also verify it on the Chinese Traffic Sign (CTS) and Belgian Traffic Sign (BTS) datasets. The proposed model shows excellent recognition performance with an accuracy of up to 97.64%, making it one of the best performing BNN models in the GTSRB dataset. Compared with the full-precision model, the accuracy loss is controlled within 1%, and the parameter storage overhead of the model is only 10% of that of the full-precision model. More importantly, our network model only relies on logical operations and low-bit width fixed-point addition and subtraction operations during the inference phase, which greatly simplifies the design complexity of the processing element (PE). Our research shows the great potential of BNN in the hardware deployment of computer vision models, especially in the field of computer vision tasks related to autonomous driving.
zh

[CV-65] Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation

【速读】：该论文旨在解决无监督视频对象分割（Unsupervised Video Object Segmentation, UVOS）中的挑战，提出了一种名为MTNet的高效算法。该算法的关键在于同时利用运动（motion）和时间（temporal）线索，将外观（appearance）和运动特征在编码器（encoder）的特征提取过程中有效融合，从而生成更具互补性的表示。此外，论文引入了一个时间变换器模块（temporal transformer module），以捕捉视频中复杂的长期上下文动态信息，并通过多级解码器（cascade of decoders）优化利用提取的特征，生成更精确的分割掩码。MTNet通过结合时间和跨模态知识，能够在各种复杂场景中准确定位和跟踪主要对象，并在多个基准测试中达到了最先进的性能。

链接: https://arxiv.org/abs/2501.07806
作者: Yunzhi Zhuge,Hongyu Gu,Lu Zhang,Jinqing Qi,Huchuan Lu
机构: School of Information and Communication Engineering, Dalian University of Technology, China (大连理工大学信息与通信工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

点击查看摘要

Abstract:In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects by integrating them within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating efficacious inter-frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders all feature levels across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that explores both temporal and cross-modality knowledge to robustly localize and track the primary object accurately in various challenging scenarios efficiently. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state-of-the-art performance in unsupervised video object segmentation but also delivers competitive results in video salient object detection. These findings highlight the method’s robust versatility and its adeptness in adapting to a range of segmentation tasks. Source code is available on this https URL.
zh

[CV-66] Balance Divergence for Knowledge Distillation

【速读】：该论文试图解决知识蒸馏（Knowledge Distillation）过程中，由于Kullback-Leibler散度（Kullback-Leibler divergence）在模仿教师网络（teacher network）和学生网络（student network）之间的logit输出概率时，可能忽略教师网络中极小概率的“暗知识”（dark knowledge）的问题。这种忽略可能导致蒸馏过程中logit模仿效果不佳，进而导致学生网络获取的信息不平衡。为解决这一问题，论文提出了一种名为“平衡散度蒸馏”（Balance Divergence Distillation）的新方法。该方法的关键在于引入反向Kullback-Leibler散度（reverse Kullback-Leibler divergence）作为补偿操作，以更好地建模教师网络中极小概率的负值部分，同时保留正值的有效学习能力。此外，论文还探讨了不同温度系数（temperature coefficients）调整对知识传递平衡的影响。实验结果表明，该方法在多个计算机视觉任务中显著提升了轻量级学生网络的性能。

链接: https://arxiv.org/abs/2501.07804
作者: Yafei Qi,Chen Wang,Zhaoning Zhang,Yaping Liu,Yongmin Zhang
机构: CSU (Central South University, 中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Knowledge distillation has been widely adopted in computer vision task processing, since it can effectively enhance the performance of lightweight student networks by leveraging the knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods utilize Kullback-Leibler divergence to mimic the logit output probabilities between the teacher network and the student network. Nonetheless, these methods may neglect the negative parts of the teacher’s ‘‘dark knowledge’’ because the divergence calculations may ignore the effect of the minute probabilities from the teacher’s logit output. This deficiency may lead to suboptimal performance in logit mimicry during the distillation process and result in an imbalance of information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method, named Balance Divergence Distillation. By introducing a compensatory operation using reverse Kullback-Leibler divergence, our method can improve the modeling of the extremely small values in the negative from the teacher and preserve the learning capacity for the positive. Furthermore, we test the impact of different temperature coefficients adjustments, which may conducted to further balance for knowledge transferring. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1%~3% for lightweight students on both CIFAR-100 and ImageNet dataset, and a 4.55% improvement in mIoU for PSP-ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.
zh

[CV-67] BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos

【速读】：该论文旨在解决从单目视频中预测生物力学准确的3D人体姿态的问题。现有的参数化模型（如SMPL）在捕捉真实关节位置和运动时存在解剖结构过度简化的问题，限制了其在生物力学、医疗和机器人领域的应用。而生物力学准确的姿态估计通常依赖于昂贵的标记式运动捕捉系统和专业实验室中的优化技术。为了弥补这一差距，论文提出了BioPose，一种基于学习的新框架，直接从单目视频中预测生物力学准确的3D人体姿态。BioPose的关键解决方案包括三个核心组件：多查询人体网格恢复模型（MQ-HMR）、神经逆向运动学模型（NeurIK）和基于2D信息的姿态优化技术。MQ-HMR利用多查询可变形Transformer提取多尺度细粒度图像特征，实现精确的人体网格恢复；NeurIK将网格顶点视为虚拟标记，在解剖约束下通过时空网络回归生物力学准确的3D姿态；2D信息优化步骤则通过将3D结构与2D姿态观测对齐，进一步提升了3D姿态估计的精度。实验结果表明，BioPose在基准数据集上显著优于现有方法。

链接: https://arxiv.org/abs/2501.07800
作者: Farnoosh Koleini,Muhammad Usama Saleem,Pu Wang,Hongfei Xue,Ahmed Helmy,Abbey Fenwick
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in 3D human pose estimation from single-camera images and videos have relied on parametric models, like SMPL. However, these models oversimplify anatomical structures, limiting their accuracy in capturing true joint locations and movements, which reduces their applicability in biomechanics, healthcare, and robotics. Biomechanically accurate pose estimation, on the other hand, typically requires costly marker-based motion capture systems and optimization techniques in specialized labs. To bridge this gap, we propose BioPose, a novel learning-based framework for predicting biomechanically accurate 3D human pose directly from monocular videos. BioPose includes three key components: a Multi-Query Human Mesh Recovery model (MQ-HMR), a Neural Inverse Kinematics (NeurIK) model, and a 2D-informed pose refinement technique. MQ-HMR leverages a multi-query deformable transformer to extract multi-scale fine-grained image features, enabling precise human mesh recovery. NeurIK treats the mesh vertices as virtual markers, applying a spatial-temporal network to regress biomechanically accurate 3D poses under anatomical constraints. To further improve 3D pose estimations, a 2D-informed refinement step optimizes the query tokens during inference by aligning the 3D structure with 2D pose observations. Experiments on benchmark datasets demonstrate that BioPose significantly outperforms state-of-the-art methods. Project website: \urlthis https URL.
zh

[CV-68] BMIP: Bi-directional Modality Interaction Prompt Learning for VLM

【速读】：该论文试图解决现有视觉-语言模型（Vision-Language Models, VLMs）在提示学习（prompt learning）中主要关注单模态提示或单向模态交互的问题，忽视了视觉和语言模态之间交互带来的强大对齐效果。为此，论文提出了一种新颖的提示学习方法，称为双向模态交互提示（Bi-directional Modality Interaction Prompt, BMIP）。该方法的解决方案关键在于通过动态加权双模态信息，利用注意力层的信息学习，增强了模型的可训练性和模态间的一致性，相比简单的信息聚合方法具有显著优势。此外，论文还提出了一种更为现实的评估范式，称为开放世界泛化（open-world generalization），以补充广泛采用的跨数据集迁移和领域泛化任务。实验结果表明，BMIP在三种评估范式中均优于当前最先进的方法，并且能够灵活地与其他基于提示的方法结合，进一步提升性能。

链接: https://arxiv.org/abs/2501.07769
作者: Song-Lin Lv,Yu-Yang Chen,Zhi Zhou,Ming Yang,Lan-Zhe Guo
机构: School of Intelligence Science and Technology, Nanjing University, China(南京大学智能科学与技术学院); School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院); National Key Laboratory for Novel Software Technology, Nanjing University, China(南京大学软件新技术国家重点实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for the ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called \underline\textbfBi-directional \underline\textbfModality \underline\textbfInteraction \underline\textbfPrompt (BMIP) , which dynamically weights bi-modal information through learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.
zh

[CV-69] PSReg: Prior-guided Sparse Mixture of Experts for Point Cloud Registration AAAI2025

【速读】：该论文试图解决点云配准（point cloud registration）中特征区分度不足的问题，特别是在重叠区域中模糊结构的区分上。现有方法通过区分非重叠区域和重叠区域的点来提升特征区分度，但在重叠区域中仍难以有效区分模糊结构，导致提取的特征存在大量异常匹配。为解决这一问题，论文提出了一种基于先验引导的稀疏混合专家（prior-guided SMoE）配准方法，通过将潜在对应点分配到同一专家进行处理，从而提升特征的区分度。具体而言，论文提出了一个先验引导的SMoE模块，通过融合先验重叠信息和潜在对应点嵌入进行路由分配，将点云特征分配给最合适的专家进行处理。此外，论文还提出了一种结合Transformer层和先验引导SMoE模块的配准框架，不仅关注定位点云重叠区域的重要性，还致力于在重叠区域中找到更准确的对应点。实验结果表明，该方法在3DMatch/3DLoMatch基准测试中达到了最先进的配准召回率（95.7%/79.3%），并在ModelNet40数据集上表现出色。

链接: https://arxiv.org/abs/2501.07762
作者: Xiaoshui Huang,Zhou Huang,Yifan Zuo,Yongshun Gong,Chengdong Zhang,Deyang Liu,Yuming Fang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:The discriminative feature is crucial for point cloud registration. Recent methods improve the feature discriminative by distinguishing between non-overlapping and overlapping region points. However, they still face challenges in distinguishing the ambiguous structures in the overlapping regions. Therefore, the ambiguous features they extracted resulted in a significant number of outlier matches from overlapping regions. To solve this problem, we propose a prior-guided SMoE-based registration method to improve the feature distinctiveness by dispatching the potential correspondences to the same experts. Specifically, we propose a prior-guided SMoE module by fusing prior overlap and potential correspondence embeddings for routing, assigning tokens to the most suitable experts for processing. In addition, we propose a registration framework by a specific combination of Transformer layer and prior-guided SMoE module. The proposed method not only pays attention to the importance of locating the overlapping areas of point clouds, but also commits to finding more accurate correspondences in overlapping areas. Our extensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art registration recall (95.7%/79.3%) on the 3DMatch/3DLoMatch benchmark. Moreover, we also test the performance on ModelNet40 and demonstrate excellent performance.
zh

[CV-70] Universal Training of Neural Networks to Achieve Bayes Optimal Classification Accuracy ICASSP2025

【速读】：该论文试图解决在分类任务中如何更有效地逼近贝叶斯错误率（Bayes error rate）的问题。传统的损失函数如交叉熵（cross-entropy）虽然在许多任务中表现良好，但在某些具有挑战性的数据集上可能无法达到最优性能。为此，作者提出了一种基于f-散度（f-divergence）的新上界，用于估计贝叶斯错误率，并通过参数化模型的输出来计算该上界。基于这一理论，作者引入了贝叶斯最优学习阈值（Bayes Optimal Learning Threshold, BOLT）损失函数，其最小化能够使分类模型逼近贝叶斯错误率。通过在MNIST、Fashion-MNIST、CIFAR-10和IMDb等数据集上的实验验证，BOLT损失函数在图像和文本分类任务中表现优异，尤其是在具有挑战性的数据集上，其性能优于或与交叉熵相当，展示了其在提升模型泛化能力方面的潜力。

链接: https://arxiv.org/abs/2501.07754
作者: Mohammadreza Tavasoli Naeini,Ali Bereyhi,Morteza Noshad,Ben Liang,Alfred O. Hero III
机构: University of Michigan(密歇根大学); University of Toronto(多伦多大学); Stanford University(斯坦福大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: Accepted to ICASSP 2025

点击查看摘要

Abstract:This work invokes the notion of f -divergence to introduce a novel upper bound on the Bayes error rate of a general classification task. We show that the proposed bound can be computed by sampling from the output of a parameterized model. Using this practical interpretation, we introduce the Bayes optimal learning threshold (BOLT) loss whose minimization enforces a classification model to achieve the Bayes error rate. We validate the proposed loss for image and text classification tasks, considering MNIST, Fashion-MNIST, CIFAR-10, and IMDb datasets. Numerical experiments demonstrate that models trained with BOLT achieve performance on par with or exceeding that of cross-entropy, particularly on challenging datasets. This highlights the potential of BOLT in improving generalization.
zh

[CV-71] Boosting Sclera Segmentation through Semi-supervised Learning with Fewer Labels

【速读】：该论文试图解决巩膜分割（sclera segmentation）领域中高质量标注数据稀缺的问题。传统深度学习方法依赖于大量标注数据，而获取这些数据通常需要昂贵的医疗采集和专业标注，限制了模型的广泛应用。为解决这一问题，论文提出了一种基于半监督学习（semi-supervised learning）的巩膜分割框架，其关键创新在于结合了领域特定的改进和基于图像的空间变换（image-based spatial transformations），从而在有限标注样本的情况下显著提升了分割性能。此外，作者还构建了一个真实世界的眼病诊断数据集，进一步验证了该方法的有效性和优越性。

链接: https://arxiv.org/abs/2501.07750
作者: Guanjun Wang,Lu Wang,Ning Niu,Qiaoyi Yao,Yixuan Wang,Sufen Ren,Shengchao Chen
机构: Hainan University(海南大学); University of Technology Sydney(悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review, 19 pages, 9 figures, 4 tables

点击查看摘要

Abstract:Sclera segmentation is crucial for developing automatic eye-related medical computer-aided diagnostic systems, as well as for personal identification and verification, because the sclera contains distinct personal features. Deep learning-based sclera segmentation has achieved significant success compared to traditional methods that rely on hand-crafted features, primarily because it can autonomously extract critical output-related features without the need to consider potential physical constraints. However, achieving accurate sclera segmentation using these methods is challenging due to the scarcity of high-quality, fully labeled datasets, which depend on costly, labor-intensive medical acquisition and expertise. To address this challenge, this paper introduces a novel sclera segmentation framework that excels with limited labeled samples. Specifically, we employ a semi-supervised learning method that integrates domain-specific improvements and image-based spatial transformations to enhance segmentation performance. Additionally, we have developed a real-world eye diagnosis dataset to enrich the evaluation process. Extensive experiments on our dataset and two additional public datasets demonstrate the effectiveness and superiority of our proposed method, especially with significantly fewer labeled samples.
zh

[CV-72] Fixing the Scale and Shift in Monocular Depth For Camera Pose Estimation

【速读】：该论文旨在解决从单目深度预测（monocular depth prediction）中估计两个相机之间的相对位姿（relative pose）的问题。由于单目深度预测通常存在未知的尺度（scale）和偏移（shift）参数，论文提出了一种新颖的框架，能够同时估计这些参数以及相机位姿。解决方案的关键在于推导了三种情况下的高效求解器：1）两个已标定相机（calibrated cameras）；2）两个未标定相机（uncalibrated cameras）但具有未知但共享的焦距（focal length）；3）两个未标定相机且具有未知且不同的焦距。通过在合成数据和真实数据上的实验，包括使用11种不同的深度预测器生成的深度图，验证了该求解器的实际可行性。与现有工作相比，该求解器在两个大规模真实世界数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2501.07742
作者: Yaqing Ding,Václav Vávra,Viktor Kocur,Jian Yang,Torsten Sattler,Zuzana Kukelova
机构: Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague(捷克技术大学布拉格分校电气工程学院视觉识别组); Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava(布拉迪斯拉发夸美纽斯大学数学、物理与信息学院); PCA Lab, Nanjing University of Science and Technology, Nanjing, China(南京理工大学PCA实验室); Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague(捷克技术大学布拉格分校信息学、机器人与控制论研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages

点击查看摘要

Abstract:Recent advances in monocular depth prediction have led to significantly improved depth prediction accuracy. In turn, this enables various applications to use such depth predictions. In this paper, we propose a novel framework for estimating the relative pose between two cameras from point correspondences with associated monocular depths. Since depth predictions are typically defined up to an unknown scale and shift parameter, our solvers jointly estimate both scale and shift parameters together with the camera pose. We derive efficient solvers for three cases: (1) two calibrated cameras, (2) two uncalibrated cameras with an unknown but shared focal length, and (3) two uncalibrated cameras with unknown and different focal lengths. Experiments on synthetic and real data, including experiments with depth maps estimated by 11 different depth predictors, show the practical viability of our solvers. Compared to prior work, our solvers achieve state-of-the-art results on two large-scale, real-world datasets. The source code is available at this https URL
zh

[CV-73] Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

【速读】：该论文试图解决现代文本到图像生成模型（text-to-image generative models）中图像分词器（image tokenizer）训练困难的问题，以及现有模型依赖大规模、高质量私有数据集难以复现的挑战。论文提出的解决方案是引入一种高效且强大的图像分词器——Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok)。该分词器在解码阶段（即去分词化，de-tokenization）独特地整合了文本信息，从而加速了收敛并提升了性能。此外，TA-TiTok采用了一种简化但有效的一阶段训练过程，避免了传统一维分词器中复杂的二阶段蒸馏（two-stage distillation）需求，使其能够无缝扩展到大规模数据集。基于此，论文还提出了一系列仅使用开放数据训练的文本到图像掩码生成模型（Masked Generative Models, MaskGen），这些模型在性能上与基于私有数据训练的模型相当。通过发布TA-TiTok分词器和开放数据、开放权重的MaskGen模型，论文旨在推动文本到图像掩码生成模型的广泛访问和民主化。

链接: https://arxiv.org/abs/2501.07730
作者: Dongwon Kim,Ju He,Qihang Yu,Chenglin Yang,Xiaohui Shen,Suha Kwak,Liang-Chieh Chen
机构: ByteDance Seed; POSTECH
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL

点击查看摘要

Abstract:Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
zh

[CV-74] sting Human-Hand Segmentation on In-Distribution and Out-of-Distribution Data in Human-Robot Interactions Using a Deep Ensemble Model

【速读】：该论文试图解决在人类-机器人协作中，手部检测和分割（hand detection and segmentation）在分布外（out-of-distribution, OOD）场景下的可靠性问题。当前的研究主要集中在分布内（in-distribution, ID）数据上，而忽略了现实世界中常见的OOD场景，如快速移动的手部导致的运动模糊（motion blur）和手指交叉手势（finger-crossing gestures）等。论文提出了一种新颖的评估方法，通过在ID和OOD场景下评估预训练的深度学习模型（deep learning models）的性能，以更好地模拟真实的工业环境。

解决方案的关键在于设计了一个多样化的数据集，包含简单和复杂背景、不同数量的手部（0到4个）、戴手套和不戴手套的手部，以及OOD场景下的独特条件。此外，研究采用了多视角（multiple points of view, PoVs）的拍摄方式，结合了头戴式（egocentric）摄像头和静态摄像头，以捕捉不同视角下的RGB图像。在分割任务中，使用了基于UNet和RefineNet的深度集成模型（deep ensemble model），并通过分割指标和预测熵（predictive entropy）进行不确定性量化。结果表明，在工业数据集上训练的模型在OOD场景下表现出更好的泛化能力，强调了上下文特定训练的重要性。

链接: https://arxiv.org/abs/2501.07713
作者: Reza Jalayer,Yuxin Chen,Masoud Jalayer,Carlotta Orsenigo,Masayoshi Tomizuka
机构: Department of Management, Economics and Industrial Engineering, Politecnico di Milano (米兰理工大学); Department of Mechanical Engineering, University of California at Berkeley (加州大学伯克利分校); Department of Materials and Mechanical Engineering, University of Turku (图尔库大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Reliable detection and segmentation of human hands are critical for enhancing safety and facilitating advanced interactions in human-robot collaboration. Current research predominantly evaluates hand segmentation under in-distribution (ID) data, which reflects the training data of deep learning (DL) models. However, this approach fails to address out-of-distribution (OOD) scenarios that often arise in real-world human-robot interactions. In this study, we present a novel approach by evaluating the performance of pre-trained DL models under both ID data and more challenging OOD scenarios. To mimic realistic industrial scenarios, we designed a diverse dataset featuring simple and cluttered backgrounds with industrial tools, varying numbers of hands (0 to 4), and hands with and without gloves. For OOD scenarios, we incorporated unique and rare conditions such as finger-crossing gestures and motion blur from fast-moving hands, addressing both epistemic and aleatoric uncertainties. To ensure multiple point of views (PoVs), we utilized both egocentric cameras, mounted on the operator’s head, and static cameras to capture RGB images of human-robot interactions. This approach allowed us to account for multiple camera perspectives while also evaluating the performance of models trained on existing egocentric datasets as well as static-camera datasets. For segmentation, we used a deep ensemble model composed of UNet and RefineNet as base learners. Performance evaluation was conducted using segmentation metrics and uncertainty quantification via predictive entropy. Results revealed that models trained on industrial datasets outperformed those trained on non-industrial datasets, highlighting the importance of context-specific training. Although all models struggled with OOD scenarios, those trained on industrial datasets demonstrated significantly better generalization.
zh

[CV-75] Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights

【速读】：该论文试图解决行人轨迹预测（Pedestrian Trajectory Prediction）中的社会交互建模问题。现有方法依赖于预定义的规则，难以捕捉非显式的社会交互（implicit social interactions），导致预测精度受限。论文提出的解决方案关键是一种名为DTGAN的新框架，该框架将生成对抗网络（Generative Adversarial Networks, GANs）扩展应用于图序列数据（graph sequence data），旨在自动捕捉隐式社会交互并实现精确的行人轨迹预测。DTGAN通过在图中引入随机权重，消除了对预定义交互规则的依赖，并通过在对抗训练中探索多样化的任务损失函数（task loss functions），显著提升了性能，在ADE和FDE指标上分别提高了16.7%和39.3%。实验结果表明，DTGAN在理解行人意图和预测轨迹方面表现出色。

链接: https://arxiv.org/abs/2501.07711
作者: Jiajia Xie,Sheng Zhang,Beihao Xia,Zhu Xiao,Hongbo Jiang,Siwang Zhou,Zheng Qin,Hongyang Chen
机构: College of Computer Science and Electronic Engineering, Hunan University(湖南大学计算机与电子工程学院); Shenzhen Research Institute, Hunan University(湖南大学深圳研究院); School of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院); School of Electronic Information and Communication, Huazhong University of Science and Technology(华中科技大学电子信息与通信学院); Zhejiang Lab(之江实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 13 pages,7 figures,Accepted to IEEE Transactions on Multimedia (TMM)

点击查看摘要

Abstract:Pedestrian trajectory prediction is a critical technology in the evolution of self-driving cars toward complete artificial intelligence. Over recent years, focusing on the trajectories of pedestrians to model their social interactions has surged with great interest in more accurate trajectory predictions. However, existing methods for modeling pedestrian social interactions rely on pre-defined rules, struggling to capture non-explicit social interactions. In this work, we propose a novel framework named DTGAN, which extends the application of Generative Adversarial Networks (GANs) to graph sequence data, with the primary objective of automatically capturing implicit social interactions and achieving precise predictions of pedestrian trajectory. DTGAN innovatively incorporates random weights within each graph to eliminate the need for pre-defined interaction rules. We further enhance the performance of DTGAN by exploring diverse task loss functions during adversarial training, which yields improvements of 16.7% and 39.3% on metrics ADE and FDE, respectively. The effectiveness and accuracy of our framework are verified on two public datasets. The experimental results show that our proposed DTGAN achieves superior performance and is well able to understand pedestrians’ intentions.
zh

[CV-76] C2PD: Continuity-Constrained Pixelwise Deformation for Guided Depth Super-Resolution

【速读】：该论文试图解决引导深度超分辨率（Guided Depth Super-Resolution, GDSR）中深度图（depth map）连续性恢复不足的问题。现有方法通常将深度图视为图像，离散地计算阴影值，导致难以有效恢复深度图固有的连续性。为此，论文提出了一种新颖的方法，通过将GDSR问题转化为具有理想可塑性的粗糙物体的形变问题，最大化利用深度的空间特性并结合人类对现实世界物质的抽象感知。解决方案的关键在于设计了一种跨模态操作——连续性约束的非对称像素操作（Continuity-constrained Asymmetrical Pixelwise Operation, CAPO），该操作能够模拟外力作用下等体积柔性物体的形变过程。基于CAPO，进一步开发了像素级交叉梯度形变（Pixelwise Cross Gradient Deformation, PCGD），能够模拟理想塑性物体（无体积约束）的操作。该方法在四个广泛采用的GDSR基准测试中展示了最先进的性能，尤其在大规模任务和泛化能力方面具有显著优势。

链接: https://arxiv.org/abs/2501.07688
作者: Jiahui Kang,Qing Cai,Runqing Tan,Yimei Liu,Zhi Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Guided depth super-resolution (GDSR) has demonstrated impressive performance across a wide range of domains, with numerous methods being proposed. However, existing methods often treat depth maps as images, where shading values are computed discretely, making them struggle to effectively restore the continuity inherent in the depth map. In this paper, we propose a novel approach that maximizes the utilization of spatial characteristics in depth, coupled with human abstract perception of real-world substance, by transforming the GDSR issue into deformation of a roughcast with ideal plasticity, which can be deformed by force like a continuous object. Specifically, we firstly designed a cross-modal operation, Continuity-constrained Asymmetrical Pixelwise Operation (CAPO), which can mimic the process of deforming an isovolumetrically flexible object through external forces. Utilizing CAPO as the fundamental component, we develop the Pixelwise Cross Gradient Deformation (PCGD), which is capable of emulating operations on ideal plastic objects (without volume constraint). Notably, our approach demonstrates state-of-the-art performance across four widely adopted benchmarks for GDSR, with significant advantages in large-scale tasks and generalizability.
zh

[CV-77] Dataset Distillation as Pushforward Optimal Quantization

【速读】：该论文旨在解决数据集蒸馏（Dataset Distillation）问题，即通过生成一个合成训练集，使得在该合成数据上训练模型能够达到与在真实数据上训练相似的性能，同时大幅减少计算需求。现有方法主要分为两类：一类是基于双层优化（bi-level optimization）的方法，其中神经网络训练启发式算法作为下层问题；另一类是解耦方法（disentangled methods），通过匹配数据分布来绕过双层优化。解耦方法在速度和可扩展性方面具有显著优势，特别是在训练和蒸馏数据集规模较大时。论文提出，当解耦方法配备编码器-解码器结构时，可以将其重新表述为最优量化问题，即通过最小化预期投影距离来找到一组有限点以近似底层概率测度。具体而言，论文将现有解耦方法与经典的最优量化和Wasserstein重心问题联系起来，展示了蒸馏数据集在基于扩散的生成先验中的一致性。此外，论文还提出了对当前最先进的数据蒸馏方法D4M的简单扩展，在ImageNet-1K数据集上实现了更好的性能，并在更高图像每类设置下达到了最先进的性能。

链接: https://arxiv.org/abs/2501.07681
作者: Hong Ye Tan,Emma Slade
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude less computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter method has the major advantages of speed and scalability in terms of size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose a simple extension of the state-of-the-art data distillation method D4M, achieving better performance on the ImageNet-1K dataset with trivial additional computation, and state-of-the-art performance in higher image-per-class settings.
zh

[CV-78] BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

【速读】：该论文旨在解决现有视频生成模型在处理复杂文本提示和合成多个对象时面临的挑战，特别是缺乏对生成过程的精确控制。为了解决这一问题，论文提出了一种基于视觉基元（visual primitives）的视频分解方法，称为“blob video representation”，并通过开发一个名为BlobGEN-Vid的基于blob条件的视频扩散模型来实现对对象运动和细粒度外观的精确控制。解决方案的关键在于引入了两个创新模块：一是掩码3D注意力模块（masked 3D attention module），用于提高帧间区域一致性；二是可学习的文本嵌入插值模块，使用户能够控制特定帧的语义并实现平滑的对象过渡。此外，BlobGEN-Vid框架具有模型无关性，能够基于U-Net和DiT等不同架构的视频扩散模型进行构建。实验结果表明，BlobGEN-Vid在零样本视频生成能力和布局控制性方面均达到了最先进的水平，尤其是在与大型语言模型（LLM）结合进行布局规划时，其在组合准确性方面甚至优于专有的文本到视频生成器。

链接: https://arxiv.org/abs/2501.07647
作者: Weixi Feng,Chao Liu,Sifei Liu,William Yang Wang,Arash Vahdat,Weili Nie
机构: UC Santa Barbara; NVIDIA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives - blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models. Extensive experimental results show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.
zh

[CV-79] Impact of Data Breadth and Depth on Performance of Siamese Neural Network Model: Experiments with Three Keystroke Dynamic Datasets

【速读】：该论文旨在探讨数据集广度（dataset breadth，即受试者数量）和深度（dataset depth，即每个受试者的训练样本数量）对深度学习模型（如孪生神经网络，Siamese Neural Networks）在行为数据中捕捉复杂模式能力的影响。目前，这些影响通常被非正式假设，且缺乏深入的研究。论文通过使用“特征空间”和“密度”概念，对三个公开的击键数据集（Aalto、CMU和Clarkson II）进行了广泛的实验，以分析数据集广度和深度对模型性能的影响。实验结果表明，增加数据集广度有助于训练出能够有效捕捉受试者间差异的模型；而数据集深度的影响则取决于数据集的性质。自由文本数据集受样本数量、序列长度、训练三元组和样本库大小等因素的影响较大，可能导致模型欠拟合；而固定文本数据集对这些因素的敏感性较低，更容易训练出性能良好的模型。这些发现为行为生物识别领域深度学习模型的数据集设计提供了重要见解，并为设计更有效的认证系统提供了指导。

链接: https://arxiv.org/abs/2501.07600
作者: Ahmed Anu Wahab,Daqing Hou,Nadia Cheng,Parker Huntley,Charles Devlen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 19 pages, 4 figures

点击查看摘要

Abstract:Deep learning models, such as the Siamese Neural Networks (SNN), have shown great potential in capturing the intricate patterns in behavioral data. However, the impacts of dataset breadth (i.e., the number of subjects) and depth (e.g., the amount of training samples per subject) on the performance of these models is often informally assumed, and remains under-explored. To this end, we have conducted extensive experiments using the concepts of “feature space” and “density” to guide and gain deeper understanding on the impact of dataset breadth and depth on three publicly available keystroke datasets (Aalto, CMU and Clarkson II). Through varying the number of training subjects, number of samples per subject, amount of data in each sample, and number of triplets used in training, we found that when feasible, increasing dataset breadth enables the training of a well-trained model that effectively captures more inter-subject variability. In contrast, we find that the extent of depth’s impact from a dataset depends on the nature of the dataset. Free-text datasets are influenced by all three depth-wise factors; inadequate samples per subject, sequence length, training triplets and gallery sample size, which may all lead to an under-trained model. Fixed-text datasets are less affected by these factors, and as such make it easier to create a well-trained model. These findings shed light on the importance of dataset breadth and depth in training deep learning models for behavioral biometrics and provide valuable insights for designing more effective authentication systems.
zh

[CV-80] Spin-Weighted Spherical Harmonics for Polarized Light Transport

【速读】：该论文旨在解决在渲染过程中模拟偏振光（polarized light）与材料相互作用时的计算效率和连续性保持问题。传统方法在处理偏振光的复杂反射现象时，尤其是在频率域（frequency-domain）分析和存储复杂光相互作用方面存在不足。论文提出了一种新的方法，称为偏振球谐函数（polarized spherical harmonics, PSH），基于自旋加权球谐函数理论（spin-weighted spherical harmonics theory），以提供旋转不变的斯托克斯矢量场（Stokes vector fields）表示。此外，论文引入了基于PSH的频率域偏振渲染方程和球面卷积（spherical convolution）公式，使得偏振光传输的计算在频率域中几乎可以像逐项乘积一样高效进行。最终，该方法实现了首个实时偏振渲染技术，称为预计算偏振辐射传输（precomputed polarized radiance transfer），能够在复杂反射现象中有效且准确地模拟和再现偏振光相互作用。

链接: https://arxiv.org/abs/2501.07582
作者: Shinyoung Yi,Donggun Kim,Jiwoong Na,Xin Tong,Min H. Kim
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The objective of polarization rendering is to simulate the interaction of light with materials exhibiting polarization-dependent behavior. However, integrating polarization into rendering is challenging and increases computational costs significantly. The primary difficulty lies in efficiently modeling and computing the complex reflection phenomena associated with polarized light. Specifically, frequency-domain analysis, essential for efficient environment lighting and storage of complex light interactions, is lacking. To efficiently simulate and reproduce polarized light interactions using frequency-domain techniques, we address the challenge of maintaining continuity in polarized light transport represented by Stokes vectors within angular domains. The conventional spherical harmonics method cannot effectively handle continuity and rotation invariance for Stokes vectors. To overcome this, we develop a new method called polarized spherical harmonics (PSH) based on the spin-weighted spherical harmonics theory. Our method provides a rotation-invariant representation of Stokes vector fields. Furthermore, we introduce frequency domain formulations of polarized rendering equations and spherical convolution based on PSH. We first define spherical convolution on Stokes vector fields in the angular domain, and it also provides efficient computation of polarized light transport, nearly on an entry-wise product in the frequency domain. Our frequency domain formulation, including spherical convolution, led to the development of the first real-time polarization rendering technique under polarized environmental illumination, named precomputed polarized radiance transfer, using our polarized spherical harmonics. Results demonstrate that our method can effectively and accurately simulate and reproduce polarized light interactions in complex reflection phenomena.
zh

[CV-81] DM-Mamba: Dual-domain Multi-scale Mamba for MRI reconstruction

【速读】：该论文试图解决在加速磁共振成像（MRI）重建中，由于k空间（k-space）显著欠采样导致的病态逆问题。现有的深度神经网络（如卷积神经网络（CNNs）和视觉Transformer（ViT））虽然在该任务中表现出显著的性能提升，但面临全局感受野与计算效率之间的权衡问题。为此，论文探索了一种新的长程依赖建模范式——Mamba，旨在实现高效且有效的MRI重建。然而，直接将Mamba应用于MRI重建面临三个主要问题：(1) Mamba的行列扫描方式破坏了k空间的独特频谱特性，未能充分利用其在k空间学习中的潜力；(2) 现有Mamba方法通过多路径长扫描展开特征图，导致长程遗忘和高计算负担；(3) Mamba在处理空间变化内容时表现有限，导致局部表示的多样性不足。

为解决这些问题，论文提出了双域多尺度Mamba的解决方案，关键点包括：(1) 在k空间学习中引入视觉Mamba，并通过定制化的循环扫描方式展开频谱，增强k空间的全局建模能力；(2) 提出一种在图像和k空间域中均采用高效扫描策略的多尺度Mamba，缓解长程遗忘问题，并在效率和性能之间取得更好的平衡；(3) 开发局部多样性增强模块，提升Mamba对空间变化内容的表示能力。实验结果表明，该方法在多种欠采样模式下显著优于现有方法，且计算成本更低。

链接: https://arxiv.org/abs/2501.08163
作者: Yucong Meng,Zhiwei Yang,Zhijian Song,Yonghong Shi
机构: Digital Medical Research Center, School of Basic Medical Science, Fudan University (复旦大学基础医学院数字医学研究中心); Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention (上海医学图像计算与计算机辅助介入重点实验室); Academy of Engineering and Technology, Fudan University (复旦大学工程与应用技术研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The accelerated MRI reconstruction poses a challenging ill-posed inverse problem due to the significant undersampling in k-space. Deep neural networks, such as CNNs and ViT, have shown substantial performance improvements for this task while encountering the dilemma between global receptive fields and efficient computation. To this end, this paper pioneers exploring Mamba, a new paradigm for long-range dependency modeling with linear complexity, for efficient and effective MRI reconstruction. However, directly applying Mamba to MRI reconstruction faces three significant issues: (1) Mamba’s row-wise and column-wise scanning disrupts k-space’s unique spectrum, leaving its potential in k-space learning unexplored. (2) Existing Mamba methods unfold feature maps with multiple lengthy scanning paths, leading to long-range forgetting and high computational burden. (3) Mamba struggles with spatially-varying contents, resulting in limited diversity of local representations. To address these, we propose a dual-domain multi-scale Mamba for MRI reconstruction from the following perspectives: (1) We pioneer vision Mamba in k-space learning. A circular scanning is customized for spectrum unfolding, benefiting the global modeling of k-space. (2) We propose a multi-scale Mamba with an efficient scanning strategy in both image and k-space domains. It mitigates long-range forgetting and achieves a better trade-off between efficiency and performance. (3) We develop a local diversity enhancement module to improve the spatially-varying representation of Mamba. Extensive experiments are conducted on three public datasets for MRI reconstruction under various undersampling patterns. Comprehensive results demonstrate that our method significantly outperforms state-of-the-art methods with lower computational cost. Implementation code will be available at this https URL.
zh

[CV-82] CellOMaps: A Compact Representation for Robust Classification of Lung Adenocarcinoma Growth Patterns

【速读】：该论文旨在解决肺腺癌（LUAD）组织学生长模式分类中的主观性和观察者变异性问题。肺腺癌是一种形态学异质性疾病，具有五种主要组织学生长模式，这些模式与预后直接相关，因此其分类至关重要。然而，现有方法要么仅报告每张切片中的主要模式，要么缺乏适当的评估。论文提出了一种通用的机器学习流程，能够将肺组织分类为五种模式之一或非肿瘤组织。该解决方案的关键在于一种新颖的紧凑细胞组织图（cellOMaps）表示方法，该方法从苏木精和伊红（H&E）全切片图像（WSIs）中捕捉细胞的空间模式。该流程在内部未见切片和外部数据集上均表现出最先进的分类性能，显著优于现有方法。此外，初步结果表明，该模型的输出可用于预测患者的肿瘤突变负荷（TMB）水平。

链接: https://arxiv.org/abs/2501.08094
作者: Arwa Al-Rubaian,Gozde N. Gunesli,Wajd A. Althakfi,Ayesha Azam,David Snead,Nasir M. Rajpoot,Shan E Ahmed Raza
机构: Tissue Image Analytics Centre, Department of Computer Science, University of Warwick, UK(华威大学计算机科学系组织图像分析中心); Histopathology unit, department of Pathology, King Saud University, Riyadh, Kingdom of Saudi Arabia(沙特阿拉伯利雅得国王沙特大学病理学系组织病理学单位); Department of Histopathology, University Hospitals Coventry and Warwickshire NHS Trust, Coventry, UK(英国考文垂和华威大学医院NHS信托组织病理学系); Histofy Ltd, Coventry, UK(英国考文垂Histofy有限公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Lung adenocarcinoma (LUAD) is a morphologically heterogeneous disease, characterized by five primary histological growth patterns. The classification of such patterns is crucial due to their direct relation to prognosis but the high subjectivity and observer variability pose a major challenge. Although several studies have developed machine learning methods for growth pattern classification, they either only report the predominant pattern per slide or lack proper evaluation. We propose a generalizable machine learning pipeline capable of classifying lung tissue into one of the five patterns or as non-tumor. The proposed pipeline’s strength lies in a novel compact Cell Organization Maps (cellOMaps) representation that captures the cellular spatial patterns from Hematoxylin and Eosin whole slide images (WSIs). The proposed pipeline provides state-of-the-art performance on LUAD growth pattern classification when evaluated on both internal unseen slides and external datasets, significantly outperforming the current approaches. In addition, our preliminary results show that the model’s outputs can be used to predict patients Tumor Mutational Burden (TMB) levels.
zh

[CV-83] Early prediction of the transferability of bovine embryos from videomicroscopy

【速读】：该论文试图解决通过视频显微镜（videomicroscopy）结合机器学习技术，在体外受精的牛胚胎早期发育阶段预测其可移植性（transferability）的问题。研究的目标是在最多四天内，基于2D延时显微镜视频（2D time-lapse microscopy videos）预测胚胎是否适合移植。该问题被形式化为一个监督学习的二分类问题，类别为“可移植”和“不可移植”。研究面临的主要挑战包括：1）胚胎外观和运动特征的区分度低；2）类别模糊性；3）标注数据量有限。为解决这些问题，论文提出了一种基于3D卷积神经网络（3D convolutional neural network）的多尺度时间模型，该模型通过三条路径（three pathways）分别处理外观和运动特征，并采用焦点损失函数（focal loss）进行训练。该模型命名为SFR，实验结果表明其在处理这一具有挑战性的生物任务时表现出较高的有效性和准确性。

链接: https://arxiv.org/abs/2501.07945
作者: Yasmine Hachani(LACODAM),Patrick Bouthemy(SAIRPICO),Elisa Fromont(LACODAM),Sylvie Ruffini(UVSQ, INRAE),Ludivine Laffont(UVSQ, INRAE),Alline de Paula Reis(UVSQ, INRAE, ENVA)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: Accepted at the 2024 IEEE International Conference on Image Processing

点击查看摘要

Abstract:Videomicroscopy is a promising tool combined with machine learning for studying the early development of in vitro fertilized bovine embryos and assessing its transferability as soon as possible. We aim to predict the embryo transferability within four days at most, taking 2D time-lapse microscopy videos as input. We formulate this problem as a supervised binary classification problem for the classes transferable and not transferable. The challenges are three-fold: 1) poorly discriminating appearance and motion, 2) class ambiguity, 3) small amount of annotated data. We propose a 3D convolutional neural network involving three pathways, which makes it multi-scale in time and able to handle appearance and motion in different ways. For training, we retain the focal loss. Our model, named SFR, compares favorably to other methods. Experiments demonstrate its effectiveness and accuracy for our challenging biological task.
zh

[CV-84] An Intra- and Cross-frame Topological Consistency Scheme for Semi-supervised Atherosclerotic Coronary Plaque Segmentation ICASSP2025

【速读】：该论文旨在解决从CT血管造影（CTA）图像中精确分割冠状动脉粥样硬化斑块（coronary atherosclerotic plaques）的难题，这一任务对于高级冠状动脉粥样硬化分析（CAA）至关重要。由于斑块和血管的边界和结构不清晰，现有的深度学习模型表现不佳，且复杂数据的标注难度较大。为解决这些问题，论文提出了一种新颖的双一致性半监督框架（dual-consistency semi-supervised framework），该框架结合了帧内拓扑一致性（Intra-frame Topological Consistency, ITC）和跨帧拓扑一致性（Cross-frame Topological Consistency, CTC）。ITC通过双任务网络同时预测分割掩码和骨架感知距离变换（Skeleton-aware Distance Transform, SDT），利用一致性约束实现拓扑结构的相似预测，而无需额外标注。CTC则通过无监督估计器分析相邻帧之间骨架和边界的像素流，确保空间连续性。实验结果表明，该方法在CAA任务上超越了现有的半监督方法，并接近全监督方法的性能，同时在ACDC数据集上也表现出更好的泛化能力。

链接: https://arxiv.org/abs/2501.07850
作者: Ziheng Zhang,Zihan Li,Dandan Shan,Yuehui Qiu,Qingqi Hong,Qingqiang Wu
机构: Center for Digital Media Computing, School of Film, School of Informatics, Xiamen University (厦门大学数字媒体计算中心, 电影学院, 信息学院); University of Washington (华盛顿大学); Institute of Artificial Intelligence, Xiamen University (厦门大学人工智能研究院); National Institute for Data Science in Health and Medicine, Xiamen University (厦门大学健康与医学数据科学研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ICASSP 2025

点击查看摘要

Abstract:Enhancing the precision of segmenting coronary atherosclerotic plaques from CT Angiography (CTA) images is pivotal for advanced Coronary Atherosclerosis Analysis (CAA), which distinctively relies on the analysis of vessel cross-section images reconstructed via Curved Planar Reformation. This task presents significant challenges due to the indistinct boundaries and structures of plaques and blood vessels, leading to the inadequate performance of current deep learning models, compounded by the inherent difficulty in annotating such complex data. To address these issues, we propose a novel dual-consistency semi-supervised framework that integrates Intra-frame Topological Consistency (ITC) and Cross-frame Topological Consistency (CTC) to leverage labeled and unlabeled data. ITC employs a dual-task network for simultaneous segmentation mask and Skeleton-aware Distance Transform (SDT) prediction, achieving similar prediction of topology structure through consistency constraint without additional annotations. Meanwhile, CTC utilizes an unsupervised estimator for analyzing pixel flow between skeletons and boundaries of adjacent frames, ensuring spatial continuity. Experiments on two CTA datasets show that our method surpasses existing semi-supervised methods and approaches the performance of supervised methods on CAA. In addition, our method also performs better than other methods on the ACDC dataset, demonstrating its generalization.
zh

人工智能

[AI-0] ADAM-1: AI and Bioinformatics for Alzheimers Detection and Microbiome-Clinical Data Integrations

链接: https://arxiv.org/abs/2501.08324
作者: Ziyuan Huang,Vishaldeep Kaur Sekhon,Ouyang Guo,Mark Newman,Roozbeh Sadeghian,Maria L. Vaida,Cynthia Jo,Doyle Ward,Vanni Bucci,John P. Haran
类目: Artificial Intelligence (cs.AI)
*备注: 16 pages, 16 figures

点击查看摘要

Abstract:The Alzheimer’s Disease Analysis Model Generation 1 (ADAM) is a multi-agent large language model (LLM) framework designed to integrate and analyze multi-modal data, including microbiome profiles, clinical datasets, and external knowledge bases, to enhance the understanding and detection of Alzheimer’s disease (AD). By leveraging retrieval-augmented generation (RAG) techniques along with its multi-agent architecture, ADAM-1 synthesizes insights from diverse data sources and contextualizes findings using literature-driven evidence. Comparative evaluation against XGBoost revealed similar mean F1 scores but significantly reduced variance for ADAM-1, highlighting its robustness and consistency, particularly in small laboratory datasets. While currently tailored for binary classification tasks, future iterations aim to incorporate additional data modalities, such as neuroimaging and biomarkers, to broaden the scalability and applicability for Alzheimer’s research and diagnostics.

[AI-1] Polynomial Threshold Functions of Bounded Tree-Width: Some Explainability and Complexity Aspects

链接: https://arxiv.org/abs/2501.08297
作者: Karine Chubarian,Johnny Joyce,Gyorgy Turan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 22 pages, 3 figures. To be published in Festschrift in honor of Johann A. Makowsky

点击查看摘要

Abstract:The tree-width of a multivariate polynomial is the tree-width of the hypergraph with hyperedges corresponding to its terms. Multivariate polynomials of bounded tree-width have been studied by Makowsky and Meer as a new sparsity condition that allows for polynomial solvability of problems which are intractable in general. We consider a variation on this theme for Boolean variables. A representation of a Boolean function as the sign of a polynomial is called a polynomial threshold representation. We discuss Boolean functions representable as polynomial threshold functions of bounded tree-width and present two applications to Bayesian network classifiers, a probabilistic graphical model. Both applications are in Explainable Artificial Intelligence (XAI), the research area dealing with the black-box nature of many recent machine learning models. We also give a separation result between the representational power of positive and general polynomial threshold functions.

[AI-2] Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps ICSE2025 ICSE-2025

链接: https://arxiv.org/abs/2501.08243
作者: Kannan Parthasarathy,Karthik Vaidhyanathan,Rudra Dhar,Venkat Krishnamachari,Basil Muhammed,Adyansh Kakran,Sreemaee Akshathala,Shrikara Arun,Sumant Dubey,Mohan Veerubhotla,Amey Karan
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: The paper has been accepted as full paper to CAIN 2025 ( this https URL ), co-located with ICSE 2025 ( this https URL ). The paper was submitted to CAIN for review on 9 November 2024

点击查看摘要

Abstract:Cloud Operations (CloudOps) is a rapidly growing field focused on the automated management and optimization of cloud infrastructure which is essential for organizations navigating increasingly complex cloud environments. MontyCloud Inc. is one of the major companies in the CloudOps domain that leverages autonomous bots to manage cloud compliance, security, and continuous operations. To make the platform more accessible and effective to the customers, we leveraged the use of GenAI. Developing a GenAI-based solution for autonomous CloudOps for the existing MontyCloud system presented us with various challenges such as i) diverse data sources; ii) orchestration of multiple processes; and iii) handling complex workflows to automate routine tasks. To this end, we developed MOYA, a multi-agent framework that leverages GenAI and balances autonomy with the necessary human control. This framework integrates various internal and external systems and is optimized for factors like task orchestration, security, and error mitigation while producing accurate, reliable, and relevant insights by utilizing Retrieval Augmented Generation (RAG). Evaluations of our multi-agent system with the help of practitioners as well as using automated checks demonstrate enhanced accuracy, responsiveness, and effectiveness over non-agentic approaches across complex workflows. Comments: The paper has been accepted as full paper to CAIN 2025 (this https URL), co-located with ICSE 2025 (this https URL). The paper was submitted to CAIN for review on 9 November 2024 Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2501.08243 [cs.SE] (or arXiv:2501.08243v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2501.08243 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-3] Dynamic Pricing in High-Speed Railways Using Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2501.08234
作者: Enrique Adrian Villarrubia-Martin,Luis Rodriguez-Benitez,David Muñoz-Valero,Giovanni Montana,Luis Jimenez-Linares
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 37 pages, 5 figures

点击查看摘要

Abstract:This paper addresses a critical challenge in the high-speed passenger railway industry: designing effective dynamic pricing strategies in the context of competing and cooperating operators. To address this, a multi-agent reinforcement learning (MARL) framework based on a non-zero-sum Markov game is proposed, incorporating random utility models to capture passenger decision making. Unlike prior studies in areas such as energy, airlines, and mobile networks, dynamic pricing for railway systems using deep reinforcement learning has received limited attention. A key contribution of this paper is a parametrisable and versatile reinforcement learning simulator designed to model a variety of railway network configurations and demand patterns while enabling realistic, microscopic modelling of user behaviour, called RailPricing-RL. This environment supports the proposed MARL framework, which models heterogeneous agents competing to maximise individual profits while fostering cooperative behaviour to synchronise connecting services. Experimental results validate the framework, demonstrating how user preferences affect MARL performance and how pricing policies influence passenger choices, utility, and overall system dynamics. This study provides a foundation for advancing dynamic pricing strategies in railway systems, aligning profitability with system-wide efficiency, and supporting future research on optimising pricing policies.

[AI-4] Optimization of Link Configuration for Satellite Communication Using Reinforcement Learning

链接: https://arxiv.org/abs/2501.08220
作者: Tobias Rohe,Michael Kölle,Jan Matheis,Rüdiger Höpfl,Leo Sünkel,Claudia Linnhoff-Popien
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Satellite communication is a key technology in our modern connected world. With increasingly complex hardware, one challenge is to efficiently configure links (connections) on a satellite transponder. Planning an optimal link configuration is extremely complex and depends on many parameters and metrics. The optimal use of the limited resources, bandwidth and power of the transponder is crucial. Such an optimization problem can be approximated using metaheuristic methods such as simulated annealing, but recent research results also show that reinforcement learning can achieve comparable or even better performance in optimization methods. However, there have not yet been any studies on link configuration on satellite transponders. In order to close this research gap, a transponder environment was developed as part of this work. For this environment, the performance of the reinforcement learning algorithm PPO was compared with the metaheuristic simulated annealing in two experiments. The results show that Simulated Annealing delivers better results for this static problem than the PPO algorithm, however, the research in turn also underlines the potential of reinforcement learning for optimization problems.

[AI-5] Modeling Feature Maps for Quantum Machine Learning

链接: https://arxiv.org/abs/2501.08205
作者: Navneet Singh,Shiva Raj Pokhrel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quantum Machine Learning (QML) offers significant potential for complex tasks like genome sequence classification, but quantum noise on Noisy Intermediate-Scale Quantum (NISQ) devices poses practical challenges. This study systematically evaluates how various quantum noise models including dephasing, amplitude damping, depolarizing, thermal noise, bit-flip, and phase-flip affect key QML algorithms (QSVC, Peg-QSVC, QNN, VQC) and feature mapping techniques (ZFeatureMap, ZZFeatureMap, and PauliFeatureMap). Results indicate that QSVC is notably robust under noise, whereas Peg-QSVC and QNN are more sensitive, particularly to depolarizing and amplitude-damping noise. The PauliFeatureMap is especially vulnerable, highlighting difficulties in maintaining accurate classification under noisy conditions. These findings underscore the critical importance of feature map selection and noise mitigation strategies in optimizing QML for genomic classification, with promising implications for personalized medicine.

[AI-6] PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

链接: https://arxiv.org/abs/2501.08192
作者: Ahmet Caner Yüzügüler,Jiawei Zhuang,Lukas Cavigelli
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are widely used across various applications, but their substantial computational requirements pose significant challenges, particularly in terms of HBM bandwidth bottlenecks and inter-device communication overhead. In this paper, we present PRESERVE, a novel prefetching framework designed to optimize LLM inference by overlapping memory reads for model weights and KV-cache with collective communication operations. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a design space exploration that identifies the optimal hardware configuration for the proposed method, showing a further 1.25x improvement in performance per cost by selecting the optimal L2 cache size. Our results show that PRESERVE has the potential to mitigate the memory bottlenecks and communication overheads, offering a solution to improve the performance and scalability of the LLM inference systems.

[AI-7] Assessing AI Adoption and Digitalization in SMEs: A Framework for Implementation

链接: https://arxiv.org/abs/2501.08184
作者: Serena Proietti,Roberto Magnani
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The primary objective of this research is to examine the current state of digitalization and the integration of artificial intelligence (AI) within small and medium-sized enterprises (SMEs) in Italy. There is a significant gap between SMEs and large corporations in their use of AI, with SMEs facing numerous barriers to adoption. This study identifies critical drivers and obstacles to achieving intelligent transformation, proposing a framework model to address key challenges and provide actionable guidelines

[AI-8] LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking

链接: https://arxiv.org/abs/2501.08168
作者: Yukai Ma,Tiantian Wei,Naiting Zhong,Jianbiao Mei,Tao Hu,Licheng Wen,Xuemeng Yang,Botian Shi,Yong Liu
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this paper, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes - including appearance, motion patterns, and associated risks - LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module miming the human-driving learning process. The system consists of an Analytic Process (System-II) that accumulates driving experience through logical reasoning and a Heuristic Process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared to camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: this https URL.

[AI-9] I Can Find You in Seconds! Leverag ing Large Language Models for Code Authorship Attribution

链接: https://arxiv.org/abs/2501.08165
作者: Soohyeon Choi,Yong Kiam Tan,Mark Huasong Meng,Mohamed Ragab,Soumik Mondal,David Mohaisen,Khin Mi Mi Aung
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 12 pages, 5 figures,

点击查看摘要

Abstract:Source code authorship attribution is important in software forensics, plagiarism detection, and protecting software patch integrity. Existing techniques often rely on supervised machine learning, which struggles with generalization across different programming languages and coding styles due to the need for large labeled datasets. Inspired by recent advances in natural language authorship analysis using large language models (LLMs), which have shown exceptional performance without task-specific tuning, this paper explores the use of LLMs for source code authorship attribution. We present a comprehensive study demonstrating that state-of-the-art LLMs can successfully attribute source code authorship across different languages. LLMs can determine whether two code snippets are written by the same author with zero-shot prompting, achieving a Matthews Correlation Coefficient (MCC) of 0.78, and can attribute code authorship from a small set of reference code snippets via few-shot learning, achieving MCC of 0.77. Additionally, LLMs show some adversarial robustness against misattribution attacks. Despite these capabilities, we found that naive prompting of LLMs does not scale well with a large number of authors due to input token limitations. To address this, we propose a tournament-style approach for large-scale attribution. Evaluating this approach on datasets of C++ (500 authors, 26,355 samples) and Java (686 authors, 55,267 samples) code from GitHub, we achieve classification accuracy of up to 65% for C++ and 68.7% for Java using only one reference per author. These results open new possibilities for applying LLMs to code authorship attribution in cybersecurity and software engineering. Comments: 12 pages, 5 figures, Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2501.08165 [cs.SE] (or arXiv:2501.08165v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2501.08165 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-10] FairTTTS: A Tree Test Time Simulation Method for Fairness-Aware Classification

链接: https://arxiv.org/abs/2501.08155
作者: Nurit Cohen-Inger,Lior Rokach,Bracha Shapira,Seffi Cohen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Algorithmic decision-making has become deeply ingrained in many domains, yet biases in machine learning models can still produce discriminatory outcomes, often harming unprivileged groups. Achieving fair classification is inherently challenging, requiring a careful balance between predictive performance and ethical considerations. We present FairTTTS, a novel post-processing bias mitigation method inspired by the Tree Test Time Simulation (TTTS) method. Originally developed to enhance accuracy and robustness against adversarial inputs through probabilistic decision-path adjustments, TTTS serves as the foundation for FairTTTS. By building on this accuracy-enhancing technique, FairTTTS mitigates bias and improves predictive performance. FairTTTS uses a distance-based heuristic to adjust decisions at protected attribute nodes, ensuring fairness for unprivileged samples. This fairness-oriented adjustment occurs as a post-processing step, allowing FairTTTS to be applied to pre-trained models, diverse datasets, and various fairness metrics without retraining. Extensive evaluation on seven benchmark datasets shows that FairTTTS outperforms traditional methods in fairness improvement, achieving a 20.96% average increase over the baseline compared to 18.78% for related work, and further enhances accuracy by 0.55%. In contrast, competing methods typically reduce accuracy by 0.42%. These results confirm that FairTTTS effectively promotes more equitable decision-making while simultaneously improving predictive performance.

[AI-11] Multiple-Input Variational Auto-Encoder for Anomaly Detection in Heterogeneous Data

链接: https://arxiv.org/abs/2501.08149
作者: Phai Vu Dinh,Diep N. Nguyen,Dinh Thai Hoang,Quang Uy Nguyen,Eryk Dutkiewicz
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages

点击查看摘要

Abstract:Anomaly detection (AD) plays a pivotal role in AI applications, e.g., in classification, and intrusion/threat detection in cybersecurity. However, most existing methods face challenges of heterogeneity amongst feature subsets posed by non-independent and identically distributed (non-IID) data. We propose a novel neural network model called Multiple-Input Auto-Encoder for AD (MIAEAD) to address this. MIAEAD assigns an anomaly score to each feature subset of a data sample to indicate its likelihood of being an anomaly. This is done by using the reconstruction error of its sub-encoder as the anomaly score. All sub-encoders are then simultaneously trained using unsupervised learning to determine the anomaly scores of feature subsets. The final AUC of MIAEAD is calculated for each sub-dataset, and the maximum AUC obtained among the sub-datasets is selected. To leverage the modelling of the distribution of normal data to identify anomalies of the generative models, we develop a novel neural network architecture/model called Multiple-Input Variational Auto-Encoder (MIVAE). MIVAE can process feature subsets through its sub-encoders before learning distribution of normal data in the latent space. This allows MIVAE to identify anomalies that deviate from the learned distribution. We theoretically prove that the difference in the average anomaly score between normal samples and anomalies obtained by the proposed MIVAE is greater than that of the Variational Auto-Encoder (VAEAD), resulting in a higher AUC for MIVAE. Extensive experiments on eight real-world anomaly datasets demonstrate the superior performance of MIAEAD and MIVAE over conventional methods and the state-of-the-art unsupervised models, by up to 6% in terms of AUC score. Alternatively, MIAEAD and MIVAE have a high AUC when applied to feature subsets with low heterogeneity based on the coefficient of variation (CV) score.

[AI-12] Data-driven inventory management for new products: A warm-start and adjusted Dyna-Q approach

链接: https://arxiv.org/abs/2501.08109
作者: Xinyu Qu,Longxiao Liu,Wenjie Huang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:In this paper, we propose a novel reinforcement learning algorithm for inventory management of newly launched products with no or limited historical demand information. The algorithm follows the classic Dyna- Q structure, balancing the model-based and model-free approaches, while accelerating the training process of Dyna- Q and mitigating the model discrepancy generated by the model-based feedback. Warm-start information from the demand data of existing similar products can be incorporated into the algorithm to further stabilize the early-stage training and reduce the variance of the estimated optimal policy. Our approach is validated through a case study of bakery inventory management with real data. The adjusted Dyna- Q shows up to a 23.7% reduction in average daily cost compared with Q -learning, and up to a 77.5% reduction in training time within the same horizon compared with classic Dyna- Q . By incorporating the warm-start information, it can be found that the adjusted Dyna- Q has the lowest total cost, lowest variance in total cost, and relatively low shortage percentages among all the algorithms under a 30-day testing.

[AI-13] Hybrid Action Based Reinforcement Learning for Multi-Objective Compatible Autonomous Driving

链接: https://arxiv.org/abs/2501.08096
作者: Guizhe Jin,Zhuoren Li,Bo Leng,Wei Han,Lu Xiong,Chen Sun
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 12 pages, 9 figures, 5 tables

点击查看摘要

Abstract:Reinforcement Learning (RL) has shown excellent performance in solving decision-making and control problems of autonomous driving, which is increasingly applied in diverse driving scenarios. However, driving is a multi-attribute problem, leading to challenges in achieving multi-objective compatibility for current RL methods, especially in both policy execution and policy iteration. On the one hand, the common action space structure with single action type limits driving flexibility or results in large behavior fluctuations during policy execution. On the other hand, the multi-attribute weighted single reward function result in the agent’s disproportionate attention to certain objectives during policy iterations. To this end, we propose a Multi-objective Ensemble-Critic reinforcement learning method with Hybrid Parametrized Action for multi-objective compatible autonomous driving. Specifically, a parameterized action space is constructed to generate hybrid driving actions, combining both abstract guidance and concrete control commands. A multi-objective critics architecture is constructed considering multiple attribute rewards, to ensure simultaneously focusing on different driving objectives. Additionally, uncertainty-based exploration strategy is introduced to help the agent faster approach viable driving policy. The experimental results in both the simulated traffic environment and the HighD dataset demonstrate that our method can achieve multi-objective compatible autonomous driving in terms of driving efficiency, action consistency, and safety. It enhances the general performance of the driving while significantly increasing training efficiency.

[AI-14] Hierarchical Autoscaling for Large Language Model Serving with Chiron

链接: https://arxiv.org/abs/2501.08090
作者: Archit Patke,Dhemath Reddy,Saurabh Jha,Chandra Narayanaswami,Zbigniew Kalbarczyk,Ravishankar Iyer
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests that have tight SLOs in the order of seconds, and (b) batch requests that have relaxed SLO in the order of minutes to hours. These SLOs can degrade based on the arrival rates, multiplexing, and configuration parameters, thus necessitating the use of resource autoscaling on serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared to existing solutions.

[AI-15] NOMTO: Neural Operator-based symbolic Model approximaTion and discOvery

链接: https://arxiv.org/abs/2501.08086
作者: Sergei Garmaev,Siddhartha Mishra,Olga Fink
类目: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:While many physical and engineering processes are most effectively described by non-linear symbolic models, existing non-linear symbolic regression (SR) methods are restricted to a limited set of continuous algebraic functions, thereby limiting their applicability to discover higher order non-linear differential relations. In this work, we introduce the Neural Operator-based symbolic Model approximaTion and discOvery (NOMTO) method, a novel approach to symbolic model discovery that leverages Neural Operators to encompass a broad range of symbolic operations. We demonstrate that NOMTO can successfully identify symbolic expressions containing elementary functions with singularities, special functions, and derivatives. Additionally, our experiments demonstrate that NOMTO can accurately rediscover second-order non-linear partial differential equations. By broadening the set of symbolic operations available for discovery, NOMTO significantly advances the capabilities of existing SR methods. It provides a powerful and flexible tool for model discovery, capable of capturing complex relations in a variety of physical systems.

[AI-16] Artificial Liver Classifier: A New Alternative to Conventional Machine Learning Models

链接: https://arxiv.org/abs/2501.08074
作者: Mahmood A. Jumaah,Yossra H. Ali,Tarik A. Rashid
类目: Artificial Intelligence (cs.AI)
*备注: 21 pages

点击查看摘要

Abstract:Supervised machine learning classifiers often encounter challenges related to performance, accuracy, and overfitting. This paper introduces the Artificial Liver Classifier (ALC), a novel supervised learning classifier inspired by the human liver’s detoxification function. The ALC is characterized by its simplicity, speed, hyperparameters-free, ability to reduce overfitting, and effectiveness in addressing multi-classification problems through straightforward mathematical operations. To optimize the ALC’s parameters, an improved FOX optimization algorithm (IFOX) is employed as the training method. The proposed ALC was evaluated on five benchmark machine learning datasets: Iris Flower, Breast Cancer Wisconsin, Wine, Voice Gender, and MNIST. The results demonstrated competitive performance, with the ALC achieving 100% accuracy on the Iris dataset, surpassing logistic regression, multilayer perceptron, and support vector machine. Similarly, on the Breast Cancer dataset, it achieved 99.12% accuracy, outperforming XGBoost and logistic regression. Across all datasets, the ALC consistently exhibited lower overfitting gaps and loss compared to conventional classifiers. These findings highlight the potential of leveraging biological process simulations to develop efficient machine learning models and open new avenues for innovation in the field.

[AI-17] A Roadmap to Guide the Integration of LLM s in Hierarchical Planning AAAI

链接: https://arxiv.org/abs/2501.08068
作者: Israel Puerta-Merino,Carlos Núñez-Molina,Pablo Mesejo,Juan Fernández-Olivares
类目: Artificial Intelligence (cs.AI)
*备注: 5 pages, 0 figures, to be published in the AAAI Workshop on Planning in the Era of LLMs ( this https URL )

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) are fostering their integration into several reasoning-related fields, including Automated Planning (AP). However, their integration into Hierarchical Planning (HP), a subfield of AP that leverages hierarchical knowledge to enhance planning performance, remains largely unexplored. In this preliminary work, we propose a roadmap to address this gap and harness the potential of LLMs for HP. To this end, we present a taxonomy of integration methods, exploring how LLMs can be utilized within the HP life cycle. Additionally, we provide a benchmark with a standardized dataset for evaluating the performance of future LLM-based HP approaches, and present initial results for a state-of-the-art HP planner and LLM planner. As expected, the latter exhibits limited performance (3% correct plans, and none with a correct hierarchical decomposition) but serves as a valuable baseline for future approaches.

[AI-18] Building Symbiotic AI: Reviewing the AI Act for a Human-Centred Principle-Based Framework

链接: https://arxiv.org/abs/2501.08046
作者: Miriana Calvano(1),Antonio Curci(1),Giuseppe Desolda(1),Andrea Esposito(1),Rosa Lanzilotti(1),Antonio Piccinno(1) ((1) Department of Computer Science, University of Bari Aldo Moro, Bari, Italy)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: First version: 17 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Artificial Intelligence (AI) spreads quickly as new technologies and services take over modern society. The need to regulate AI design, development, and use is strictly necessary to avoid unethical and potentially dangerous consequences to humans. The European Union (EU) has released a new legal framework, the AI Act, to regulate AI by undertaking a risk-based approach to safeguard humans during interaction. At the same time, researchers offer a new perspective on AI systems, commonly known as Human-Centred AI (HCAI), highlighting the need for a human-centred approach to their design. In this context, Symbiotic AI (a subtype of HCAI) promises to enhance human capabilities through a deeper and continuous collaboration between human intelligence and AI. This article presents the results of a Systematic Literature Review (SLR) that aims to identify principles that characterise the design and development of Symbiotic AI systems while considering humans as the core of the process. Through content analysis, four principles emerged from the review that must be applied to create Human-Centred AI systems that can establish a symbiotic relationship with humans. In addition, current trends and challenges were defined to indicate open questions that may guide future research for the development of SAI systems that comply with the AI Act.

[AI-19] Cooperative Patrol Routing: Optimizing Urban Crime Surveillance through Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2501.08020
作者: Juan Palma-Borda,Eduardo Guzmán,María-Victoria Belmonte
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The effective design of patrol strategies is a difficult and complex problem, especially in medium and large areas. The objective is to plan, in a coordinated manner, the optimal routes for a set of patrols in a given area, in order to achieve maximum coverage of the area, while also trying to minimize the number of patrols. In this paper, we propose a multi-agent reinforcement learning (MARL) model, based on a decentralized partially observable Markov decision process, to plan unpredictable patrol routes within an urban environment represented as an undirected graph. The model attempts to maximize a target function that characterizes the environment within a given time frame. Our model has been tested to optimize police patrol routes in three medium-sized districts of the city of Malaga. The aim was to maximize surveillance coverage of the most crime-prone areas, based on actual crime data in the city. To address this problem, several MARL algorithms have been studied, and among these the Value Decomposition Proximal Policy Optimization (VDPPO) algorithm exhibited the best performance. We also introduce a novel metric, the coverage index, for the evaluation of the coverage performance of the routes generated by our model. This metric is inspired by the predictive accuracy index (PAI), which is commonly used in criminology to detect hotspots. Using this metric, we have evaluated the model under various scenarios in which the number of agents (or patrols), their starting positions, and the level of information they can observe in the environment have been modified. Results show that the coordinated routes generated by our model achieve a coverage of more than 90% of the 3% of graph nodes with the highest crime incidence, and 65% for 20% of these nodes; 3% and 20% represent the coverage standards for police resource allocation.

[AI-20] An AI-driven framework for rapid and localized optimizations of urban open spaces

链接: https://arxiv.org/abs/2501.08019
作者: Pegah Eshraghi,Arman Nikkhah Dehnavi,Maedeh Mirdamadi,Riccardo Talami,Zahra-Sadat Zomorodian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 36 pages

点击查看摘要

Abstract:As urbanization accelerates, open spaces are increasingly recognized for their role in enhancing sustainability and well-being, yet they remain underexplored compared to built spaces. This study introduces an AI-driven framework that integrates machine learning models (MLMs) and explainable AI techniques to optimize Sky View Factor (SVF) and visibility, key spatial metrics influencing thermal comfort and perceived safety in urban spaces. Unlike global optimization methods, which are computationally intensive and impractical for localized adjustments, this framework supports incremental design improvements with lower computational costs and greater flexibility. The framework employs SHapley Adaptive Explanations (SHAP) to analyze feature importance and Counterfactual Explanations (CFXs) to propose minimal design changes. Simulations tested five MLMs, identifying XGBoost as the most accurate, with building width, park area, and heights of surrounding buildings as critical for SVF, and distances from southern buildings as key for visibility. Compared to Genetic Algorithms, which required approximately 15/30 minutes across 3/4 generations to converge, the tested CFX approach achieved optimized results in 1 minute with a 5% RMSE error, demonstrating significantly faster performance and suitability for scalable retrofitting strategies. This interpretable and computationally efficient framework advances urban performance optimization, providing data-driven insights and practical retrofitting solutions for enhancing usability and environmental quality across diverse urban contexts.

[AI-21] GDiffRetro: Retrosynthesis Prediction with Dual Graph Enhanced Molecular Representation and Diffusion Generation

链接: https://arxiv.org/abs/2501.08001
作者: Shengyin Sun,Wenhao Yu,Yuxiang Ren,Weitao Du,Liwei Liu,Xuecang Zhang,Ying Hu,Chen Ma
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrosynthesis prediction focuses on identifying reactants capable of synthesizing a target product. Typically, the retrosynthesis prediction involves two phases: Reaction Center Identification and Reactant Generation. However, we argue that most existing methods suffer from two limitations in the two phases: (i) Existing models do not adequately capture the ``face’’ information in molecular graphs for the reaction center identification. (ii) Current approaches for the reactant generation predominantly use sequence generation in a 2D space, which lacks versatility in generating reasonable distributions for completed reactive groups and overlooks molecules’ inherent 3D properties. To overcome the above limitations, we propose GDiffRetro. For the reaction center identification, GDiffRetro uniquely integrates the original graph with its corresponding dual graph to represent molecular structures, which helps guide the model to focus more on the faces in the graph. For the reactant generation, GDiffRetro employs a conditional diffusion model in 3D to further transform the obtained synthon into a complete reactant. Our experimental findings reveal that GDiffRetro outperforms state-of-the-art semi-template models across various evaluative metrics.

[AI-22] LLM -Ehnanced Holonic Architecture for Ad-Hoc Scalable SoS

链接: https://arxiv.org/abs/2501.07992
作者: Muhammad Ashfaq,Ahmed R. Sadik,Tommi Mikkonen,Muhammad Waseem,Niko Mäkitalo
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:As modern system of systems (SoS) become increasingly adaptive and human centred, traditional architectures often struggle to support interoperability, reconfigurability, and effective human system interaction. This paper addresses these challenges by advancing the state of the art holonic architecture for SoS, offering two main contributions to support these adaptive needs. First, we propose a layered architecture for holons, which includes reasoning, communication, and capabilities layers. This design facilitates seamless interoperability among heterogeneous constituent systems by improving data exchange and integration. Second, inspired by principles of intelligent manufacturing, we introduce specialised holons namely, supervisor, planner, task, and resource holons aimed at enhancing the adaptability and reconfigurability of SoS. These specialised holons utilise large language models within their reasoning layers to support decision making and ensure real time adaptability. We demonstrate our approach through a 3D mobility case study focused on smart city transportation, showcasing its potential for managing complex, multimodal SoS environments. Additionally, we propose evaluation methods to assess the architecture efficiency and scalability,laying the groundwork for future empirical validations through simulations and real world implementations.

[AI-23] Comprehensive Metapath-based Heterogeneous Graph Transformer for Gene-Disease Association Prediction

链接: https://arxiv.org/abs/2501.07970
作者: Wentao Cui,Shoubo Li,Chen Fang,Qingqing Long,Chengrui Wang,Xuezhi Wang,Yuanchun Zhou
类目: Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:Discovering gene-disease associations is crucial for understanding disease mechanisms, yet identifying these associations remains challenging due to the time and cost of biological experiments. Computational methods are increasingly vital for efficient and scalable gene-disease association prediction. Graph-based learning models, which leverage node features and network relationships, are commonly employed for biomolecular predictions. However, existing methods often struggle to effectively integrate node features, heterogeneous structures, and semantic information. To address these challenges, we propose COmprehensive MEtapath-based heterogeneous graph Transformer(COMET) for predicting gene-disease associations. COMET integrates diverse datasets to construct comprehensive heterogeneous networks, initializing node features with BioGPT. We define seven Metapaths and utilize a transformer framework to aggregate Metapath instances, capturing global contexts and long-distance dependencies. Through intra- and inter-metapath aggregation using attention mechanisms, COMET fuses latent vectors from multiple Metapaths to enhance GDA prediction accuracy. Our method demonstrates superior robustness compared to state-of-the-art approaches. Ablation studies and visualizations validate COMET’s effectiveness, providing valuable insights for advancing human health research.

[AI-24] Derivation of Output Correlation Inferences for Multi-Output (aka Multi-Task) Gaussian Process

链接: https://arxiv.org/abs/2501.07964
作者: Shuhei Watanabe
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Gaussian process (GP) is arguably one of the most widely used machine learning algorithms in practice. One of its prominent applications is Bayesian optimization (BO). Although the vanilla GP itself is already a powerful tool for BO, it is often beneficial to be able to consider the dependencies of multiple outputs. To do so, Multi-task GP (MTGP) is formulated, but it is not trivial to fully understand the derivations of its formulations and their gradients from the previous literature. This paper serves friendly derivations of the MTGP formulations and their gradients.

[AI-25] Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

链接: https://arxiv.org/abs/2501.07959
作者: Jiaqi Hua,Wanxu Wei
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, several works have been conducted on jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. (2024) focuses on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search. Nevertheless, this method lacks generality since it specifies the instruction-response structure. Moreover, the reason why inserting special tokens takes effect in inducing harmful behaviors is only empirically discussed. In this paper, we take a deeper insight into the mechanism of special token injection and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ) facilitated with the demo-level greedy search. This framework decomposes the FSJ attack into pattern and behavior learning to exploit the model’s vulnerabilities in a more generalized and efficient way. We conduct elaborate experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at this https URL.

[AI-26] Advice for Diabetes Self-Management by ChatGPT Models: Challenges and Recommendations

链接: https://arxiv.org/abs/2501.07931
作者: Waqar Hussain,John Grundy
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Given their ability for advanced reasoning, extensive contextual understanding, and robust question-answering abilities, large language models have become prominent in healthcare management research. Despite adeptly handling a broad spectrum of healthcare inquiries, these models face significant challenges in delivering accurate and practical advice for chronic conditions such as diabetes. We evaluate the responses of ChatGPT versions 3.5 and 4 to diabetes patient queries, assessing their depth of medical knowledge and their capacity to deliver personalized, context-specific advice for diabetes self-management. Our findings reveal discrepancies in accuracy and embedded biases, emphasizing the models’ limitations in providing tailored advice unless activated by sophisticated prompting techniques. Additionally, we observe that both models often provide advice without seeking necessary clarification, a practice that can result in potentially dangerous advice. This underscores the limited practical effectiveness of these models without human oversight in clinical settings. To address these issues, we propose a commonsense evaluation layer for prompt evaluation and incorporating disease-specific external memory using an advanced Retrieval Augmented Generation technique. This approach aims to improve information quality and reduce misinformation risks, contributing to more reliable AI applications in healthcare settings. Our findings seek to influence the future direction of AI in healthcare, enhancing both the scope and quality of its integration.

[AI-27] An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures

链接: https://arxiv.org/abs/2501.07930
作者: Thibaut Boissin(IRIT, ANITI),Franck Mamalet,Thomas Fel,Agustin Martin Picard,Thomas Massena(IRIT),Mathieu Serrurier(IRIT, ANITI)
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Orthogonal convolutional layers are the workhorse of multiple areas in machine learning, such as adversarial robustness, normalizing flows, GANs, and Lipschitzconstrained models. Their ability to preserve norms and ensure stable gradient propagation makes them valuable for a large range of problems. Despite their promise, the deployment of orthogonal convolution in large-scale applications is a significant challenge due to computational overhead and limited support for modern features like strides, dilations, group convolutions, and transposed this http URL this paper, we introduce AOC (Adaptative Orthogonal Convolution), a scalable method for constructing orthogonal convolutions, effectively overcoming these limitations. This advancement unlocks the construction of architectures that were previously considered impractical. We demonstrate through our experiments that our method produces expressive models that become increasingly efficient as they scale. To foster further advancement, we provide an open-source library implementing this method, available at this https URL.

[AI-28] Large Language Model Interface for Home Energy Management Systems

链接: https://arxiv.org/abs/2501.07919
作者: François Michelon,Yihong Zhou,Thomas Morstyn
类目: Artificial Intelligence (cs.AI)
*备注: 13 pages conference paper

点击查看摘要

Abstract:Home Energy Management Systems (HEMSs) help households tailor their electricity usage based on power system signals such as energy prices. This technology helps to reduce energy bills and offers greater demand-side flexibility that supports the power system stability. However, residents who lack a technical background may find it difficult to use HEMSs effectively, because HEMSs require well-formatted parameterization that reflects the characteristics of the energy resources, houses, and users’ needs. Recently, Large-Language Models (LLMs) have demonstrated an outstanding ability in language understanding. Motivated by this, we propose an LLM-based interface that interacts with users to understand and parameterize their ``badly-formatted answers’', and then outputs well-formatted parameters to implement an HEMS. We further use Reason and Act method (ReAct) and few-shot prompting to enhance the LLM performance. Evaluating the interface performance requires multiple user–LLM interactions. To avoid the efforts in finding volunteer users and reduce the evaluation time, we additionally propose a method that uses another LLM to simulate users with varying expertise, ranging from knowledgeable to non-technical. By comprehensive evaluation, the proposed LLM-based HEMS interface achieves an average parameter retrieval accuracy of 88%, outperforming benchmark models without ReAct and/or few-shot prompting.

[AI-29] Governing AI Agents

链接: https://arxiv.org/abs/2501.07913
作者: Noam Kolt
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The field of AI is undergoing a fundamental transition from systems that can produce synthetic content upon request to autonomous agents that can plan and execute complex tasks with only limited human involvement. Companies that pioneered the development of generative AI tools are now building AI agents that can be instructed to independently navigate the internet, perform a wide range of online tasks, and serve as artificial personal assistants and virtual coworkers. The opportunities presented by this new technology are tremendous, as are the associated risks. Fortunately, there exist robust analytic frameworks for confronting many of these challenges, namely, the economic theory of principal-agent problems and the common law doctrine of agency relationships. Drawing on these frameworks, this Article makes three contributions. First, it uses agency law and theory to identify and characterize problems arising from AI agents, including issues of information asymmetry, discretionary authority, and loyalty. Second, it illustrates the limitations of conventional solutions to agency problems: incentive design, monitoring, and enforcement might not be effective for governing AI agents that make uninterpretable decisions and operate at unprecedented speed and scale. Third, the Article explores the implications of agency law and theory for designing and regulating AI agents, arguing that new technical and legal infrastructure is needed to support governance principles of inclusivity, visibility, and liability.

[AI-30] Deep Learning and Natural Language Processing in the Field of Construction

链接: https://arxiv.org/abs/2501.07911
作者: Rémy Kessler(LIA),Nicolas Béchet(IRISA, EXPRESSION, UBS Vannes)
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This article presents a complete process to extract hypernym relationships in the field of construction using two main steps: terminology extraction and detection of hypernyms from these terms. We first describe the corpus analysis method to extract terminology from a collection of technical specifications in the field of construction. Using statistics and word n-grams analysis, we extract the domain’s terminology and then perform pruning steps with linguistic patterns and internet queries to improve the quality of the final terminology. Second, we present a machine-learning approach based on various words embedding models and combinations to deal with the detection of hypernyms from the extracted terminology. Extracted terminology is evaluated using a manual evaluation carried out by 6 experts in the domain, and the hypernym identification method is evaluated with different datasets. The global approach provides relevant and promising results.

[AI-31] Logarithmic Memory Networks (LMNs): Efficient Long-Range Sequence Modeling for Resource-Constrained Environments

链接: https://arxiv.org/abs/2501.07905
作者: Mohamed A. Taha
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 18 pages, 10 figures

点击查看摘要

Abstract:Long-range sequence modeling is a crucial aspect of natural language processing and time series analysis. However, traditional models like Recurrent Neural Networks (RNNs) and Transformers suffer from computational and memory inefficiencies, especially when dealing with long sequences. This paper introduces Logarithmic Memory Networks (LMNs), a novel architecture that leverages a hierarchical logarithmic tree structure to efficiently store and retrieve past information. LMNs dynamically summarize historical context, significantly reducing the memory footprint and computational complexity of attention mechanisms from O(n2) to O(log(n)). The model employs a single-vector, targeted attention mechanism to access stored information, and the memory block construction worker (summarizer) layer operates in two modes: a parallel execution mode during training for efficient processing of hierarchical tree structures and a sequential execution mode during inference, which acts as a memory management system. It also implicitly encodes positional information, eliminating the need for explicit positional encodings. These features make LMNs a robust and scalable solution for processing long-range sequences in resource-constrained environments, offering practical improvements in efficiency and scalability. The code is publicly available under the MIT License on GitHub: this https URL.

[AI-32] Optimal Classification Trees for Continuous Feature Data Using Dynamic Programming with Branch-and-Bound AAAI-25

链接: https://arxiv.org/abs/2501.07903
作者: Catalin E. Brita,Jacobus G. M. van der Linden,Emir Demirović
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
*备注: In the proceedings of AAAI-25

点击查看摘要

Abstract:Computing an optimal classification tree that provably maximizes training performance within a given size limit, is NP-hard, and in practice, most state-of-the-art methods do not scale beyond computing optimal trees of depth three. Therefore, most methods rely on a coarse binarization of continuous features to maintain scalability. We propose a novel algorithm that optimizes trees directly on the continuous feature data using dynamic programming with branch-and-bound. We develop new pruning techniques that eliminate many sub-optimal splits in the search when similar to previously computed splits and we provide an efficient subroutine for computing optimal depth-two trees. Our experiments demonstrate that these techniques improve runtime by one or more orders of magnitude over state-of-the-art optimal methods and improve test accuracy by 5% over greedy heuristics.

[AI-33] Anytime Cooperative Implicit Hitting Set Solving

链接: https://arxiv.org/abs/2501.07896
作者: Emma Rollón,Javier Larrosa,Aleksandra Petrova
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Implicit Hitting Set (HS) approach has shown to be very effective for MaxSAT, Pseudo-boolean optimization and other boolean frameworks. Very recently, it has also shown its potential in the very similar Weighted CSP framework by means of the so-called cost-function merging. The original formulation of the HS approach focuses on obtaining increasingly better lower bounds (HS-lb). However, and as shown for Pseudo-Boolean Optimization, this approach can also be adapted to compute increasingly better upper bounds (HS-ub). In this paper we consider both HS approaches and show how they can be easily combined in a multithread architecture where cores discovered by either component are available by the other which, interestingly, generates synergy between them. We show that the resulting algorithm (HS-lub) is consistently superior to either HS-lb and HS-ub in isolation. Most importantly, HS-lub has an effective anytime behaviour with which the optimality gap is reduced during the execution. We tested our approach on the Weighted CSP framework and show on three different benchmarks that our very simple implementation sometimes outperforms the parallel hybrid best-first search implementation of the far more developed state-of-the-art Toulbar2.

[AI-34] Leverag ing Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLM s

链接: https://arxiv.org/abs/2501.07892
作者: Shuai Wang,Liang Ding,Yibing Zhan,Yong Luo,Zheng He,Dapeng Tao
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 11 pages,6 figures

点击查看摘要

Abstract:Automated code generation using large language models (LLMs) has gained attention due to its efficiency and adaptability. However, real-world coding tasks or benchmarks like HumanEval and StudentEval often lack dedicated training datasets, challenging existing few-shot prompting approaches that rely on reference examples. Inspired by human metamemory-a cognitive process involving recall and evaluation-we present a novel framework (namely M^2WF) for improving LLMs’ one-time code generation. This approach enables LLMs to autonomously generate, evaluate, and utilize synthetic examples to enhance reliability and performance. Unlike prior methods, it minimizes dependency on curated data and adapts flexibly to various coding scenarios. Our experiments demonstrate significant improvements in coding benchmarks, offering a scalable and robust solution for data-free environments. The code and framework will be publicly available on GitHub and HuggingFace.

[AI-35] Hierarchical Repository-Level Code Summarization for Business Applications Using Local LLM s ICSE2025

链接: https://arxiv.org/abs/2501.07857
作者: Nilesh Dhulshette,Sapan Shah,Vinay Kulkarni
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: To appear at LLM4Code@ICSE 2025

点击查看摘要

Abstract:In large-scale software development, understanding the functionality and intent behind complex codebases is critical for effective development and maintenance. While code summarization has been widely studied, existing methods primarily focus on smaller code units, such as functions, and struggle with larger code artifacts like files and packages. Additionally, current summarization models tend to emphasize low-level implementation details, often overlooking the domain and business context that are crucial for real-world applications. This paper proposes a two-step hierarchical approach for repository-level code summarization, tailored to business applications. First, smaller code units such as functions and variables are identified using syntax analysis and summarized with local LLMs. These summaries are then aggregated to generate higher-level file and package summaries. To ensure the summaries are grounded in business context, we design custom prompts that capture the intended purpose of code artifacts based on the domain and problem context of the business application. We evaluate our approach on a business support system (BSS) for the telecommunications domain, showing that syntax analysis-based hierarchical summarization improves coverage, while business-context grounding enhances the relevance of the generated summaries.

[AI-36] Unveiling Provider Bias in Large Language Models for Code Generation

链接: https://arxiv.org/abs/2501.07849
作者: Xiaoyu Zhang,Juan Zhai,Shiqing Ma,Qingshuang Bao,Weipeng Jiang,Chao Shen,Yang Liu
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 21 pages, 15 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as the new recommendation engines, outperforming traditional methods in both capability and scope, particularly in code generation applications. Our research reveals a novel provider bias in LLMs, namely without explicit input prompts, these models show systematic preferences for services from specific providers in their recommendations (e.g., favoring Google Cloud over Microsoft Azure). This bias holds significant implications for market dynamics and societal equilibrium, potentially promoting digital monopolies. It may also deceive users and violate their expectations, leading to various consequences. This paper presents the first comprehensive empirical study of provider bias in LLM code generation. We develop a systematic methodology encompassing an automated pipeline for dataset generation, incorporating 6 distinct coding task categories and 30 real-world application scenarios. Our analysis encompasses over 600,000 LLM-generated responses across seven state-of-the-art models, utilizing approximately 500 million tokens (equivalent to \ 5,000+ in computational costs). The study evaluates both the generated code snippets and their embedded service provider selections to quantify provider bias. Additionally, we conduct a comparative analysis of seven debiasing prompting techniques to assess their efficacy in mitigating these biases. Our findings demonstrate that LLMs exhibit significant provider preferences, predominantly favoring services from Google and Amazon, and can autonomously modify input code to incorporate their preferred providers without users’ requests. Notably, we observe discrepancies between providers recommended in conversational contexts versus those implemented in generated code. The complete dataset and analysis results are available in our repository.

[AI-37] A Driver Advisory System Based on Large Language Model for High-speed Train

链接: https://arxiv.org/abs/2501.07837
作者: Y.C. Luo,J. Xun,W. Wang,R.Z. Zhang,Z.C. Zhao
类目: Artificial Intelligence (cs.AI)
*备注: 18 pages, 7 figures, presented at 104th TRB Annual Meeting

点击查看摘要

Abstract:With the rapid development of China high-speed railway, drivers face increasingly significant technical challenges during operations, such as fault handling. Currently, drivers depend on the onboard mechanic when facing technical issues, for instance, traction loss or sensor faults. This dependency can hinder effective operation, even lead to accidents, while waiting for faults to be addressed. To enhance the accuracy and explainability of actions during fault handling, an Intelligent Driver Advisory System (IDAS) framework based on a large language model (LLM) named IDAS-LLM, is introduced. Initially, domain-fine-tuning of the LLM is performed using a constructed railway knowledge question-and-answer dataset to improve answer accuracy in railway-related questions. Subsequently, integration of the Retrieval-augmented Generation (RAG) architecture is pursued for system design to enhance the explainability of generated responses. Comparative experiments are conducted using the constructed railway driving knowledge assessment dataset. Results indicate that domain-fine-tuned LLMs show an improvement in answer accuracy by an average of 10%, outperforming some current mainstream LLMs. Additionally, the inclusion of the RAG framework increases the average recall rate of question-and-answer sessions by about 4%. Finally, the fault handling capability of IDAS-LLM is demonstrated through simulations of real operational scenarios, proving that the proposed framework has practical application prospects.

[AI-38] Flow: A Modular Approach to Automated Agent ic Workflow Generation

链接: https://arxiv.org/abs/2501.07834
作者: Boye Niu,Yiliao Song,Kai Lian,Yifan Shen,Yu Yao,Kun Zhang,Tongliang Liu
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Multi-agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution. However, the effective adjustment of Agentic workflows during execution has not been well-studied. A effective workflow adjustment is crucial, as in many real-world scenarios, the initial plan must adjust to unforeseen challenges and changing conditions in real-time to ensure the efficient execution of complex tasks. In this paper, we define workflows as an activity-on-vertex (AOV) graphs. We continuously refine the workflow by dynamically adjusting task allocations based on historical performance and previous AOV with LLM agents. To further enhance system performance, we emphasize modularity in workflow design based on measuring parallelism and dependence complexity. Our proposed multi-agent framework achieved efficient sub-task concurrent execution, goal achievement, and error tolerance. Empirical results across different practical tasks demonstrate dramatic improvements in the efficiency of multi-agent frameworks through dynamic workflow updating and modularization.

[AI-39] STTS-EAD: Improving Spatio-Temporal Learning Based Time Series Prediction via

链接: https://arxiv.org/abs/2501.07814
作者: Yuanyuan Liang,Tianhao Zhang,Tingyu Xie
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages

点击查看摘要

Abstract:Handling anomalies is a critical preprocessing step in multivariate time series prediction. However, existing approaches that separate anomaly preprocessing from model training for multivariate time series prediction encounter significant limitations. Specifically, these methods fail to utilize auxiliary information crucial for identifying latent anomalies associated with spatiotemporal factors during the preprocessing stage. Instead, they rely solely on data distribution for anomaly detection, which can result in the incorrect processing of numerous samples that could otherwise contribute positively to model training. To address this, we propose STTS-EAD, an end-to-end method that seamlessly integrates anomaly detection into the training process of multivariate time series forecasting and aims to improve Spatio-Temporal learning based Time Series prediction via Embedded Anomaly Detection. Our proposed STTS-EAD leverages spatio-temporal information for forecasting and anomaly detection, with the two parts alternately executed and optimized for each other. To the best of our knowledge, STTS-EAD is the first to integrate anomaly detection and forecasting tasks in the training phase for improving the accuracy of multivariate time series forecasting. Extensive experiments on a public stock dataset and two real-world sales datasets from a renowned coffee chain enterprise show that our proposed method can effectively process detected anomalies in the training stage to improve forecasting performance in the inference stage and significantly outperform baselines.

[AI-40] Conformal mapping Coordinates Physics-Informed Neural Networks (CoCo-PINNs): learning neural networks for designing neutral inclusions

链接: https://arxiv.org/abs/2501.07809
作者: Daehee Cho,Hyeonmin Yun,Jaeyong Lee,Mikyoung Lim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP)
*备注:

点击查看摘要

Abstract:We focus on designing and solving the neutral inclusion problem via neural networks. The neutral inclusion problem has a long history in the theory of composite materials, and it is exceedingly challenging to identify the precise condition that precipitates a general-shaped inclusion into a neutral inclusion. Physics-informed neural networks (PINNs) have recently become a highly successful approach to addressing both forward and inverse problems associated with partial differential equations. We found that traditional PINNs perform inadequately when applied to the inverse problem of designing neutral inclusions with arbitrary shapes. In this study, we introduce a novel approach, Conformal mapping Coordinates Physics-Informed Neural Networks (CoCo-PINNs), which integrates complex analysis techniques into PINNs. This method exhibits strong performance in solving forward-inverse problems to construct neutral inclusions of arbitrary shapes in two dimensions, where the imperfect interface condition on the inclusion’s boundary is modeled by training neural networks. Notably, we mathematically prove that training with a single linear field is sufficient to achieve neutrality for untrained linear fields in arbitrary directions, given a minor assumption. We demonstrate that CoCo-PINNs offer enhanced performances in terms of credibility, consistency, and stability.

[AI-41] Visual Language Models as Operator Agents in the Space Domain DATE

链接: https://arxiv.org/abs/2501.07802
作者: Alejandro Carrasco,Marco Nedungadi,Enrico M. Zucchelli,Amit Jain,Victor Rodriguez-Fernandez,Richard Linares
类目: Artificial Intelligence (cs.AI); Space Physics (physics.space-ph)
*备注: Updated version of the paper presented in 2025 AIAA SciTech. this https URL

点击查看摘要

Abstract:This paper explores the application of Vision-Language Models (VLMs) as operator agents in the space domain, focusing on both software and hardware operational paradigms. Building on advances in Large Language Models (LLMs) and their multimodal extensions, we investigate how VLMs can enhance autonomous control and decision-making in space missions. In the software context, we employ VLMs within the Kerbal Space Program Differential Games (KSPDG) simulation environment, enabling the agent to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers. In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites. Our results demonstrate that VLMs can effectively process visual and textual data to generate contextually appropriate actions, competing with traditional methods and non-multimodal LLMs in simulation tasks, and showing promise in real-world applications.

[AI-42] A Comparative Analysis of DNN-based White-Box Explainable AI Methods in Network Security

链接: https://arxiv.org/abs/2501.07801
作者: Osvaldo Arreche,Mustafa Abdallah
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:New research focuses on creating artificial intelligence (AI) solutions for network intrusion detection systems (NIDS), drawing its inspiration from the ever-growing number of intrusions on networked systems, increasing its complexity and intelligibility. Hence, the use of explainable AI (XAI) techniques in real-world intrusion detection systems comes from the requirement to comprehend and elucidate black-box AI models to security analysts. In an effort to meet such requirements, this paper focuses on applying and evaluating White-Box XAI techniques (particularly LRP, IG, and DeepLift) for NIDS via an end-to-end framework for neural network models, using three widely used network intrusion datasets (NSL-KDD, CICIDS-2017, and RoEduNet-SIMARGL2021), assessing its global and local scopes, and examining six distinct assessment measures (descriptive accuracy, sparsity, stability, robustness, efficiency, and completeness). We also compare the performance of white-box XAI methods with black-box XAI methods. The results show that using White-box XAI techniques scores high in robustness and completeness, which are crucial metrics for IDS. Moreover, the source codes for the programs developed for our XAI evaluation framework are available to be improved and used by the research community.

[AI-43] ransforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors

链接: https://arxiv.org/abs/2501.07774
作者: Saad Masrur,Jung-Fu(Thomas)Cheng,Atieh R. Khamesi,Ismail Guvenc
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: The paper has been submitted to IEEE Transactions on Machine Learning in Communications and Networking

点击查看摘要

Abstract:Indoor localization in challenging non-line-of-sight (NLOS) environments often leads to mediocre accuracy with traditional approaches. Deep learning (DL) has been applied to tackle these challenges; however, many DL approaches overlook computational complexity, especially for floating-point operations (FLOPs), making them unsuitable for resource-limited devices. Transformer-based models have achieved remarkable success in natural language processing (NLP) and computer vision (CV) tasks, motivating their use in wireless applications. However, their use in indoor localization remains nascent, and directly applying Transformers for indoor localization can be both computationally intensive and exhibit limitations in accuracy. To address these challenges, in this work, we introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of power delay profile (PDP) and enhances attention mechanisms by effectively capturing multi-variate correlation. Complementing this, we propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU Transformer) model, designed to reduce computational complexity without compromising localization accuracy. Together, these contributions mitigate the computational burden and dependency on large datasets, making Transformer models more efficient and suitable for resource-constrained scenarios. The proposed tokenization method enables the Vanilla Transformer to achieve a 90th percentile positioning error of 0.388 m in a highly NLOS indoor factory, surpassing conventional tokenization methods. The L-SwiGLU ViT further reduces the error to 0.355 m, achieving an 8.51% improvement. Additionally, the proposed model outperforms a 14.1 times larger model with a 46.13% improvement, underscoring its computational efficiency.

[AI-44] Deep Learning for Disease Outbreak Prediction: A Robust Early Warning Signal for Transcritical Bifurcations

链接: https://arxiv.org/abs/2501.07764
作者: Reza Miry,Amit K. Chakraborty,Russell Greiner,Mark A. Lewis,Hao Wang,Tianyu Guan,Pouria Ramazi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Early Warning Signals (EWSs) are vital for implementing preventive measures before a disease turns into a pandemic. While new diseases exhibit unique behaviors, they often share fundamental characteristics from a dynamical systems perspective. Moreover, measurements during disease outbreaks are often corrupted by different noise sources, posing challenges for Time Series Classification (TSC) tasks. In this study, we address the problem of having a robust EWS for disease outbreak prediction using a best-performing deep learning model in the domain of TSC. We employed two simulated datasets to train the model: one representing generated dynamical systems with randomly selected polynomial terms to model new disease behaviors, and another simulating noise-induced disease dynamics to account for noisy measurements. The model’s performance was analyzed using both simulated data from different disease models and real-world data, including influenza and COVID-19. Results demonstrate that the proposed model outperforms previous models, effectively providing EWSs of impending outbreaks across various scenarios. This study bridges advancements in deep learning with the ability to provide robust early warning signals in noisy environments, making it highly applicable to real-world crises involving emerging disease outbreaks.

[AI-45] Impatient Bandits: Optimizing for the Long-Term Without Delay

链接: https://arxiv.org/abs/2501.07761
作者: Kelly W. Zhang,Thomas Baldwin-McDonald,Kamil Ciosek,Lucas Maystre,Daniel Russo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Increasingly, recommender systems are tasked with improving users’ long-term satisfaction. In this context, we study a content exploration task, which we formalize as a bandit problem with delayed rewards. There is an apparent trade-off in choosing the learning signal: waiting for the full reward to become available might take several weeks, slowing the rate of learning, whereas using short-term proxy rewards reflects the actual long-term goal only imperfectly. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Rewards as well as shorter-term surrogate outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that quickly learns to identify content aligned with long-term success using this new predictive model. We prove a regret bound for our algorithm that depends on the \textitValue of Progressive Feedback, an information theoretic metric that captures the quality of short-term leading indicators that are observed prior to the long-term reward. We apply our approach to a podcast recommendation problem, where we seek to recommend shows that users engage with repeatedly over two months. We empirically validate that our approach significantly outperforms methods that optimize for short-term proxies or rely solely on delayed rewards, as demonstrated by an A/B test in a recommendation system that serves hundreds of millions of users.

[AI-46] Performance Optimization of Ratings-Based Reinforcement Learning AAAI2025

链接: https://arxiv.org/abs/2501.07755
作者: Evelyn Rose,Devin White,Mingkang Wu,Vernon Lawhern,Nicholas R. Waytowich,Yongcan Cao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to the Collaborative AI and Modeling of Humans Bridge Program at AAAI 2025

点击查看摘要

Abstract:This paper explores multiple optimization methods to improve the performance of rating-based reinforcement learning (RbRL). RbRL, a method based on the idea of human ratings, has been developed to infer reward functions in reward-free environments for the subsequent policy learning via standard reinforcement learning, which requires the availability of reward functions. Specifically, RbRL minimizes the cross entropy loss that quantifies the differences between human ratings and estimated ratings derived from the inferred reward. Hence, a low loss means a high degree of consistency between human ratings and estimated ratings. Despite its simple form, RbRL has various hyperparameters and can be sensitive to various factors. Therefore, it is critical to provide comprehensive experiments to understand the impact of various hyperparameters on the performance of RbRL. This paper is a work in progress, providing users some general guidelines on how to select hyperparameters in RbRL.

[AI-47] Rethinking AI Cultural Evaluation

链接: https://arxiv.org/abs/2501.07751
作者: Michal Bravansky,Filip Trhlik,Fazl Barez
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:As AI systems become more integrated into society, evaluating their capacity to align with diverse cultural values is crucial for their responsible deployment. Current evaluation methods predominantly rely on multiple-choice question (MCQ) datasets. In this study, we demonstrate that MCQs are insufficient for capturing the complexity of cultural values expressed in open-ended scenarios. Our findings highlight significant discrepancies between MCQ-based assessments and the values conveyed in unconstrained interactions. Based on these findings, we recommend moving beyond MCQs to adopt more open-ended, context-specific assessments that better reflect how AI models engage with cultural values in realistic settings.

[AI-48] CDS: Data Synthesis Method Guided by Cognitive Diagnosis Theory

链接: https://arxiv.org/abs/2501.07674
作者: Haokun Zhao,Jinyi Han,Jiaqing Liang,Yanghua Xiao
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated outstanding capabilities across various domains, but the increasing complexity of new challenges demands enhanced performance and adaptability. Traditional benchmarks, although comprehensive, often lack the granularity needed for detailed capability analysis. This study introduces the Cognitive Diagnostic Synthesis (CDS) method, which employs Cognitive Diagnosis Theory (CDT) for precise evaluation and targeted enhancement of LLMs. By decomposing complex tasks into discrete knowledge points, CDS accurately identifies and synthesizes data targeting model weaknesses, thereby enhancing the model’s performance. This framework proposes a comprehensive pipeline driven by knowledge point evaluation, synthesis, data augmentation, and filtering, which significantly improves the model’s mathematical and coding capabilities, achieving up to an 11.12% improvement in optimal scenarios.

[AI-49] Large Language Models for Interpretable Mental Health Diagnosis AAAI2025 ALT

链接: https://arxiv.org/abs/2501.07653
作者: Brian Hyeongseok Kim,Chao Wang
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Accepted at AAAI 2025 Workshop on Large Language Models and Generative AI for Health (GenAI4Health)

点击查看摘要

Abstract:We propose a clinical decision support system (CDSS) for mental health diagnosis that combines the strengths of large language models (LLMs) and constraint logic programming (CLP). Having a CDSS is important because of the high complexity of diagnostic manuals used by mental health professionals and the danger of diagnostic errors. Our CDSS is a software tool that uses an LLM to translate diagnostic manuals to a logic program and solves the program using an off-the-shelf CLP engine to query a patient’s diagnosis based on the encoded rules and provided data. By giving domain experts the opportunity to inspect the LLM-generated logic program, and making modifications when needed, our CDSS ensures that the diagnosis is not only accurate but also interpretable. We experimentally compare it with two baseline approaches of using LLMs: diagnosing patients using the LLM-only approach, and using the LLM-generated logic program but without expert inspection. The results show that, while LLMs are extremely useful in generating candidate logic programs, these programs still require expert inspection and modification to guarantee faithfulness to the official diagnostic manuals. Additionally, ethical concerns arise from the direct use of patient data in LLMs, underscoring the need for a safer hybrid approach like our proposed method.

[AI-50] SafePowerGraph-LLM : Novel Power Grid Graph Embedding and Optimization with Large Language Models

链接: https://arxiv.org/abs/2501.07639
作者: Fabien Bernier,Jun Cao,Maxime Cordy,Salah Ghamizi
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficiently solving Optimal Power Flow (OPF) problems in power systems is crucial for operational planning and grid management. There is a growing need for scalable algorithms capable of handling the increasing variability, constraints, and uncertainties in modern power networks while providing accurate and fast solutions. To address this, machine learning techniques, particularly Graph Neural Networks (GNNs) have emerged as promising approaches. This letter introduces SafePowerGraph-LLM, the first framework explicitly designed for solving OPF problems using Large Language Models (LLM)s. The proposed approach combines graph and tabular representations of power grids to effectively query LLMs, capturing the complex relationships and constraints in power systems. A new implementation of in-context learning and fine-tuning protocols for LLMs is introduced, tailored specifically for the OPF problem. SafePowerGraph-LLM demonstrates reliable performances using off-the-shelf LLM. Our study reveals the impact of LLM architecture, size, and fine-tuning and demonstrates our framework’s ability to handle realistic grid components and constraints.

[AI-51] Real-Time Decision-Making for Digital Twin in Additive Manufacturing with Model Predictive Control using Time-Series Deep Neural Networks

链接: https://arxiv.org/abs/2501.07601
作者: Yi-Ping Chen,Vispi Karkaria,Ying-Kuan Tsai,Faith Rolark,Daniel Quispe,Robert X. Gao,Jian Cao,Wei Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Digital Twin-a virtual replica of a physical system enabling real-time monitoring, model updating, prediction, and decision-making-combined with recent advances in machine learning (ML), offers new opportunities for proactive control strategies in autonomous manufacturing. However, achieving real-time decision-making with Digital Twins requires efficient optimization driven by accurate predictions of highly nonlinear manufacturing systems. This paper presents a simultaneous multi-step Model Predictive Control (MPC) framework for real-time decision-making, using a multi-variate deep neural network (DNN), named Time-Series Dense Encoder (TiDE), as the surrogate model. Different from the models in conventional MPC which only provide one-step ahead prediction, TiDE is capable of predicting future states within the prediction horizon in one shot (multi-step), significantly accelerating MPC. Using Directed Energy Deposition additive manufacturing as a case study, we demonstrate the effectiveness of the proposed MPC in achieving melt pool temperature tracking to ensure part quality, while reducing porosity defects by regulating laser power to maintain melt pool depth constraints. In this work, we first show that TiDE is capable of accurately predicting melt pool temperature and depth. Second, we demonstrate that the proposed MPC achieves precise temperature tracking while satisfying melt pool depth constraints within a targeted dilution range (10%-30%), reducing potential porosity defects. Compared to the PID controller, MPC results in smoother and less fluctuating laser power profiles with competitive or superior melt pool temperature control performance. This demonstrates MPC’s proactive control capabilities, leveraging time-series prediction and real-time optimization, positioning it as a powerful tool for future Digital Twin applications and real-time process optimization in manufacturing.

[AI-52] Multi-task Domain Adaptation for Computation Offloading in Edge-intelligence Networks

链接: https://arxiv.org/abs/2501.07585
作者: Runxin Han(1),Bo Yang(1),Zhiwen Yu(1),Xuelin Cao(2),George C. Alexandropoulos(3,4),Chau Yuen(5) ((1) School of Computer Science, Northwestern Polytechnical University, Xi’an, Shaanxi, China, (2) School of Cyber Engineering, Xidian University, Xi’an, Shaanxi, China, (3) Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece, (4) Department of Electrical and Computer Engineering, University of Illinois Chicago, IL, USA, (5) School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the field of multi-access edge computing (MEC), efficient computation offloading is crucial for improving resource utilization and reducing latency in dynamically changing environments. This paper introduces a new approach, termed as Multi-Task Domain Adaptation (MTDA), aiming to enhance the ability of computational offloading models to generalize in the presence of domain shifts, i.e., when new data in the target environment significantly differs from the data in the source domain. The proposed MTDA model incorporates a teacher-student architecture that allows continuous adaptation without necessitating access to the source domain data during inference, thereby maintaining privacy and reducing computational overhead. Utilizing a multi-task learning framework that simultaneously manages offloading decisions and resource allocation, the proposed MTDA approach outperforms benchmark methods regarding mean squared error and accuracy, particularly in environments with increasing numbers of users. It is observed by means of computer simulation that the proposed MTDA model maintains high performance across various scenarios, demonstrating its potential for practical deployment in emerging MEC applications.

[AI-53] EEG-ReMinD: Enhancing Neurodegenerative EEG Decoding through Self-Supervised State Reconstruction-Primed Riemannian Dynamics

链接: https://arxiv.org/abs/2501.08139
作者: Zirui Wang,Zhenxi Song,Yi Guo,Yuxin Liu,Guoyang Xu,Min Zhang,Zhiguo Zhang
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The development of EEG decoding algorithms confronts challenges such as data sparsity, subject variability, and the need for precise annotations, all of which are vital for advancing brain-computer interfaces and enhancing the diagnosis of diseases. To address these issues, we propose a novel two-stage approach named Self-Supervised State Reconstruction-Primed Riemannian Dynamics (EEG-ReMinD) , which mitigates reliance on supervised learning and integrates inherent geometric features. This approach efficiently handles EEG data corruptions and reduces the dependency on labels. EEG-ReMinD utilizes self-supervised and geometric learning techniques, along with an attention mechanism, to analyze the temporal dynamics of EEG features within the framework of Riemannian geometry, referred to as Riemannian dynamics. Comparative analyses on both intact and corrupted datasets from two different neurodegenerative disorders underscore the enhanced performance of EEG-ReMinD.

[AI-54] An Empirical Wall-Pressure Spectrum Model for Aeroacoustic Predictions Based on Symbolic Regression

链接: https://arxiv.org/abs/2501.08134
作者: Laura Botero Bolívar,David Huergo,Fernanda L. dos Santos,Cornelis H. Venner,Leandro D. de Santana,Esteban Ferrer
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fast-turn around methods to predict airfoil trailing-edge noise are crucial for incorporating noise limitations into design optimization loops of several applications. Among these aeroacoustic predictive models, Amiet’s theory offers the best balance between accuracy and simplicity. The accuracy of the model relies heavily on precise wall-pressure spectrum predictions, which are often based on single-equation formulations with adjustable parameters. These parameters are calibrated for particular airfoils and flow conditions and consequently tend to fail when applied outside their calibration range. This paper introduces a new wall-pressure spectrum empirical model designed to enhance the robustness and accuracy of current state-of-the-art predictions while widening the range of applicability of the model to different airfoils and flow conditions. The model is developed using AI-based symbolic regression via a genetic-algorithm-based approach, and applied to a dataset of wall-pressure fluctuations measured on NACA 0008 and NACA 63018 airfoils at multiple angles of attack and inflow velocities, covering turbulent boundary layers with both adverse and favorable pressure gradients. Validation against experimental data (outside the training dataset) demonstrates the robustness of the model compared to well-accepted semi-empirical models. Finally, the model is integrated with Amiet’s theory to predict the aeroacoustic noise of a full-scale wind turbine, showing good agreement with experimental measurements.

[AI-55] utorial: VAE as an inference paradigm for neuroimaging

链接: https://arxiv.org/abs/2501.08009
作者: C. Vázquez-García,F. J. Martínez-Murcia,F. Segovia Román,Juan M. Górriz Sáez
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:In this tutorial, we explore Variational Autoencoders (VAEs), an essential framework for unsupervised learning, particularly suited for high-dimensional datasets such as neuroimaging. By integrating deep learning with Bayesian inference, VAEs enable the generation of interpretable latent representations. This tutorial outlines the theoretical foundations of VAEs, addresses practical challenges such as convergence issues and over-fitting, and discusses strategies like the reparameterization trick and hyperparameter optimization. We also highlight key applications of VAEs in neuroimaging, demonstrating their potential to uncover meaningful patterns, including those associated with neurodegenerative processes, and their broader implications for analyzing complex brain data.

[AI-56] raining Hybrid Neural Networks with Multimode Optical Nonlinearities Using Digital Twins

链接: https://arxiv.org/abs/2501.07991
作者: Ilker Oguz,Louis J. E. Suter,Jih-Liang Hsieh,Mustafa Yildirim,Niyazi Ulas Dinc,Christophe Moser,Demetri Psaltis
类目: Optics (physics.optics); Artificial Intelligence (cs.AI)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:The ability to train ever-larger neural networks brings artificial intelligence to the forefront of scientific and technical discoveries. However, their exponentially increasing size creates a proportionally greater demand for energy and computational hardware. Incorporating complex physical events in networks as fixed, efficient computation modules can address this demand by decreasing the complexity of trainable layers. Here, we utilize ultrashort pulse propagation in multimode fibers, which perform large-scale nonlinear transformations, for this purpose. Training the hybrid architecture is achieved through a neural model that differentiably approximates the optical system. The training algorithm updates the neural simulator and backpropagates the error signal over this proxy to optimize layers preceding the optical one. Our experimental results achieve state-of-the-art image classification accuracies and simulation fidelity. Moreover, the framework demonstrates exceptional resilience to experimental drifts. By integrating low-energy physical systems into neural networks, this approach enables scalable, energy-efficient AI models with significantly reduced computational demands.

[AI-57] On the Statistical Capacity of Deep Generative Models

链接: https://arxiv.org/abs/2501.07763
作者: Edric Tam,David B. Dunson
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Deep generative models are routinely used in generating samples from complex, high-dimensional distributions. Despite their apparent successes, their statistical properties are not well understood. A common assumption is that with enough training data and sufficiently large neural networks, deep generative model samples will have arbitrarily small errors in sampling from any continuous target distribution. We set up a unifying framework that debunks this belief. We demonstrate that broad classes of deep generative models, including variational autoencoders and generative adversarial networks, are not universal generators. Under the predominant case of Gaussian latent variables, these models can only generate concentrated samples that exhibit light tails. Using tools from concentration of measure and convex geometry, we give analogous results for more general log-concave and strongly log-concave latent variable distributions. We extend our results to diffusion models via a reduction argument. We use the Gromov–Levy inequality to give similar guarantees when the latent variables lie on manifolds with positive Ricci curvature. These results shed light on the limited capacity of common deep generative models to handle heavy tails. We illustrate the empirical relevance of our work with simulations and financial data.

[AI-58] A Hybrid Framework for Reinsurance Optimization: Integrating Generative Models and Reinforcement Learning

链接: https://arxiv.org/abs/2501.06404
作者: Stella C. Dong,James R. Finlay
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reinsurance optimization is critical for insurers to manage risk exposure, ensure financial stability, and maintain solvency. Traditional approaches often struggle with dynamic claim distributions, high-dimensional constraints, and evolving market conditions. This paper introduces a novel hybrid framework that integrates Generative Models, specifically Variational Autoencoders (VAEs), with Reinforcement Learning (RL) using Proximal Policy Optimization (PPO). The framework enables dynamic and scalable optimization of reinsurance strategies by combining the generative modeling of complex claim distributions with the adaptive decision-making capabilities of reinforcement learning. The VAE component generates synthetic claims, including rare and catastrophic events, addressing data scarcity and variability, while the PPO algorithm dynamically adjusts reinsurance parameters to maximize surplus and minimize ruin probability. The framework’s performance is validated through extensive experiments, including out-of-sample testing, stress-testing scenarios (e.g., pandemic impacts, catastrophic events), and scalability analysis across portfolio sizes. Results demonstrate its superior adaptability, scalability, and robustness compared to traditional optimization techniques, achieving higher final surpluses and computational efficiency. Key contributions include the development of a hybrid approach for high-dimensional optimization, dynamic reinsurance parameterization, and validation against stochastic claim distributions. The proposed framework offers a transformative solution for modern reinsurance challenges, with potential applications in multi-line insurance operations, catastrophe modeling, and risk-sharing strategy design. Subjects: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2501.06404 [econ.EM] (or arXiv:2501.06404v1 [econ.EM] for this version) https://doi.org/10.48550/arXiv.2501.06404 Focus to learn more arXiv-issued DOI via DataCite

机器学习

[LG-0] Gradient Equilibrium in Online Learning: Theory and Applications

链接: https://arxiv.org/abs/2501.08330
作者: Anastasios N. Angelopoulos,Michael I. Jordan,Ryan J. Tibshirani
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: Code available at this https URL

点击查看摘要

Abstract:We present a new perspective on online learning that we refer to as gradient equilibrium: a sequence of iterates achieves gradient equilibrium if the average of gradients of losses along the sequence converges to zero. In general, this condition is not implied by nor implies sublinear regret. It turns out that gradient equilibrium is achievable by standard online learning methods such as gradient descent and mirror descent with constant step sizes (rather than decaying step sizes, as is usually required for no regret). Further, as we show through examples, gradient equilibrium translates into an interpretable and meaningful property in online prediction problems spanning regression, classification, quantile estimation, and others. Notably, we show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions under arbitrary distribution shift, based on simple post hoc online descent updates. We also show that post hoc gradient updates can be used to calibrate predicted quantiles under distribution shift, and that the framework leads to unbiased Elo scores for pairwise preference prediction.

[LG-1] A Similarity Measure Between Functions with Applications to Statistical Learning and Optimization

链接: https://arxiv.org/abs/2501.08317
作者: Chengpiao Huang,Kaizheng Wang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 9 pages

点击查看摘要

Abstract:In this note, we present a novel measure of similarity between two functions. It quantifies how the sub-optimality gaps of two functions convert to each other, and unifies several existing notions of functional similarity. We show that it has convenient operation rules, and illustrate its use in empirical risk minimization and non-stationary online optimization.

[LG-2] Path Loss Prediction Using Machine Learning with Extended Features

链接: https://arxiv.org/abs/2501.08306
作者: Jonathan Ethier,Mathieu Chateauvert,Ryan G. Dempsey,Alexis Bose
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 4 pages, 4 figures, conference paper

点击查看摘要

Abstract:Wireless communications rely on path loss modeling, which is most effective when it includes the physical details of the propagation environment. Acquiring this data has historically been challenging, but geographic information system data is becoming increasingly available with higher resolution and accuracy. Access to such details enables propagation models to more accurately predict coverage and minimize interference in wireless deployments. Machine learning-based modeling can significantly support this effort, with feature-based approaches allowing for accurate, efficient, and scalable propagation modeling. Building on previous work, we introduce an extended set of features that improves prediction accuracy while, most importantly, maintaining model generalization across a broad range of environments.

[LG-3] Benchmarking Graph Representations and Graph Neural Networks for Multivariate Time Series Classification

链接: https://arxiv.org/abs/2501.08305
作者: Wennuo Yang,Shiling Wu,Yuzhi Zhou,Weicheng Xie,Linlin Shen,Siyang Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate Time Series Classification (MTSC) enables the analysis if complex temporal data, and thus serves as a cornerstone in various real-world applications, ranging from healthcare to finance. Since the relationship among variables in MTS usually contain crucial cues, a large number of graph-based MTSC approaches have been proposed, as the graph topology and edges can explicitly represent relationships among variables (channels), where not only various MTS graph representation learning strategies but also different Graph Neural Networks (GNNs) have been explored. Despite such progresses, there is no comprehensive study that fairly benchmarks and investigates the performances of existing widely-used graph representation learning strategies/GNN classifiers in the application of different MTSC tasks. In this paper, we present the first benchmark which systematically investigates the effectiveness of the widely-used three node feature definition strategies, four edge feature learning strategies and five GNN architecture, resulting in 60 different variants for graph-based MTSC. These variants are developed and evaluated with a standardized data pipeline and training/validation/testing strategy on 26 widely-used suspensor MTSC datasets. Our experiments highlight that node features significantly influence MTSC performance, while the visualization of edge features illustrates why adaptive edge learning outperforms other edge feature learning methods. The code of the proposed benchmark is publicly available at \urlthis https URL.

[LG-4] Decoding Interpretable Logic Rules from Neural Networks

链接: https://arxiv.org/abs/2501.08281
作者: Chuqin Geng,Xiaojie Xu,Zhaoyue Wang,Ziyu Zhao,Xujie Si
类目: Machine Learning (cs.LG)
*备注: 23 pages, 7 figures

点击查看摘要

Abstract:As deep neural networks continue to excel across various domains, their black-box nature has raised concerns about transparency and trust. In particular, interpretability has become increasingly essential for applications that demand high safety and knowledge rigor, such as drug discovery, autonomous driving, and genomics. However, progress in understanding even the simplest deep neural networks - such as fully connected networks - has been limited, despite their role as foundational elements in state-of-the-art models like ResNet and Transformer. In this paper, we address this challenge by introducing NeuroLogic, a novel approach for decoding interpretable logic rules from neural networks. NeuroLogic leverages neural activation patterns to capture the model’s critical decision-making processes, translating them into logical rules represented by hidden predicates. Thanks to its flexible design in the grounding phase, NeuroLogic can be adapted to a wide range of neural networks. For simple fully connected neural networks, hidden predicates can be grounded in certain split patterns of original input features to derive decision-tree-like rules. For large, complex vision neural networks, NeuroLogic grounds hidden predicates into high-level visual concepts that are understandable to humans. Our empirical study demonstrates that NeuroLogic can extract global and interpretable rules from state-of-the-art models such as ResNet, a task at which existing work struggles. We believe NeuroLogic can help pave the way for understanding the black-box nature of neural networks.

[LG-5] Multiplayer Federated Learning: Reaching Equilibrium with Less Communication

链接: https://arxiv.org/abs/2501.08263
作者: TaeHo Yoon,Sayantan Choudhury,Nicolas Loizou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 43 pages, 5 figures

点击查看摘要

Abstract:Traditional Federated Learning (FL) approaches assume collaborative clients with aligned objectives working towards a shared global model. However, in many real-world scenarios, clients act as rational players with individual objectives and strategic behaviors, a concept that existing FL frameworks are not equipped to adequately address. To bridge this gap, we introduce Multiplayer Federated Learning (MpFL), a novel framework that models the clients in the FL environment as players in a game-theoretic context, aiming to reach an equilibrium. In this scenario, each player tries to optimize their own utility function, which may not align with the collective goal. Within MpFL, we propose Per-Player Local Stochastic Gradient Descent (PEARL-SGD), an algorithm in which each player/client performs local updates independently and periodically communicates with other players. We theoretically analyze PEARL-SGD and prove that it reaches a neighborhood of equilibrium with less communication in the stochastic setup compared to its non-local counterpart. Finally, we verify our theoretical findings through numerical experiments.

[LG-6] FDPP: Fine-tune Diffusion Policy with Human Preference

链接: https://arxiv.org/abs/2501.08259
作者: Yuxin Chen,Devesh K. Jha,Masayoshi Tomizuka,Diego Romeres
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Imitation learning from human demonstrations enables robots to perform complex manipulation tasks and has recently witnessed huge success. However, these techniques often struggle to adapt behavior to new preferences or changes in the environment. To address these limitations, we propose Fine-tuning Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function through preference-based learning. This reward is then used to fine-tune the pre-trained policy with reinforcement learning (RL), resulting in alignment of pre-trained policy with new human preferences while still solving the original task. Our experiments across various robotic tasks and preferences demonstrate that FDPP effectively customizes policy behavior without compromising performance. Additionally, we show that incorporating Kullback-Leibler (KL) regularization during fine-tuning prevents over-fitting and helps maintain the competencies of the initial policy.

[LG-7] xt-Diffusion Red-Teaming of Large Language Models : Unveiling Harmful Behaviors with Proximity Constraints AAAI25

链接: https://arxiv.org/abs/2501.08246
作者: Jonathan Nöther,Adish Singla,Goran Radanović
类目: Machine Learning (cs.LG)
*备注: This is an extended version of a paper published at AAAI 25

点击查看摘要

Abstract:Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. We systematically evaluate our method by comparing its effectiveness with established methods based on model fine-tuning and zero- and few-shot prompting. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.

[LG-8] Privacy-Preserving Model and Preprocessing Verification for Machine Learning

链接: https://arxiv.org/abs/2501.08236
作者: Wenbiao Li,Anisa Halimi,Xiaoqian Jiang,Jaideep Vaidya,Erman Ayday
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a framework for privacy-preserving verification of machine learning models, focusing on models trained on sensitive data. Integrating Local Differential Privacy (LDP) with model explanations from LIME and SHAP, our framework enables robust verification without compromising individual privacy. It addresses two key tasks: binary classification, to verify if a target model was trained correctly by applying the appropriate preprocessing steps, and multi-class classification, to identify specific preprocessing errors. Evaluations on three real-world datasets-Diabetes, Adult, and Student Record-demonstrate that while the ML-based approach is particularly effective in binary tasks, the threshold-based method performs comparably in multi-class tasks. Results indicate that although verification accuracy varies across datasets and noise levels, the framework provides effective detection of preprocessing errors, strong privacy guarantees, and practical applicability for safeguarding sensitive data.

[LG-9] Big Batch Bayesian Active Learning by Considering Predictive Probabilities NEURIPS

链接: https://arxiv.org/abs/2501.08223
作者: Sebastian W. Ober,Samuel Power,Tom Diethe,Henry B. Moss
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 7 pages, 2 figures; presented as a lightning talk at the NeurIPS Workshop on Bayesian Decision-making and Uncertainty (BDU; 2024)

点击查看摘要

Abstract:We observe that BatchBALD, a popular acquisition function for batch Bayesian active learning for classification, can conflate epistemic and aleatoric uncertainty, leading to suboptimal performance. Motivated by this observation, we propose to focus on the predictive probabilities, which only exhibit epistemic uncertainty. The result is an acquisition function that not only performs better, but is also faster to evaluate, allowing for larger batches than before.

[LG-10] Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings

链接: https://arxiv.org/abs/2501.08219
作者: Paul Joe Maliakel,Shashikant Ilager,Ivona Brandic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown significant improvements in many natural language processing (NLP) tasks, accelerating their rapid adoption across many industries. These models are resource-intensive, requiring extensive computational resources both during training and inference, leading to increased energy consumption and negative environmental impact. As their adoption accelerates, the sustainability of LLMs has become a critical issue, necessitating strategies to optimize their runtime efficiency without compromising performance. Hence, it is imperative to identify the parameters that significantly influence the performance and energy efficiency of LLMs. To that end, in this work, we investigate the effect of important parameters on the performance and energy efficiency of LLMs during inference and examine their trade-offs. First, we analyze how different types of models with varying numbers of parameters and architectures perform on tasks like text generation, question answering, and summarization by benchmarking LLMs such as Falcon-7B, Mistral-7B-v0.1, T5-3B, GPT-2, GPT-J-6B, and GPT-Neo-2.7B. Second, we study input and output sequence characteristics such as sequence length concerning energy consumption, performance, and throughput. Finally, we explore the impact of hardware-based power-saving techniques, i.e., Dynamic Voltage Frequency Scaling (DVFS), on the models’ latency and energy efficiency. Our extensive benchmarking and statistical analysis reveal many interesting findings, uncovering how specific optimizations can reduce energy consumption while maintaining throughput and accuracy. This study provides actionable insights for researchers and practitioners to design energy-efficient LLM inference systems. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2501.08219 [cs.LG] (or arXiv:2501.08219v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2501.08219 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-11] Modeling Quantum Machine Learning for Genomic Data Analysis

链接: https://arxiv.org/abs/2501.08193
作者: Navneet Singh,Shiva Raj Pokhrel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantum Machine Learning (QML) continues to evolve, unlocking new opportunities for diverse applications. In this study, we investigate and evaluate the applicability of QML models for binary classification of genome sequence data by employing various feature mapping techniques. We present an open-source, independent Qiskit-based implementation to conduct experiments on a benchmark genomic dataset. Our simulations reveal that the interplay between feature mapping techniques and QML algorithms significantly influences performance. Notably, the Pegasos Quantum Support Vector Classifier (Pegasos-QSVC) exhibits high sensitivity, particularly excelling in recall metrics, while Quantum Neural Networks (QNN) achieve the highest training accuracy across all feature maps. However, the pronounced variability in classifier performance, dependent on feature mapping, highlights the risk of overfitting to localized output distributions in certain scenarios. This work underscores the transformative potential of QML for genomic data classification while emphasizing the need for continued advancements to enhance the robustness and accuracy of these methodologies.

[LG-12] Inference-Time-Compute: More Faithful? A Research Note

链接: https://arxiv.org/abs/2501.08156
作者: James Chua,Owain Evans
类目: Machine Learning (cs.LG)
*备注: 7 pages, 5 figures

点击查看摘要

Abstract:Models trained specifically to generate long Chains of Thought (CoTs) have recently achieved impressive results. We refer to these models as Inference-Time-Compute (ITC) models. Are the CoTs of ITC models more faithful compared to traditional non-ITC models? We evaluate two ITC models (based on Qwen-2.5 and Gemini-2) on an existing test of faithful CoT To measure faithfulness, we test if models articulate cues in their prompt that influence their answers to MMLU questions. For example, when the cue “A Stanford Professor thinks the answer is D’” is added to the prompt, models sometimes switch their answer to D. In such cases, the Gemini ITC model articulates the cue 54% of the time, compared to 14% for the non-ITC Gemini. We evaluate 7 types of cue, such as misleading few-shot examples and anchoring on past responses. ITC models articulate cues that influence them much more reliably than all the 6 non-ITC models tested, such as Claude-3.5-Sonnet and GPT-4o, which often articulate close to 0% of the time. However, our study has important limitations. We evaluate only two ITC models – we cannot evaluate OpenAI’s SOTA o1 model. We also lack details about the training of these ITC models, making it hard to attribute our findings to specific processes. We think faithfulness of CoT is an important property for AI Safety. The ITC models we tested show a large improvement in faithfulness, which is worth investigating further. To speed up this investigation, we release these early results as a research note. Comments: 7 pages, 5 figures Subjects: Machine Learning (cs.LG) Cite as: arXiv:2501.08156 [cs.LG] (or arXiv:2501.08156v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2501.08156 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: James Chua [view email] [v1] Tue, 14 Jan 2025 14:31:45 UTC (293 KB)

[LG-13] Smooth Handovers via Smoothed Online Learning

链接: https://arxiv.org/abs/2501.08099
作者: Michail Kalntis,Andra Lutu,Jesús Omaña Iglesias,Fernando A. Kuipers,George Iosifidis
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With users demanding seamless connectivity, handovers (HOs) have become a fundamental element of cellular networks. However, optimizing HOs is a challenging problem, further exacerbated by the growing complexity of mobile networks. This paper presents the first countrywide study of HO optimization, through the prism of Smoothed Online Learning (SOL). We first analyze an extensive dataset from a commercial mobile network operator (MNO) in Europe with more than 40M users, to understand and reveal important features and performance impacts on HOs. Our findings highlight a correlation between HO failures/delays, and the characteristics of radio cells and end-user devices, showcasing the impact of heterogeneity in mobile networks nowadays. We subsequently model UE-cell associations as dynamic decisions and propose a realistic system model for smooth and accurate HOs that extends existing approaches by (i) incorporating device and cell features on HO optimization, and (ii) eliminating (prior) strong assumptions about requiring future signal measurements and knowledge of end-user mobility. Our algorithm, aligned with the O-RAN paradigm, provides robust dynamic regret guarantees, even in challenging environments, and shows superior performance in multiple scenarios with real-world and synthetic data.

[LG-14] CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning

链接: https://arxiv.org/abs/2501.08071
作者: Guoliang He,Eiko Yoneki
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: cgo 2025

点击查看摘要

Abstract:Large language models (LLMs) are remarked by their substantial computational requirements. To mitigate the cost, researchers develop specialized CUDA kernels, which often fuse several tensor operations to maximize the utilization of GPUs as much as possible. However, those specialized kernels may still leave performance on the table as CUDA assembly experts show that manual optimization of GPU SASS schedules can lead to better performance, and trial-and-error is largely employed to manually find the best GPU SASS schedules. In this work, we employ an automatic approach to optimize GPU SASS schedules, which thus can be integrated into existing compiler frameworks. The key to automatic optimization is training an RL agent to mimic how human experts perform manual scheduling. To this end, we formulate an assembly game, where RL agents can play to find the best GPU SASS schedules. The assembly game starts from a \textit-O3 optimized SASS schedule, and the RL agents can iteratively apply actions to mutate the current schedules. Positive rewards are generated if the mutated schedules get higher throughput by executing on GPUs. Experiments show that CuAsmRL can further improve the performance of existing specialized CUDA kernels transparently by up to 26% , and on average 9% . Moreover, it is used as a tool to reveal potential optimization moves learned automatically. Comments: cgo 2025 Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG) Cite as: arXiv:2501.08071 [cs.AR] (or arXiv:2501.08071v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2501.08071 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-15] Optimal Policy Adaptation under Covariate Shift

链接: https://arxiv.org/abs/2501.08067
作者: Xueqing Liu,Qinwei Yang,Zhaoqing Tian,Ruocheng Guo,Peng Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transfer learning of prediction models has been extensively studied, while the corresponding policy learning approaches are rarely discussed. In this paper, we propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets: one with full information from the source domain and the other from the target domain with only covariates. First, under the setting of covariate shift, we formulate the problem from a perspective of causality and present the identifiability assumptions for the reward induced by a given policy. Then, we derive the efficient influence function and the semiparametric efficiency bound for the reward. Based on this, we construct a doubly robust and semiparametric efficient estimator for the reward and then learn the optimal policy by optimizing the estimated reward. Moreover, we theoretically analyze the bias and the generalization error bound for the learned policy. Furthermore, in the presence of both covariate and concept shifts, we propose a novel sensitivity analysis method to evaluate the robustness of the proposed policy learning approach. Extensive experiments demonstrate that the approach not only estimates the reward more accurately but also yields a policy that closely approximates the theoretically optimal policy.

[LG-16] UFGraphFR: An attempt at a federated recommendation system based on user text characteristics

链接: https://arxiv.org/abs/2501.08044
作者: Xudong Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning has become an important research area in ‘private computing’ due to the ‘useable invisibility’ of data during training. Inspired by Federated learning, the federated recommendation system has gradually become a new recommendation service architecture that can protect users’ privacy. The use of user diagrams to enhance federated recommendations is a promising topic. How to use user diagrams to enhance federated recommendations is a promising research topic. However, it’s a great challenge to construct a user diagram without compromising privacy in a federated learning scenario. Inspired by the simple idea that similar users often have the same attribute characteristics, we propose a personalized federated recommendation algorithm based on the user relationship graph constructed by the user text characteristics(Graph Federation Recommendation System based on User Text description Features, UFGraphFR). The method uses the embedding layer weight of the user’s text feature description to construct the user relationship graph. It introduces the Transformer mechanism to capture the sequence modeling of the user’s historical interaction sequence. Without access to user history interactions and specific user attributes, the federal learning privacy protection of data ‘useable invisibility’ is embodied. Preliminary experiments on some benchmark datasets demonstrate the superior performance of UFGraphFR. Our experiments show that this model can protect user privacy to some extent without affecting the performance of the recommendation system. The code will be easily available on this https URL.

[LG-17] PolyLUT: Ultra-low Latency Polynomial Inference with Hardware-Aware Structured Pruning

链接: https://arxiv.org/abs/2501.08043
作者: Marta Andronic,Jiawen Li,George A. Constantinides
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: arXiv admin note: text overlap with arXiv:2309.02334

点击查看摘要

Abstract:Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded these operations inside FPGA lookup tables (LUTs). However, FPGA LUTs can implement a much greater variety of functions. In this paper, we propose a novel approach to training DNNs for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. By using polynomial building blocks, we achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. LUT-based implementations also face a significant challenge: the LUT size grows exponentially with the number of inputs. Prior work relies on a priori fixed sparsity, with results heavily dependent on seed selection. To address this, we propose a structured pruning strategy using a bespoke hardware-aware group regularizer that encourages a particular sparsity pattern that leads to a small number of inputs per neuron. We demonstrate the effectiveness of PolyLUT on three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and MNIST.

[LG-18] Convergence Analysis of Real-time Recurrent Learning (RTRL) for a class of Recurrent Neural Networks

链接: https://arxiv.org/abs/2501.08040
作者: Samuel Chun-Hei Lam,Justin Sirignano,Konstantinos Spiliopoulos
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recurrent neural networks (RNNs) are commonly trained with the truncated backpropagation-through-time (TBPTT) algorithm. For the purposes of computational tractability, the TBPTT algorithm truncates the chain rule and calculates the gradient on a finite block of the overall data sequence. Such approximation could lead to significant inaccuracies, as the block length for the truncated backpropagation is typically limited to be much smaller than the overall sequence length. In contrast, Real-time recurrent learning (RTRL) is an online optimization algorithm which asymptotically follows the true gradient of the loss on the data sequence as the number of sequence time steps t \rightarrow \infty . RTRL forward propagates the derivatives of the RNN hidden/memory units with respect to the parameters and, using the forward derivatives, performs online updates of the parameters at each time step in the data sequence. RTRL’s online forward propagation allows for exact optimization over extremely long data sequences, although it can be computationally costly for models with large numbers of parameters. We prove convergence of the RTRL algorithm for a class of RNNs. The convergence analysis establishes a fixed point for the joint distribution of the data sequence, RNN hidden layer, and the RNN hidden layer forward derivatives as the number of data samples from the sequence and the number of training steps tend to infinity. We prove convergence of the RTRL algorithm to a stationary point of the loss. Numerical studies illustrate our theoretical results. One potential application area for RTRL is the analysis of financial data, which typically involve long time series and models with small to medium numbers of parameters. This makes RTRL computationally tractable and a potentially appealing optimization method for training models. Thus, we include an example of RTRL applied to limit order book data.

[LG-19] Enhanced SPS Velocity-adaptive Scheme: Access Fariness in 5G NR V2I Networks

链接: https://arxiv.org/abs/2501.08037
作者: Xiao Xu,Qiong Wu,Pingyi Fan,Kezhi Wang
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: This paper has been submitted to IEEE Journal. The source code has been released at: this https URL

点击查看摘要

Abstract:Vehicle-to-Infrastructure (V2I) technology enables information exchange between vehicles and road infrastructure. Specifically, when a vehicle approaches a roadside unit (RSU), it can exchange information with the RSU to obtain accurate data that assists in driving. With the release of the 3rd Generation Partnership Project (3GPP) Release 16, which includes the 5G New Radio (NR) Vehicle-to-Everything (V2X) standards, vehicles typically adopt mode-2 communication using sensing-based semi-persistent scheduling (SPS) for resource allocation. In this approach, vehicles identify candidate resources within a selection window and exclude ineligible resources based on information from a sensing window. However, vehicles often drive at different speeds, resulting in varying amounts of data transmission with RSUs as they pass by, which leads to unfair access. Therefore, it is essential to design an access scheme that accounts for different vehicle speeds to achieve fair access across the network. This paper formulates an optimization problem for vehicular networks and proposes a multi-objective optimization scheme to address it by adjusting the selection window in the SPS mechanism of 5G NR V2I mode-2. Simulation results demonstrate the effectiveness of the proposed scheme

[LG-20] Unsupervised Feature Construction for Anomaly Detection in Time Series – An Evaluation

链接: https://arxiv.org/abs/2501.07999
作者: Marine Hamon,Vincent Lemaire,Nour Eddine Yassine Nair-Benrekia,Samuel Berlemont,Julien Cumin
类目: Machine Learning (cs.LG)
*备注: 7

点击查看摘要

Abstract:To detect anomalies with precision and without prior knowledge in time series, is it better to build a detector from the initial temporal representation, or to compute a new (tabular) representation using an existing automatic variable construction library? In this article, we address this question by conducting an in-depth experimental study for two popular detectors (Isolation Forest and Local Outlier Factor). The obtained results, for 5 different datasets, show that the new representation, computed using the tsfresh library, allows Isolation Forest to significantly improve its performance.

[LG-21] Reward Compatibility: A Framework for Inverse RL

链接: https://arxiv.org/abs/2501.07996
作者: Filippo Lazzati,Mirco Mutti,Alberto Metelli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We provide an original theoretical study of Inverse Reinforcement Learning (IRL) through the lens of reward compatibility, a novel framework to quantify the compatibility of a reward with the given expert’s demonstrations. Intuitively, a reward is more compatible with the demonstrations the closer the performance of the expert’s policy computed with that reward is to the optimal performance for that reward. This generalizes the notion of feasible reward set, the most common framework in the theoretical IRL literature, for which a reward is either compatible or not compatible. The grayscale introduced by the reward compatibility is the key to extend the realm of provably efficient IRL far beyond what is attainable with the feasible reward set: from tabular to large-scale MDPs. We analyze the IRL problem across various settings, including optimal and suboptimal expert’s demonstrations and both online and offline data collection. For all of these dimensions, we provide a tractable algorithm and corresponding sample complexity analysis, as well as various insights on reward compatibility and how the framework can pave the way to yet more general problem settings.

[LG-22] CHEQ-ing the Box: Safe Variable Impedance Learning for Robotic Polishing

链接: https://arxiv.org/abs/2501.07985
作者: Emma Cramer,Lukas Jäschke,Sebastian Trimpe
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robotic systems are increasingly employed for industrial automation, with contact-rich tasks like polishing requiring dexterity and compliant behaviour. These tasks are difficult to model, making classical control challenging. Deep reinforcement learning (RL) offers a promising solution by enabling the learning of models and control policies directly from data. However, its application to real-world problems is limited by data inefficiency and unsafe exploration. Adaptive hybrid RL methods blend classical control and RL adaptively, combining the strengths of both: structure from control and learning from RL. This has led to improvements in data efficiency and exploration safety. However, their potential for hardware applications remains underexplored, with no evaluations on physical systems to date. Such evaluations are critical to fully assess the practicality and effectiveness of these methods in real-world settings. This work presents an experimental demonstration of the hybrid RL algorithm CHEQ for robotic polishing with variable impedance, a task requiring precise force and velocity tracking. In simulation, we show that variable impedance enhances polishing performance. We compare standalone RL with adaptive hybrid RL, demonstrating that CHEQ achieves effective learning while adhering to safety constraints. On hardware, CHEQ achieves effective polishing behaviour, requiring only eight hours of training and incurring just five failures. These results highlight the potential of adaptive hybrid RL for real-world, contact-rich tasks trained directly on hardware.

[LG-23] Phase of Flight Classification in Aviation Safety using LSTM GRU and BiLSTM: A Case Study with ASN Dataset

链接: https://arxiv.org/abs/2501.07925
作者: Aziida Nanyonga,Hassan Wasswa,Graham Wild
类目: Machine Learning (cs.LG)
*备注: Aviation Safety, Deep learning algorithms, Flight phase, NLP, ASN, and Classification

点击查看摘要

Abstract:Safety is the main concern in the aviation industry, where even minor operational issues can lead to serious consequences. This study addresses the need for comprehensive aviation accident analysis by leveraging natural language processing (NLP) and advanced AI models to classify the phase of flight from unstructured aviation accident analysis narratives. The research aims to determine whether the phase of flight can be inferred from narratives of post-accident events using NLP techniques. The classification performance of various deep learning models was evaluated. For single RNN-based models, LSTM achieved an accuracy of 63%, precision 60%, and recall 61%. BiLSTM recorded an accuracy of 64%, precision 63%, and a recall of 64%. GRU exhibited balanced performance with an accuracy and recall of 60% and a precision of 63%. Joint RNN-based models further enhanced predictive capabilities. GRU-LSTM, LSTM-BiLSTM, and GRU-BiLSTM demonstrated accuracy rates of 62%, 67%, and 60%, respectively, showcasing the benefits of combining these architectures. To provide a comprehensive overview of model performance, single and combined models were compared in terms of the various metrics. These results underscore the models’ capacity to classify the phase of flight from raw text narratives, equipping aviation industry stakeholders with valuable insights for proactive decision-making. Therefore, this research signifies a substantial advancement in the application of NLP and deep learning models to enhance aviation safety.

[LG-24] MD-Syn: Synergistic drug combination prediction based on the multidimensional feature fusion method and attention mechanisms

链接: https://arxiv.org/abs/2501.07884
作者: XinXin Ge,Yi-Ting Lee,Shan-Ju Yeh
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Drug combination therapies have shown promising therapeutic efficacy in complex diseases and have demonstrated the potential to reduce drug resistance. However, the huge number of possible drug combinations makes it difficult to screen them all in traditional experiments. In this study, we proposed MD-Syn, a computational framework, which is based on the multidimensional feature fusion method and multi-head attention mechanisms. Given drug pair-cell line triplets, MD-Syn considers one-dimensional and two-dimensional feature spaces simultaneously. It consists of a one-dimensional feature embedding module (1D-FEM), a two-dimensional feature embedding module (2D-FEM), and a deep neural network-based classifier for synergistic drug combination prediction. MD-Syn achieved the AUROC of 0.919 in 5-fold cross-validation, outperforming the state-of-the-art methods. Further, MD-Syn showed comparable results over two independent datasets. In addition, the multi-head attention mechanisms not only learn embeddings from different feature aspects but also focus on essential interactive feature elements, improving the interpretability of MD-Syn. In summary, MD-Syn is an interpretable framework to prioritize synergistic drug combination pairs with chemicals and cancer cell line gene expression profiles. To facilitate broader community access to this model, we have developed a web portal (this https URL) that enables customized predictions of drug combination synergy effects based on user-specified compounds.

[LG-25] Distributed Nonparametric Estimation: from Sparse to Dense Samples per Terminal

链接: https://arxiv.org/abs/2501.07879
作者: Deheng Yuan,Tao Guo,Zhongyi Huang
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Consider the communication-constrained problem of nonparametric function estimation, in which each distributed terminal holds multiple i.i.d. samples. Under certain regularity assumptions, we characterize the minimax optimal rates for all regimes, and identify phase transitions of the optimal rates as the samples per terminal vary from sparse to dense. This fully solves the problem left open by previous works, whose scopes are limited to regimes with either dense samples or a single sample per terminal. To achieve the optimal rates, we design a layered estimation protocol by exploiting protocols for the parametric density estimation problem. We show the optimality of the protocol using information-theoretic methods and strong data processing inequalities, and incorporating the classic balls and bins model. The optimal rates are immediate for various special cases such as density estimation, Gaussian, binary, Poisson and heteroskedastic regression models.

[LG-26] Prediction Interval Construction Method for Electricity Prices

链接: https://arxiv.org/abs/2501.07827
作者: Xin Lu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Accurate prediction of electricity prices plays an essential role in the electricity market. To reflect the uncertainty of electricity prices, price intervals are predicted. This paper proposes a novel prediction interval construction method. A conditional generative adversarial network is first presented to generate electricity price scenarios, with which the prediction intervals can be constructed. Then, different generated scenarios are stacked to obtain the probability densities, which can be applied to accurately reflect the uncertainty of electricity prices. Furthermore, a reinforced prediction mechanism based on the volatility level of weather factors is introduced to address the spikes or volatile prices. A case study is conducted to verify the effectiveness of the proposed novel prediction interval construction method. The method can also provide the probability density of each price scenario within the prediction interval and has the superiority to address the volatile prices and price spikes with a reinforced prediction mechanism.

[LG-27] Linearly Convergent Mixup Learning

链接: https://arxiv.org/abs/2501.07794
作者: Gakuto Obi,Ayato Saito,Yuto Sasaki,Tsuyoshi Kato
类目: Machine Learning (cs.LG)
*备注: none

点击查看摘要

Abstract:Learning in the reproducing kernel Hilbert space (RKHS) such as the support vector machine has been recognized as a promising technique. It continues to be highly effective and competitive in numerous prediction tasks, particularly in settings where there is a shortage of training data or computational limitations exist. These methods are especially valued for their ability to work with small datasets and their interpretability. To address the issue of limited training data, mixup data augmentation, widely used in deep learning, has remained challenging to apply to learning in RKHS due to the generation of intermediate class labels. Although gradient descent methods handle these labels effectively, dual optimization approaches are typically not directly applicable. In this study, we present two novel algorithms that extend to a broader range of binary classification models. Unlike gradient-based approaches, our algorithms do not require hyperparameters like learning rates, simplifying their implementation and optimization. Both the number of iterations to converge and the computational cost per iteration scale linearly with respect to the dataset size. The numerical experiments demonstrate that our algorithms achieve faster convergence to the optimal solution compared to gradient descent approaches, and that mixup data augmentation consistently improves the predictive performance across various loss functions.

[LG-28] Symmetry-Aware Generative Modeling through Learned Canonicalization

链接: https://arxiv.org/abs/2501.07773
作者: Kusha Sareen,Daniel Levy,Arnab Kumar Mondal,Sékou-Oumar Kaba,Tara Akhound-Sadegh,Siamak Ravanbakhsh
类目: Machine Learning (cs.LG)
*备注: NeurReps 2024 Workshop Version

点击查看摘要

Abstract:Generative modeling of symmetric densities has a range of applications in AI for science, from drug discovery to physics simulations. The existing generative modeling paradigm for invariant densities combines an invariant prior with an equivariant generative process. However, we observe that this technique is not necessary and has several drawbacks resulting from the limitations of equivariant networks. Instead, we propose to model a learned slice of the density so that only one representative element per orbit is learned. To accomplish this, we learn a group-equivariant canonicalization network that maps training samples to a canonical pose and train a non-equivariant generative model over these canonicalized samples. We implement this idea in the context of diffusion models. Our preliminary experimental results on molecular modeling are promising, demonstrating improved sample quality and faster inference time.

[LG-29] PINN-FEM: A Hybrid Approach for Enforcing Dirichlet Boundary Conditions in Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2501.07765
作者: Nahil Sobh,Rini Jasmine Gladstone,Hadi Meidani
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 22 pages

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) solve partial differential equations (PDEs) by embedding governing equations and boundary/initial conditions into the loss function. However, enforcing Dirichlet boundary conditions accurately remains challenging, often leading to soft enforcement that compromises convergence and reliability in complex domains. We propose a hybrid approach, PINN-FEM, which combines PINNs with finite element methods (FEM) to impose strong Dirichlet boundary conditions via domain decomposition. This method incorporates FEM-based representations near the boundary, ensuring exact enforcement without compromising convergence. Through six experiments of increasing complexity, PINN-FEM outperforms standard PINN models, showcasing superior accuracy and robustness. While distance functions and similar techniques have been proposed for boundary condition enforcement, they lack generality for real-world applications. PINN-FEM bridges this gap by leveraging FEM near boundaries, making it well-suited for industrial and scientific problems.

[LG-30] Scaling Up ESM2 Architectures for Long Protein Sequences Analysis: Long and Quantized Approaches

链接: https://arxiv.org/abs/2501.07747
作者: Gabriel Bianchin de Oliveira,Helio Pedrini,Zanoni Dias
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Various approaches utilizing Transformer architectures have achieved state-of-the-art results in Natural Language Processing (NLP). Based on this success, numerous architectures have been proposed for other types of data, such as in biology, particularly for protein sequences. Notably among these are the ESM2 architectures, pre-trained on billions of proteins, which form the basis of various state-of-the-art approaches in the field. However, the ESM2 architectures have a limitation regarding input size, restricting it to 1,022 amino acids, which necessitates the use of preprocessing techniques to handle sequences longer than this limit. In this paper, we present the long and quantized versions of the ESM2 architectures, doubling the input size limit to 2,048 amino acids.

[LG-31] HyperQuery: Beyond Binary Link Prediction

链接: https://arxiv.org/abs/2501.07731
作者: Sepideh Maleki,Josh Vekhter,Keshav Pingali
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Groups with complex set intersection relations are a natural way to model a wide array of data, from the formation of social groups to the complex protein interactions which form the basis of biological life. One approach to representing such higher order relationships is as a hypergraph. However, efforts to apply machine learning techniques to hypergraph structured datasets have been limited thus far. In this paper, we address the problem of link prediction in knowledge hypergraphs as well as simple hypergraphs and develop a novel, simple, and effective optimization architecture that addresses both tasks. Additionally, we introduce a novel feature extraction technique using node level clustering and we show how integrating data from node-level labels can improve system performance. Our self-supervised approach achieves significant improvement over state of the art baselines on several hyperedge prediction and knowledge hypergraph completion benchmarks.

[LG-32] Autoencoded UMAP-Enhanced Clustering for Unsupervised Learning

链接: https://arxiv.org/abs/2501.07729
作者: Malihehsadat Chavooshi,Alexander V. Mamonov
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel approach to unsupervised learning by constructing a non-linear embedding of the data into a low-dimensional space followed by any conventional clustering algorithm. The embedding promotes clusterability of the data and is comprised of two mappings: the encoder of an autoencoder neural network and the output of UMAP algorithm. The autoencoder is trained with a composite loss function that incorporates both a conventional data reconstruction as a regularization component and a clustering-promoting component built using the spectral graph theory. The two embeddings and the subsequent clustering are integrated into a three-stage unsupervised learning framework, referred to as Autoencoded UMAP-Enhanced Clustering (AUEC). When applied to MNIST data, AUEC significantly outperforms the state-of-the-art techniques in terms of clustering accuracy.

[LG-33] Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks NEURIPS2024

链接: https://arxiv.org/abs/2501.07727
作者: Tianyi Zhang,Linrong Cai,Jeffrey Li,Nicholas Roberts,Neel Guha,Jinoh Lee,Frederic Sala
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:Weak supervision (WS) is a popular approach for label-efficient learning, leveraging diverse sources of noisy but inexpensive weak labels to automatically annotate training data. Despite its wide usage, WS and its practical value are challenging to benchmark due to the many knobs in its setup, including: data sources, labeling functions (LFs), aggregation techniques (called label models), and end model pipelines. Existing evaluation suites tend to be limited, focusing on particular components or specialized use cases. Moreover, they often involve simplistic benchmark tasks or de-facto LF sets that are suboptimally written, producing insights that may not generalize to real-world settings. We address these limitations by introducing a new benchmark, BOXWRENCH, designed to more accurately reflect real-world usages of WS. This benchmark features tasks with (1) higher class cardinality and imbalance, (2) notable domain expertise requirements, and (3) multilingual variations across parallel corpora. For all tasks, LFs are written using a careful procedure aimed at mimicking real-world settings. In contrast to existing WS benchmarks, we show that supervised learning requires substantial amounts (1000+) of labeled examples to match WS in many settings.

[LG-34] An Adaptive Collocation Point Strategy For Physics Informed Neural Networks via the QR Discrete Empirical Interpolation Method

链接: https://arxiv.org/abs/2501.07700
作者: Adrian Celaya,David Fuentes,Beatrice Riviere
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have gained significant attention for solving forward and inverse problems related to partial differential equations (PDEs). While advancements in loss functions and network architectures have improved PINN accuracy, the impact of collocation point sampling on their performance remains underexplored. Fixed sampling methods, such as uniform random sampling and equispaced grids, can fail to capture critical regions with high solution gradients, limiting their effectiveness for complex PDEs. Adaptive methods, inspired by adaptive mesh refinement from traditional numerical methods, address this by dynamically updating collocation points during training but may overlook residual dynamics between updates, potentially losing valuable information. To overcome this limitation, we propose an adaptive collocation point selection strategy utilizing the QR Discrete Empirical Interpolation Method (QR-DEIM), a reduced-order modeling technique for efficiently approximating nonlinear functions. Our results on benchmark PDEs, including the wave, Allen-Cahn, and Burgers’ equations, demonstrate that our QR-DEIM-based approach improves PINN accuracy compared to existing methods, offering a promising direction for adaptive collocation point strategies.

[LG-35] Finite Sample Identification of Partially Observed Bilinear Dynamical Systems

链接: https://arxiv.org/abs/2501.07652
作者: Yahya Sattar,Yassir Jedra,Maryam Fazel,Sarah Dean
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of learning a realization of a partially observed bilinear dynamical system (BLDS) from noisy input-output data. Given a single trajectory of input-output samples, we provide a finite time analysis for learning the system’s Markov-like parameters, from which a balanced realization of the bilinear system can be obtained. Our bilinear system identification algorithm learns the system’s Markov-like parameters by regressing the outputs to highly correlated, nonlinear, and heavy-tailed covariates. Moreover, the stability of BLDS depends on the sequence of inputs used to excite the system. These properties, unique to partially observed bilinear dynamical systems, pose significant challenges to the analysis of our algorithm for learning the unknown dynamics. We address these challenges and provide high probability error bounds on our identification algorithm under a uniform stability assumption. Our analysis provides insights into system theoretic quantities that affect learning accuracy and sample complexity. Lastly, we perform numerical experiments with synthetic data to reinforce these insights.

[LG-36] Kolmogorov-Arnold Networks and Evolutionary Game Theory for More Personalized Cancer Treatment

链接: https://arxiv.org/abs/2501.07611
作者: Sepinoud Azimi,Louise Spekking,Kateřina Staňková
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Personalized cancer treatment is revolutionizing oncology by leveraging precision medicine and advanced computational techniques to tailor therapies to individual patients. Despite its transformative potential, challenges such as limited generalizability, interpretability, and reproducibility of predictive models hinder its integration into clinical practice. Current methodologies often rely on black-box machine learning models, which, while accurate, lack the transparency needed for clinician trust and real-world application. This paper proposes the development of an innovative framework that bridges Kolmogorov-Arnold Networks (KANs) and Evolutionary Game Theory (EGT) to address these limitations. Inspired by the Kolmogorov-Arnold representation theorem, KANs offer interpretable, edge-based neural architectures capable of modeling complex biological systems with unprecedented adaptability. Their integration into the EGT framework enables dynamic modeling of cancer progression and treatment responses. By combining KAN’s computational precision with EGT’s mechanistic insights, this hybrid approach promises to enhance predictive accuracy, scalability, and clinical usability.

[LG-37] An Explainable Pipeline for Machine Learning with Functional Data

链接: https://arxiv.org/abs/2501.07602
作者: Katherine Goode,J. Derek Tucker,Daniel Ries,Heike Hofmann
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine learning (ML) models have shown success in applications with an objective of prediction, but the algorithmic complexity of some models makes them difficult to interpret. Methods have been proposed to provide insight into these “black-box” models, but there is little research that focuses on supervised ML when the model inputs are functional data. In this work, we consider two applications from high-consequence spaces with objectives of making predictions using functional data inputs. One application aims to classify material types to identify explosive materials given hyperspectral computed tomography scans of the materials. The other application considers the forensics science task of connecting an inkjet printed document to the source printer using color signatures extracted by Raman spectroscopy. An instinctive route to consider for analyzing these data is a data driven ML model for classification, but due to the high consequence nature of the applications, we argue it is important to appropriately account for the nature of the data in the analysis to not obscure or misrepresent patterns. As such, we propose the Variable importance Explainable Elastic Shape Analysis (VEESA) pipeline for training ML models with functional data that (1) accounts for the vertical and horizontal variability in the functional data and (2) provides an explanation in the original data space of how the model uses variability in the functional data for prediction. The pipeline makes use of elastic functional principal components analysis (efPCA) to generate uncorrelated model inputs and permutation feature importance (PFI) to identify the principal components important for prediction. The variability captured by the important principal components in visualized the original data space. We ultimately discuss ideas for natural extensions of the VEESA pipeline and challenges for future research.

[LG-38] Analyzing Spatio-Temporal Dynamics of Dissolved Oxygen for the River Thames using Superstatistical Methods and Machine Learning

链接: https://arxiv.org/abs/2501.07599
作者: Hankun He,Takuya Boehringer,Benjamin Schäfer,Kate Heppell,Christian Beck
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:By employing superstatistical methods and machine learning, we analyze time series data of water quality indicators for the River Thames, with a specific focus on the dynamics of dissolved oxygen. After detrending, the probability density functions of dissolved oxygen fluctuations exhibit heavy tails that are effectively modeled using q -Gaussian distributions. Our findings indicate that the multiplicative Empirical Mode Decomposition method stands out as the most effective detrending technique, yielding the highest log-likelihood in nearly all fittings. We also observe that the optimally fitted width parameter of the q -Gaussian shows a negative correlation with the distance to the sea, highlighting the influence of geographical factors on water quality dynamics. In the context of same-time prediction of dissolved oxygen, regression analysis incorporating various water quality indicators and temporal features identify the Light Gradient Boosting Machine as the best model. SHapley Additive exPlanations reveal that temperature, pH, and time of year play crucial roles in the predictions. Furthermore, we use the Transformer to forecast dissolved oxygen concentrations. For long-term forecasting, the Informer model consistently delivers superior performance, achieving the lowest MAE and SMAPE with the 192 historical time steps that we used. This performance is attributed to the Informer’s ProbSparse self-attention mechanism, which allows it to capture long-range dependencies in time-series data more effectively than other machine learning models. It effectively recognizes the half-life cycle of dissolved oxygen, with particular attention to key intervals. Our findings provide valuable insights for policymakers involved in ecological health assessments, aiding in accurate predictions of river water quality and the maintenance of healthy aquatic ecosystems.

[LG-39] Automated Heterogeneous Network learning with Non-Recursive Message Passing

链接: https://arxiv.org/abs/2501.07598
作者: Zhaoqing Li,Maiqi Jiang,Shengyuan Chen,Bo Li,Guorong Chen,Xiao Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heterogeneous information networks (HINs) can be used to model various real-world systems. As HINs consist of multiple types of nodes, edges, and node features, it is nontrivial to directly apply graph neural network (GNN) techniques in heterogeneous cases. There are two remaining major challenges. First, homogeneous message passing in a recursive manner neglects the distinct types of nodes and edges in different hops, leading to unnecessary information mixing. This often results in the incorporation of ``noise’’ from uncorrelated intermediate neighbors, thereby degrading performance. Second, feature learning should be handled differently for different types, which is challenging especially when the type sizes are large. To bridge this gap, we develop a novel framework - AutoGNR, to directly utilize and automatically extract effective heterogeneous information. Instead of recursive homogeneous message passing, we introduce a non-recursive message passing mechanism for GNN to mitigate noise from uncorrelated node types in HINs. Furthermore, under the non-recursive framework, we manage to efficiently perform neural architecture search for an optimal GNN structure in a differentiable way, which can automatically define the heterogeneous paths for aggregation. Our tailored search space encompasses more effective candidates while maintaining a tractable size. Experiments show that AutoGNR consistently outperforms state-of-the-art methods on both normal and large scale real-world HIN datasets.

[LG-40] Learning-based Detection of GPS Spoofing Attack for Quadrotors

链接: https://arxiv.org/abs/2501.07597
作者: Pengyu Wang,Zhaohua Yang,Jialu Li,Ling Shi
类目: Robotics (cs.RO); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted in IEEE Industrial Electronics Society Annual Online Conference

点击查看摘要

Abstract:Safety-critical cyber-physical systems (CPS), such as quadrotor UAVs, are particularly prone to cyber attacks, which can result in significant consequences if not detected promptly and accurately. During outdoor operations, the nonlinear dynamics of UAV systems, combined with non-Gaussian noise, pose challenges to the effectiveness of conventional statistical and machine learning methods. To overcome these limitations, we present QUADFormer, an advanced attack detection framework for quadrotor UAVs leveraging a transformer-based architecture. This framework features a residue generator that produces sequences sensitive to anomalies, which are then analyzed by the transformer to capture statistical patterns for detection and classification. Furthermore, an alert mechanism ensures UAVs can operate safely even when under attack. Extensive simulations and experimental evaluations highlight that QUADFormer outperforms existing state-of-the-art techniques in detection accuracy.

[LG-41] A Multi-Layer CNN-GRUSKIP model based on transformer for spatial TEMPORAL traffic flow prediction

链接: https://arxiv.org/abs/2501.07593
作者: Karimeh Ibrahim Mohammad Ata,Mohd Khair Hassan,Ayad Ghany Ismaeel,Syed Abdul Rahman Al-Haddad,Thamer Alquthami,Sameer Alani
类目: Machine Learning (cs.LG)
*备注: 17 Pages, 18 Figures, 6 Tables

点击查看摘要

Abstract:Traffic flow prediction remains a cornerstone for intelligent transportation systems ITS, influencing both route optimization and environmental efforts. While Recurrent Neural Networks RNN and traditional Convolutional Neural Networks CNN offer some insights into the spatial temporal dynamics of traffic data, they are often limited when navigating sparse and extended spatial temporal patterns. In response, the CNN-GRUSKIP model emerges as a pioneering approach. Notably, it integrates the GRU-SKIP mechanism, a hybrid model that leverages the Gate Recurrent Unit of GRU capabilities to process sequences with the SKIP feature of ability to bypass and connect longer temporal dependencies, making it especially potent for traffic flow predictions with erratic and extended patterns. Another distinctive aspect is its non-standard 6-layer CNN, meticulously designed for in-depth spatiotemporal correlation extraction. The model comprises (1) the specialized CNN feature extraction, (2) the GRU-SKIP enhanced long-temporal module adept at capturing extended patterns, (3) a transformer module employing encoder-decoder and multi-attention mechanisms to hone prediction accuracy and trim model complexity, and (4) a bespoke prediction module. When tested against real-world datasets from California of Caltrans Performance Measurement System PeMS, specifically PeMS districts 4 and 8, the CNN-GRUSKIP consistently outperformed established models such as ARIMA, Graph Wave Net, HA, LSTM, STGCN, and APTN. With its potent predictive prowess and adaptive architecture, the CNN-GRUSKIP model stands to redefine ITS applications, especially where nuanced traffic dynamics are in play.

[LG-42] Avoiding subtraction and division of stochastic signals using normalizing flows: NFdeconvolve

链接: https://arxiv.org/abs/2501.08288
作者: Pedro Pessoa,Max Schweiger,Lance W.Q. Xu,Tristan Manha,Ayush Saurabh,Julian Antolin Camarena,Steve Pressé
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Data Analysis, Statistics and Probability (physics.data-an); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Across the scientific realm, we find ourselves subtracting or dividing stochastic signals. For instance, consider a stochastic realization, x , generated from the addition or multiplication of two stochastic signals a and b , namely x=a+b or x = ab . For the x=a+b example, a can be fluorescence background and b the signal of interest whose statistics are to be learned from the measured x . Similarly, when writing x=ab , a can be thought of as the illumination intensity and b the density of fluorescent molecules of interest. Yet dividing or subtracting stochastic signals amplifies noise, and we ask instead whether, using the statistics of a and the measurement of x as input, we can recover the statistics of b . Here, we show how normalizing flows can generate an approximation of the probability distribution over b , thereby avoiding subtraction or division altogether. This method is implemented in our software package, NFdeconvolve, available on GitHub with a tutorial linked in the main text.

[LG-43] Data-driven system identification using quadratic embeddings of nonlinear dynamics

链接: https://arxiv.org/abs/2501.08202
作者: Stefan Klus,Joel-Pascal N’Konzi
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a novel data-driven method called QENDy (Quadratic Embedding of Nonlinear Dynamics) that not only allows us to learn quadratic representations of highly nonlinear dynamical systems, but also to identify the governing equations. The approach is based on an embedding of the system into a higher-dimensional feature space in which the dynamics become quadratic. Just like SINDy (Sparse Identification of Nonlinear Dynamics), our method requires trajectory data, time derivatives for the training data points, which can also be estimated using finite difference approximations, and a set of preselected basis functions, called dictionary. We illustrate the efficacy and accuracy of QENDy with the aid of various benchmark problems and compare its performance with SINDy and a deep learning method for identifying quadratic embeddings. Furthermore, we analyze the convergence of QENDy and SINDy in the infinite data limit, highlight their similarities and main differences, and compare the quadratic embedding with linearization techniques based on the Koopman operator.

[LG-44] Globally Convergent Variational Inference NEURIPS2024

链接: https://arxiv.org/abs/2501.08201
作者: Declan McNamara,Jackson Loper,Jeffrey Regier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:In variational inference (VI), an approximation of the posterior distribution is selected from a family of distributions through numerical optimization. With the most common variational objective function, known as the evidence lower bound (ELBO), only convergence to a local optimum can be guaranteed. In this work, we instead establish the global convergence of a particular VI method. This VI method, which may be considered an instance of neural posterior estimation (NPE), minimizes an expectation of the inclusive (forward) KL divergence to fit a variational distribution that is parameterized by a neural network. Our convergence result relies on the neural tangent kernel (NTK) to characterize the gradient dynamics that arise from considering the variational objective in function space. In the asymptotic regime of a fixed, positive-definite neural tangent kernel, we establish conditions under which the variational objective admits a unique solution in a reproducing kernel Hilbert space (RKHS). Then, we show that the gradient descent dynamics in function space converge to this unique function. In ablation studies and practical problems, we demonstrate that our results explain the behavior of NPE in non-asymptotic finite-neuron settings, and show that NPE outperforms ELBO-based optimization, which often converges to shallow local optima.

[LG-45] On the use of Statistical Learning Theory for model selection in Structural Health Monitoring

链接: https://arxiv.org/abs/2501.08050
作者: C. A. Lindley,N. Dervilis,K. Worden
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Whenever data-based systems are employed in engineering applications, defining an optimal statistical representation is subject to the problem of model selection. This paper focusses on how well models can generalise in Structural Health Monitoring (SHM). Although statistical model validation in this field is often performed heuristically, it is possible to estimate generalisation more rigorously using the bounds provided by Statistical Learning Theory (SLT). Therefore, this paper explores the selection process of a kernel smoother for modelling the impulse response of a linear oscillator from the perspective of SLT. It is demonstrated that incorporating domain knowledge into the regression problem yields a lower guaranteed risk, thereby enhancing generalisation.

[LG-46] Gen-A: Generalizing Ambisonics Neural Encoding to Unseen Microphone Arrays

链接: https://arxiv.org/abs/2501.08047
作者: Mikko Heikkinen,Archontis Politis,Konstantinos Drossos,Tuomas Virtanen
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted for publication in Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing

点击查看摘要

Abstract:Using deep neural networks (DNNs) for encoding of microphone array (MA) signals to the Ambisonics spatial audio format can surpass certain limitations of established conventional methods, but existing DNN-based methods need to be trained separately for each MA. This paper proposes a DNN-based method for Ambisonics encoding that can generalize to arbitrary MA geometries unseen during training. The method takes as inputs the MA geometry and MA signals and uses a multi-level encoder consisting of separate paths for geometry and signal data, where geometry features inform the signal encoder at each level. The method is validated in simulated anechoic and reverberant conditions with one and two sources. The results indicate improvement over conventional encoding across the whole frequency range for dry scenes, while for reverberant scenes the improvement is frequency-dependent.

[LG-47] Concentration of Measure for Distributions Generated via Diffusion Models

链接: https://arxiv.org/abs/2501.07741
作者: Reza Ghane,Anthony Bao,Danil Akhtiamov,Babak Hassibi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We show via a combination of mathematical arguments and empirical evidence that data distributions sampled from diffusion models satisfy a Concentration of Measure Property saying that any Lipschitz 1 -dimensional projection of a random vector is not too far from its mean with high probability. This implies that such models are quite restrictive and gives an explanation for a fact previously observed in arXiv:2410.14171 that conventional diffusion models cannot capture “heavy-tailed” data (i.e. data \mathbfx for which the norm |\mathbfx|_2 does not possess a subgaussian tail) well. We then proceed to train a generalized linear model using stochastic gradient descent (SGD) on the diffusion-generated data for a multiclass classification task and observe empirically that a Gaussian universality result holds for the test error. In other words, the test error depends only on the first and second order statistics of the diffusion-generated data in the linear setting. Results of such forms are desirable because they allow one to assume the data itself is Gaussian for analyzing performance of the trained classifier. Finally, we note that current approaches to proving universality do not apply to this case as the covariance matrices of the data tend to have vanishing minimum singular values for the diffusion-generated data, while the current proofs assume that this is not the case (see Subsection 3.4 for more details). This leaves extending previous mathematical universality results as an intriguing open question. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2501.07741 [stat.ML] (or arXiv:2501.07741v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2501.07741 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-48] Multi-megabase scale genome interpretation with genetic language models

链接: https://arxiv.org/abs/2501.07737
作者: Frederik Träuble,Lachlan Stuart,Andreas Georgiou,Pascal Notin,Arash Mehrjou,Ron Schwessinger,Mathieu Chevalley,Kim Branson,Bernhard Schölkopf,Cornelia van Duijn,Debora Marks,Patrick Schwab
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because its consequences manifest across a wide range of cells, tissues and scales – spanning from molecular to whole organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150 000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.

[LG-49] A Step Toward Interpretability: Smearing the Likelihood

链接: https://arxiv.org/abs/2501.07643
作者: Andrew J. Larkoski
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Machine Learning (stat.ML)
*备注: 16+1 pages, 3 figures

点击查看摘要

Abstract:The problem of interpretability of machine learning architecture in particle physics has no agreed-upon definition, much less any proposed solution. We present a first modest step toward these goals by proposing a definition and corresponding practical method for isolation and identification of relevant physical energy scales exploited by the machine. This is accomplished by smearing or averaging over all input events that lie within a prescribed metric energy distance of one another and correspondingly renders any quantity measured on a finite, discrete dataset continuous over the dataspace. Within this approach, we are able to explicitly demonstrate that (approximate) scaling laws are a consequence of extreme value theory applied to analysis of the distribution of the irreducible minimal distance over which a machine must extrapolate given a finite dataset. As an example, we study quark versus gluon jet identification, construct the smeared likelihood, and show that discrimination power steadily increases as resolution decreases, indicating that the true likelihood for the problem is sensitive to emissions at all scales.

信息检索

[IR-0] riMod Fusion for Multimodal Named Entity Recognition in Social Media

链接: https://arxiv.org/abs/2501.08267
作者: Mosab Alfaqeeh
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: Accepted at CASCON

点击查看摘要

Abstract:Social media platforms serve as invaluable sources of user-generated content, offering insights into various aspects of human behavior. Named Entity Recognition (NER) plays a crucial role in analyzing such content by identifying and categorizing named entities into predefined classes. However, traditional NER models often struggle with the informal, contextually sparse, and ambiguous nature of social media language. To address these challenges, recent research has focused on multimodal approaches that leverage both textual and visual cues for enhanced entity recognition. Despite advances, existing methods face limitations in capturing nuanced mappings between visual objects and textual entities and addressing distributional disparities between modalities. In this paper, we propose a novel approach that integrates textual, visual, and hashtag features (TriMod), utilizing Transformer-attention for effective modality fusion. The improvements exhibited by our model suggest that named entities can greatly benefit from the auxiliary context provided by multiple modalities, enabling more accurate recognition. Through the experiments on a multimodal social media dataset, we demonstrate the superiority of our approach over existing state-of-the-art methods, achieving significant improvements in precision, recall, and F1 score.

[IR-1] Unsupervised Query Routing for Retrieval Augmented Generation

链接: https://arxiv.org/abs/2501.07793
作者: Feiteng Mu,Liwen Zhang,Yong Jiang,Wenjie Li,Zhen Zhang,Pengjun Xie,Fei Huang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Query routing for retrieval-augmented generation aims to assign an input query to the most suitable search engine. Existing works rely heavily on supervised datasets that require extensive manual annotation, resulting in high costs and limited scalability, as well as poor generalization to out-of-distribution scenarios. To address these challenges, we introduce a novel unsupervised method that constructs the “upper-bound” response to evaluate the quality of retrieval-augmented responses. This evaluation enables the decision of the most suitable search engine for a given query. By eliminating manual annotations, our approach can automatically process large-scale real user queries and create training data. We conduct extensive experiments across five datasets, demonstrating that our method significantly enhances scalability and generalization capabilities.

[IR-2] Constructing Set-Compositional and Negated Representations for First-Stage Ranking

链接: https://arxiv.org/abs/2501.07679
作者: Antonios Minas Krasakis,Andrew Yates,Evangelos Kanoulas
类目: Information Retrieval (cs.IR)
*备注: 12 pages

点击查看摘要

Abstract:Set compositional and negated queries are crucial for expressing complex information needs and enable the discovery of niche items like Books about non-European monarchs. Despite the recent advances in LLMs, first-stage ranking remains challenging due to the requirement of encoding documents and queries independently from each other. This limitation calls for constructing compositional query representations that encapsulate logical operations or negations, and can be used to match relevant documents effectively. In the first part of this work, we explore constructing such representations in a zero-shot setting using vector operations between lexically grounded Learned Sparse Retrieval (LSR) representations. Specifically, we introduce Disentangled Negation that penalizes only the negated parts of a query, and a Combined Pseudo-Term approach that enhances LSRs ability to handle intersections. We find that our zero-shot approach is competitive and often outperforms retrievers fine-tuned on compositional data, highlighting certain limitations of LSR and Dense Retrievers. Finally, we address some of these limitations and improve LSRs representation power for negation, by allowing them to attribute negative term scores and effectively penalize documents containing the negated terms.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2025-01-15

目录

概览 (2025-01-15)

自然语言处理

计算机视觉

人工智能

机器学习

信息检索

附件下载