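arXiv Daily Papers | 2025-01-17

This post lists the latest papers retrieved from arXiv.org on 2025-01-17. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically every day at around 12:00.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.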

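Quick read: This paper addresses problems with retrieval-augmented generation (RAG) in machine writing based on large language models (LLMs). Conventionally retrieved information tends to lack depth and utility and to be redundant, so the generated articles come out shallow, repetitive, and unoriginal. The paper proposes OmniThink, a framework whose core idea is to simulate the cognitive behavior of human learners as they progressively deepen their understanding of a topic, generating content through iterative expansion and reflection. Experiments show that OmniThink improves the knowledge density of generated articles without compromising coherence or depth. Human evaluation and expert feedback further support its potential for the practical challenges of long-form generation.

Link: https://arxiv.org/abs/2501.09751
Authors: Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Ningyu Zhang, Jiang Yong, Pengjun Xie, Fei Huang, Huajun Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: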

Abstract:Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model’s predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth, utility, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, repetitive, and unoriginal outputs. To address these issues, we propose OmniThink, a machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they progressively deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles.

[NLP-1] Enhancing Lexicon-Based Text Embeddings with Large Language Models

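Quick read: This paper addresses the tokenization redundancy and the unidirectional-attention limitation of traditional causal large language models (LLMs) when they are used for text embedding. Prior work relies mainly on dense embeddings; this paper proposes Lexicon-based EmbeddiNgS (LENS), which consolidates the vocabulary space by clustering token embeddings and investigates bidirectional attention and several pooling strategies. The key idea is to assign each embedding dimension to a specific token cluster in which semantically similar tokens are grouped, so as to make fuller use of the LLM. Experiments show that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), and that combining LENS with dense embeddings achieves state-of-the-art performance on the MTEB retrieval subset (BEIR).

Link: https://arxiv.org/abs/2501.09749
Authors: Yibin Lei, Tao Shen, Yu Cao, Andrew Yates
Affiliations: University of Amsterdam; University of Technology Sydney; Tencent IEG
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: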

Abstract:Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first Lexicon-based EmbeddiNgS (LENS) leveraging LLMs that achieve competitive performance on these tasks. Regarding the inherent tokenization redundancy issue and unidirectional attention limitations in traditional causal LLMs, LENS consolidates the vocabulary space through token embedding clustering, and investigates bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e. BEIR).
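
To make the "one dimension per token cluster" idea concrete, here is a minimal sketch. It is not the authors' implementation: the function name `lexicon_embedding`, the toy scores, the cluster assignment, and the choice of max-pooling within a cluster are all assumptions made for illustration.

```python
def lexicon_embedding(token_scores, token_to_cluster, num_clusters):
    """Collapse vocabulary-sized scores into one value per token cluster.

    token_scores: one non-negative score per vocabulary token (hypothetical input,
                  e.g. pooled and rectified LLM logits).
    token_to_cluster: cluster index for each vocabulary token.
    """
    embedding = [0.0] * num_clusters
    for token_id, score in enumerate(token_scores):
        cluster = token_to_cluster[token_id]
        # Max-pooling inside a cluster is an illustrative assumption,
        # not something the abstract specifies.
        embedding[cluster] = max(embedding[cluster], score)
    return embedding


# Toy vocabulary of five tokens grouped into three clusters.
scores = [0.1, 0.9, 0.0, 0.4, 0.7]
clusters = [0, 0, 1, 2, 2]
vec = lexicon_embedding(scores, clusters, num_clusters=3)
print(vec)
```

The resulting vector has as many dimensions as there are clusters rather than vocabulary entries, which is how a lexicon-style representation can end up the same size as a dense embedding.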

[NLP-2] Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models

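Quick read: This paper addresses the difficulty machine learning developers face in maintaining Jupyter notebooks, in particular when adding features or fixing bugs. Notebooks are long and complex, which makes maintenance hard, and no benchmark of developer edits on notebooks existed. The paper presents a dataset of 48,398 Jupyter notebook edits drawn from 20,095 revisions of 792 machine learning repositories on GitHub. It records modifications at both the cell level and the line level, providing a basis for understanding real maintenance patterns in machine learning workflows. The paper is also the first to study using large language models (LLMs) to predict code edits in Jupyter notebooks: larger models outperform smaller ones, but all models have low accuracy on the dataset, even after fine-tuning. This reflects the complexity of real-world maintenance tasks and underlines the role of contextual information in improving model performance.

Link: https://arxiv.org/abs/2501.09745
Authors: Bihui Jin, Jiayue Wang, Pengyu Nie
Affiliations: University of Waterloo
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: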

Abstract:Machine learning developers frequently use interactive computational notebooks, such as Jupyter notebooks, to host code for data processing and model training. Jupyter notebooks provide a convenient tool for writing machine learning pipelines and interactively observing outputs; however, maintaining Jupyter notebooks, e.g., to add new features or fix bugs, can be challenging due to the length and complexity of the notebooks. Moreover, there is no existing benchmark related to developer edits on Jupyter notebooks. To address this, we present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub, and perform the first study of using LLMs to predict code edits in Jupyter notebooks. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning workflows. We observed that the edits on Jupyter notebooks are highly localized, with changes averaging only 166 lines of code in repositories. While larger models outperform smaller counterparts in code editing, all models have low accuracy on our dataset even after finetuning, demonstrating the complexity of real-world machine learning maintenance tasks. Our findings emphasize the critical role of contextual information in improving model performance and point toward promising avenues for advancing large language models’ capabilities in engineering machine learning code.

[NLP-3] Attention based Bidirectional GRU hybrid model for inappropriate content detection in Urdu language

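Quick read: This paper addresses the identification of inappropriate content in Urdu social media text. Urdu spellings are not unique, and Urdu is often mixed with other languages such as English, which complicates text processing. Deep learning has seen little application to Urdu, especially for inappropriate-content detection. The paper proposes an attention-based Bidirectional GRU (BiGRU) hybrid model, in which the attention layer is intended to handle long-term dependencies and improve efficiency. It compares the model against four baseline deep learning models (LSTM, Bi-LSTM, GRU, and TCN) and evaluates the effect of pre-trained word2Vec embeddings. The proposed BiGRU-A model reaches 84% accuracy without pre-trained word embeddings, outperforming the baselines, and the attention layer improves the efficiency of the model.

Link: https://arxiv.org/abs/2501.09722
Authors: Ezzah Shoukat, Rabia Irfan, Iqra Basharat, Muhammad Ali Tahir, Sameen Shaukat
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: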

Abstract:With the increased use of the internet and social networks for online discussions, the spread of toxic and inappropriate content on social networking sites has also increased. Several studies have been conducted in different languages. However, less work has been done on inappropriate content identification for South Asian languages using deep learning techniques. In Urdu, spellings are not unique, and people write different common spellings for the same word; mixing Urdu with other languages, such as English, makes the text more challenging, and limited research is available on processing such language with the finest algorithms. Using an attention layer with a deep learning model can help handle long-term dependencies and increase its efficiency. To explore the effects of the attention layer, this study proposes an attention-based Bidirectional GRU hybrid model for identifying inappropriate content in Urdu Unicode text. Four baseline deep learning models (LSTM, Bi-LSTM, GRU, and TCN) are used to compare the performance of the proposed model. The results of these models were compared based on evaluation metrics, dataset size, and impact of the word embedding layer. The pre-trained Urdu word2Vec embeddings were utilized for our case. Our proposed model BiGRU-A outperformed all other baseline models by yielding 84% accuracy without using pre-trained word2Vec layer. From our experiments, we have established that the attention layer improves the model’s efficiency, and pre-trained word2Vec embedding does not work well with an inappropriate content dataset.

[NLP-4] Comparative Insights from 12 Machine Learning Models in Extracting Economic Ideology from Political Text

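Quick read: This paper systematically assesses 12 machine learning models and model variations on detecting economic ideology. The benchmark is manifesto data from six United Kingdom elections, pre-annotated by expert and crowd coders. The study compares generative models (such as GPT-4o and Gemini 1.5 Flash), fine-tuned models, and zero-shot models at the granular and aggregate levels. Generative models perform best against all benchmarks but raise issues of accessibility and resource availability. Fine-tuned models are a reliable alternative through domain-specific optimization, though their dependence on training data limits scalability. Zero-shot models struggle to identify signals of economic ideology and often correlate negatively with human coding. Other findings include considerable within-party variation, gains for fine-tuning from larger training data, and the sensitivity of zero-shot models to prompt content. The paper also derives best practices for automated analysis of political content.

Link: https://arxiv.org/abs/2501.09719
Authors: Jihed Ncib
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: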

Abstract:This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology. As an evaluation benchmark, I use manifesto data spanning six elections in the United Kingdom and pre-annotated by expert and crowd coders. The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels. The results show that generative models such as GPT-4o and Gemini 1.5 Flash consistently outperform other models against all benchmarks. However, they pose issues of accessibility and resource availability. Fine-tuning yielded competitive performance and offers a reliable alternative through domain-specific optimization. But its dependency on training data severely limits scalability. Zero-shot models consistently face difficulties with identifying signals of economic ideology, often resulting in negative associations with human coding. Using general knowledge for the domain-specific task of ideology scaling proved to be unreliable. Other key findings include considerable within-party variation, fine-tuning benefiting from larger training data, and zero-shot’s sensitivity to prompt content. The assessments include the strengths and limitations of each model and derive best-practices for automated analyses of political content.

[NLP-5] Domain Adaptation of Foundation LLMs for e-Commerce

Quick read: This paper addresses how to adapt large language models (LLMs) to the e-commerce domain. The authors present the e-Llama models, with 8 billion and 70 billion parameters, obtained by continuous pretraining of the Llama 3.1 base models on 1 trillion tokens of domain-specific data. The key to the approach is a series of ablation studies used to choose the hyperparameters, so that the models adapt to e-commerce without sacrificing significant performance on general-domain tasks. The authors also explore merging the adapted model with the base model to better control the performance trade-off between domains.

Link: https://arxiv.org/abs/2501.09706
Authors: Christian Herold, Michael Kozielski, Tala Bazazo, Pavel Petrushkov, Hadi Hashemi, Patrycja Cieplicka, Dominika Basaj, Shahram Khadivi
Affiliations: eBay Inc.
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present the e-Llama models: 8 billion and 70 billion parameter large language models that are adapted towards the e-commerce domain. These models are meant as foundation models with deep knowledge about e-commerce, that form a base for instruction- and fine-tuning. The e-Llama models are obtained by continuously pretraining the Llama 3.1 base models on 1 trillion tokens of domain-specific data. We discuss our approach and motivate our choice of hyperparameters with a series of ablation studies. To quantify how well the models have been adapted to the e-commerce domain, we define and implement a set of multilingual, e-commerce specific evaluation tasks. We show that, when carefully choosing the training setup, the Llama 3.1 models can be adapted towards the new domain without sacrificing significant performance on general domain tasks. We also explore the possibility of merging the adapted model and the base model for a better control of the performance trade-off between domains.
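
The abstract does not say how the adapted and base models are merged. One simple option, assumed here purely for illustration, is linear interpolation of the two sets of weights with a mixing coefficient `alpha`. The function name `merge_weights` and the toy parameter dictionaries are hypothetical.

```python
def merge_weights(base, adapted, alpha):
    """Linearly interpolate two models' parameters.

    alpha = 0.0 returns the base model, alpha = 1.0 the adapted model.
    Linear interpolation is an assumed merging rule, not the paper's stated method.
    Both models must share parameter names and shapes.
    """
    merged = {}
    for name, base_values in base.items():
        adapted_values = adapted[name]
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base_values, adapted_values)
        ]
    return merged


# Toy "models" with a single flattened parameter tensor each.
base_model = {"w": [0.0, 2.0]}
adapted_model = {"w": [1.0, 4.0]}
halfway = merge_weights(base_model, adapted_model, alpha=0.5)
print(halfway)
```

Sweeping `alpha` between 0 and 1 is one way to trade general-domain performance against domain performance without retraining.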

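[NLP-6] Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Quick read: This survey examines how large language models (LLMs) can be used for complex reasoning tasks, centered on the paradigm of introducing "thought", a sequence of tokens representing intermediate reasoning steps. Two directions are key: (1) at training time, reinforcement learning (RL) enables automatic generation of high-quality reasoning trajectories, expanding the reasoning capacity of LLMs; (2) at test time, encouraging the model to reason with more "thought" tokens further improves reasoning accuracy. Together, train-time and test-time scaling point toward Large Reasoning Models, a direction in which the o1 series from OpenAI marks a significant milestone.

Link: https://arxiv.org/abs/2501.09686
Authors: Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li
Affiliations: Tsinghua University; HKUST (GZ); Emory University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 36 pages, 5 figures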

Abstract:Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of “thought” – a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs’ reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to “think” with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to show a new research frontier – a path toward Large Reasoning Models. The introduction of OpenAI’s o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects aimed at building large reasoning models, and conclude with open challenges and future research directions.

[NLP-7] The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Quick read: This paper addresses data contamination: the popularity of large language models has led to code datasets being used so widely for training that little code remains for downstream studies of model behavior or for evaluation. The authors release The Heap, a large multilingual code dataset covering 57 programming languages that has been deduplicated with respect to other open code datasets. Because the deduplication is already done, researchers can evaluate large language models fairly without significant data-cleaning overhead.

Link: https://arxiv.org/abs/2501.09653
Authors: Jonathan Katzy, Razvan Mihai Popescu, Arie van Deursen, Maliheh Izadi
Affiliations: Delft University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Pre-Print. Accepted to FORGE 2025 Dataset Track

Abstract:The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.
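
As a rough illustration of what "deduplicated with respect to other open datasets" can mean in practice, the sketch below drops any candidate file whose whitespace-normalized content already appears in a reference corpus. Exact-match hashing is an assumption chosen for brevity; the abstract does not describe the procedure actually used, and the helper names are hypothetical.

```python
import hashlib


def fingerprint(source: str) -> str:
    """Hash a file after collapsing whitespace, so formatting-only changes still match."""
    normalized = " ".join(source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(candidates, reference_corpus):
    """Keep candidates that match neither the reference corpus nor an earlier candidate."""
    seen = {fingerprint(doc) for doc in reference_corpus}
    kept = []
    for doc in candidates:
        digest = fingerprint(doc)
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


reference = ["def add(a, b):\n    return a + b\n"]
candidates = [
    "def add(a,  b):\n  return a + b",    # same code as the reference, reformatted
    "def sub(a, b):\n    return a - b\n",  # new code
    "def sub(a, b):\n    return a - b\n",  # duplicate within the candidates
]
kept = deduplicate(candidates, reference)
print(len(kept))
```

Exact matching only catches copies that differ in whitespace; catching lightly edited copies would need a near-duplicate method, which this sketch does not attempt.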

[NLP-8] CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding COLING2025

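Quick read: This paper addresses the difficulty current voice assistants have in retaining user preferences over the long term, which leads to repeated user requests and disengagement. It also considers the opaque extraction of user preferences in industry applications and the resulting privacy concerns, especially in regions with stringent regulations such as Europe. The paper proposes a long-term memory system structured around predefined categories that uses large language models to extract, store, and retrieve user preferences, supporting both personalisation and transparency. It also introduces CarMem, a synthetic multi-turn, multi-session conversation dataset. On this dataset the system achieves an F1-score of 0.78 to 0.95 in preference extraction, its maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, and retrieval accuracy reaches 0.87. The results indicate that the system is suitable for industrial applications.

Link: https://arxiv.org/abs/2501.09645
Authors: Johannes Kirmayr, Lukas Stappen, Phillip Schneider, Florian Matthes, Elisabeth André
Affiliations: BMW Group Research and Technology, Munich, Germany; Chair for Human-Centered Artificial Intelligence, University of Augsburg, Germany; Chair for Software Engineering for Business Information Systems, Technical University of Munich, Germany
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted for presentation at the International Conference on Computational Linguistics (COLING 2025)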

Abstract:In today’s assistant landscape, personalisation enhances interactions, fosters long-term relationships, and deepens engagement. However, many systems struggle with retaining user preferences, leading to repetitive user requests and disengagement. Furthermore, the unregulated and opaque extraction of user preferences in industry applications raises significant concerns about privacy and trust, especially in regions with stringent regulations like Europe. In response to these challenges, we propose a long-term memory system for voice assistants, structured around predefined categories. This approach leverages Large Language Models to efficiently extract, store, and retrieve preferences within these categories, ensuring both personalisation and transparency. We also introduce a synthetic multi-turn, multi-session conversation dataset (CarMem), grounded in real industry data, tailored to an in-car voice assistant setting. Benchmarked on the dataset, our system achieves an F1-score of .78 to .95 in preference extraction, depending on category granularity. Our maintenance strategy reduces redundant preferences by 95% and contradictory ones by 92%, while the accuracy of optimal retrieval is at .87. Collectively, the results demonstrate the system’s suitability for industrial applications.

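[NLP-9] From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs

Quick read: This paper addresses fake news detection in low-resource languages such as Bangla, which lack adequate datasets and detection tools. Manual fact-checking is accurate but too expensive and slow to stop fake news from spreading. The authors introduce BanFakeNews-2.0, a dataset for Bangla fake news detection. It adds 11,700 curated fake news articles validated against credible sources, giving a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. A separate, independent test set of 460 fake and 540 authentic items supports rigorous evaluation. The benchmark system uses transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers (BERT) variants (F1 of 87%) and large language models with Quantized Low-Rank Approximation (F1 of 89%), which significantly outperform traditional methods. The dataset and models are publicly released.

Link: https://arxiv.org/abs/2501.09604
Authors: Hrithik Majumdar Shibu, Shrestha Datta, Md. Sumon Miah, Nasrullah Sami, Mahruba Sharmin Chowdhury, Md. Saiful Islam
Affiliations: Shahjalal University of Science and Technology, Sylhet, Bangladesh
Subjects: Computation and Language (cs.CL)
Comments: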

Abstract:The rapid spread of fake news presents a significant global challenge, particularly in low-resource languages like Bangla, which lack adequate datasets and detection tools. Although manual fact-checking is accurate, it is too expensive and slow to prevent the dissemination of fake news. Addressing this gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news detection. This version includes 11,700 additional, meticulously curated fake news articles validated from credible sources, creating a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. In addition, we created a manually curated independent test set of 460 fake and 540 authentic news items for rigorous evaluation. We invested effort in collecting fake news from credible sources and verifying it manually while preserving its linguistic richness. We develop a benchmark system utilizing transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers variants (F1-87%) and Large Language Models with Quantized Low-Rank Approximation (F1-89%), that significantly outperforms traditional methods. BanFakeNews-2.0 offers a valuable resource to advance research and application in fake news detection for low-resourced languages. We publicly release our dataset and model on GitHub to foster research in this direction.

[NLP-10] Stylomech: Unveiling Authorship via Computational Stylometry in English and Romanized Sinhala

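Quick read: This paper addresses content violation in the Web 2.0 era, and specifically the need for author attribution in English and Romanized Sinhala. As global communication has grown, content violation has increased, making copyright claims and author identification more important. The proposed attribution system compares only two sets of text, the text of the suspect author and the anonymous text, rather than relying on the large corpora that traditional methods often need. Pairs of texts by the same and by different authors are converted to numerical representations, and the model is trained on these representations rather than on raw text, so it can apply to many authors and contexts provided both texts are of reasonable quality. The work extends authorship attribution to more linguistic contexts and supports trust and accountability in digital communication, especially in Sri Lanka.

Link: https://arxiv.org/abs/2501.09561
Authors: Nabeelah Faumi, Adeepa Gunathilake, Benura Wickramanayake, Deelaka Dias, TGDK Sumanathilaka
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 3 figures, 1 image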

Abstract:With the advent of Web 2.0, developments in social technology coupled with global communication have brought both positive and negative impacts to society. Copyright claims and author identification are deemed crucial, as content violation has increased considerably owing to the lack of proper ethics in society. Author attribution in both English and Romanized Sinhala has become a major requirement in the last few decades. As the area is largely unexplored, particularly within the context of Romanized Sinhala, the research contributes significantly to the field of computational linguistics. The proposed author attribution system offers a unique approach, allowing for the comparison of only two sets of text, the suspect author text and the anonymous text, a departure from traditional methodologies, which often rely on larger corpora. This work uses numerical representations of various pairs of texts by the same and by different authors, so that the model trains on these representations rather than on text; this allows it to apply to a multitude of authors and contexts, given that the suspected author text and the anonymous text are of reasonable quality. By expanding the scope of authorship attribution to encompass diverse linguistic contexts, the work contributes to fostering trust and accountability in digital communication, especially in Sri Lanka. This research presents a pioneering approach to author attribution in both English and Romanized Sinhala, addressing a critical need for content verification and intellectual property rights enforcement in the digital age.

[NLP-11] Analyzing Continuous Semantic Shifts with Diachronic Word Similarity Matrices COLING2025

Quick read: This paper addresses how to understand and analyze the semantic shift of words across multiple time periods. Detecting change points only between adjacent periods does not capture the shift in detail, and BERT-based methods that examine word sense proportions are computationally expensive. The paper proposes a simple, intuitive framework based on a similarity matrix between the embeddings of the same word in different time periods: a diachronic word similarity matrix computed across arbitrary periods from fast, lightweight word embeddings, which allows deeper analysis of continuous semantic shifts. Clustering the similarity matrices of different words then groups words with similar shift behavior in an unsupervised manner.

Link: https://arxiv.org/abs/2501.09538
Authors: Hajime Kiyama, Taichi Aida, Mamoru Komachi, Toshinobu Ogiso, Hiroya Takamura, Daichi Mochihashi
Affiliations: Tokyo Metropolitan University; Hitotsubashi University; National Institute for Japanese Language and Linguistics; National Institute of Advanced Industrial Science and Technology; The Institute of Statistical Mathematics
Subjects: Computation and Language (cs.CL)
Comments: COLING2025

Abstract:The meanings and relationships of words shift over time. This phenomenon is referred to as semantic shift. Research focused on understanding how semantic shifts occur over multiple time periods is essential for gaining a detailed understanding of semantic shifts. However, detecting change points only between adjacent time periods is insufficient for analyzing detailed semantic shifts, and using BERT-based methods to examine word sense proportions incurs a high computational cost. To address those issues, we propose a simple yet intuitive framework for how semantic shifts occur over multiple time periods by leveraging a similarity matrix between the embeddings of the same word through time. We compute a diachronic word similarity matrix using fast and lightweight word embeddings across arbitrary time periods, enabling deeper analysis of continuous semantic shifts. Moreover, by clustering the similarity matrices for different words, we can categorize words that exhibit similar behavior of semantic shift in an unsupervised manner.
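
The diachronic word similarity matrix can be sketched as follows, as a minimal illustration rather than the authors' code. It assumes one embedding per time period for the same word, that the embeddings live in a shared (aligned) vector space, and that cosine similarity is the similarity measure; the toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diachronic_similarity_matrix(embeddings_by_period):
    """Entry [i][j] compares the word's embedding in period i with the one in period j."""
    return [
        [cosine(u, v) for v in embeddings_by_period]
        for u in embeddings_by_period
    ]


# One word observed in three periods: stable at first, then a change in meaning.
word_over_time = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
matrix = diachronic_similarity_matrix(word_over_time)
for row in matrix:
    print(row)
```

A block of high values along the diagonal followed by a drop suggests a period of stable meaning and then a shift. Because each word yields a matrix of the same shape, the matrices can be flattened and clustered to group words with similar shift patterns.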

[NLP-12] Confidence Estimation for Error Detection in Text-to-SQL Systems AAAI2025

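Quick read: This paper addresses two main challenges for Text-to-SQL systems: robust generalization across diverse queries and interpretable confidence in model predictions. It investigates integrating selective classifiers into Text-to-SQL systems. The key elements are (1) entropy-based confidence estimation, used to analyze the trade-off between coverage and risk and its effect on overall performance, and (2) calibration techniques that improve the initial calibration of the models so that confidence aligns better with accuracy. Experiments show that T5 is better calibrated than GPT-4 and Llama 3, so the entropy-based selective classifier built on it performs better. In error detection, the selective classifier is more likely to catch errors associated with irrelevant questions than incorrect query generations.

Link: https://arxiv.org/abs/2501.09527
Authors: Oleg Somov, Elena Tutubalina
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 15 pages, 11 figures, to be published in AAAI 2025 Proceedings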

Abstract:Text-to-SQL enables users to interact with databases through natural language, simplifying the retrieval and synthesis of information. Despite the success of large language models (LLMs) in converting natural language questions into SQL queries, their broader adoption is limited by two main challenges: achieving robust generalization across diverse queries and ensuring interpretative confidence in their predictions. To tackle these issues, our research investigates the integration of selective classifiers into Text-to-SQL systems. We analyse the trade-off between coverage and risk using entropy-based confidence estimation with selective classifiers and assess its impact on the overall performance of Text-to-SQL models. Additionally, we explore the models’ initial calibration and improve it with calibration techniques for better model alignment between confidence and accuracy. Our experimental results show that encoder-decoder T5 is better calibrated than in-context-learning GPT 4 and decoder-only Llama 3, thus the designated external entropy-based selective classifier has better performance. The study also reveals that, in terms of error detection, selective classifier with a higher probability detects errors associated with irrelevant questions rather than incorrect query generations.
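
A minimal sketch of entropy-based selective prediction, under stated assumptions: the model exposes a probability distribution for each generated token, sequence-level uncertainty is the mean token entropy, and a fixed threshold decides whether to answer or abstain. The aggregation rule, the threshold value, and the toy distributions are illustrative choices, not details taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sequence_entropy(token_distributions):
    """Mean token entropy over a generated SQL query (assumed aggregation rule)."""
    total = sum(token_entropy(dist) for dist in token_distributions)
    return total / len(token_distributions)


def accept(token_distributions, threshold):
    """Selective classifier: answer only when uncertainty is at or below the threshold."""
    return sequence_entropy(token_distributions) <= threshold


confident = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]
print(accept(confident, threshold=0.5), accept(uncertain, threshold=0.5))
```

Raising the threshold increases coverage (more queries answered) at the cost of higher risk (more wrong answers let through), which is the trade-off the abstract refers to.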

[NLP-13] Augmenting a Large Language Model with a Combination of Text and Visual Data for Conversational Visualization of Global Geospatial Data

Quick read: This paper addresses the lack of contextual visual information that limits large language models (LLMs) in scientific data visualization tasks, particularly visual data interaction. The proposed method merges a text description of the visualization and dataset with snapshots of the visualization, and extracts their essential features into a highly compact structured text file. This file gives the LLM enough context for accurate question answering without any fine-tuning.

Link: https://arxiv.org/abs/2501.09521
Authors: Omar Mena, Alexandre Kouyoumdjian, Lonni Besançon, Michael Gleicher, Ivan Viola, Anders Ynnerman
Affiliations: King Abdullah University of Science and Technology (KAUST); University of Wisconsin-Madison; Linköping University
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:We present a method for augmenting a Large Language Model (LLM) with a combination of text and visual data to enable accurate question answering in visualization of scientific data, making conversational visualization possible. LLMs struggle with tasks like visual data interaction, as they lack contextual visual information. We address this problem by merging a text description of a visualization and dataset with snapshots of the visualization. We extract their essential features into a structured text file, highly compact, yet descriptive enough to appropriately augment the LLM with contextual information, without any fine-tuning. This approach can be applied to any visualization that is already finally rendered, as long as it is associated with some textual description.

[NLP-14] PIER: A Novel Metric for Evaluating What Matters in Code-Switching ICASSP2025

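Quick read: This paper addresses the problem that general evaluation metrics for Automatic Speech Recognition (ASR), such as Word-Error-Rate (WER), do not accurately reflect performance on code-switching, the alternation of languages within a single discourse. Fine-tuning on non-code-switched data improves classical metrics on code-switching test sets, yet recognition of the actual code-switched words gets worse. The paper therefore proposes Point-of-Interest Error Rate (PIER), a WER variant that considers only specific words of interest. Instantiated on code-switched utterances, PIER describes code-switching performance more accurately, including in difficult cases such as inter-word and intra-word code-switching, and shows large room for improvement in future work.

Link: https://arxiv.org/abs/2501.09512
Authors: Enes Yavuz Ugan, Ngoc-Quan Pham, Leonard Bärmann, Alex Waibel
Affiliations: Interactive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ICASSP 2025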

Abstract:Code-switching, the alternation of languages within a single discourse, presents a significant challenge for Automatic Speech Recognition. Despite the unique nature of the task, performance is commonly measured with established metrics such as Word-Error-Rate (WER). However, in this paper, we question whether these general metrics accurately assess performance on code-switching. Specifically, using both Connectionist-Temporal-Classification and Encoder-Decoder models, we show fine-tuning on non-code-switched data from both matrix and embedded language improves classical metrics on code-switching test sets, although actual code-switched words worsen (as expected). Therefore, we propose Point-of-Interest Error Rate (PIER), a variant of WER that focuses only on specific words of interest. We instantiate PIER on code-switched utterances and show that this more accurately describes the code-switching performance, showing huge room for improvement in future work. This focused evaluation allows for a more precise assessment of model performance, particularly in challenging aspects such as inter-word and intra-word code-switching.
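
The abstract describes PIER only as a WER variant restricted to words of interest, so the sketch below is a simplified approximation and not the paper's definition. It aligns reference and hypothesis with `difflib.SequenceMatcher` (not a minimum-edit-distance alignment), counts a word of interest as an error when it is left unmatched, and ignores insertions. The function name `poi_error_rate` and the example sentence are hypothetical.

```python
from difflib import SequenceMatcher


def poi_error_rate(reference, hypothesis, is_poi):
    """Share of reference words of interest that the hypothesis fails to reproduce.

    reference, hypothesis: lists of words.
    is_poi: function that returns True for words of interest (e.g. code-switched words).
    Simplification: only substitutions and deletions of words of interest count.
    """
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))

    poi_positions = [i for i, word in enumerate(reference) if is_poi(word)]
    if not poi_positions:
        return 0.0
    errors = sum(1 for i in poi_positions if i not in matched)
    return errors / len(poi_positions)


reference = "ich habe das meeting gestern verpasst".split()
hypothesis = "ich habe das mieting gestern verpasst".split()
embedded_words = {"meeting"}  # the code-switched word in this toy utterance

rate = poi_error_rate(reference, hypothesis, lambda w: w in embedded_words)
print(rate)
```

In this toy utterance one word in six is wrong, so plain WER looks moderate, while the rate over the single code-switched word shows that the word of interest was missed entirely. That gap is the motivation the abstract gives for a focused metric.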

[NLP-15] Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators

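Quick read: This paper addresses the insufficiently explored relationship between "inquiry" and "diagnosis" in online medical consultation (OMC). Most existing work focuses on improving diagnostic accuracy when information is relatively sufficient and pays little attention to the inquiry phase. The authors extract patient interaction strategies from real doctor-patient conversations and use them to guide the training of a patient simulator that closely mirrors real behavior; medical records are fed into the simulator to simulate patient responses. Extensive experiments show that inquiry and diagnosis follow Liebig's law: poor inquiry quality limits the effectiveness of diagnosis regardless of diagnostic capability, and vice versa. The paper also divides inquiry into four types (chief complaint inquiry, specification of known symptoms, inquiry about accompanying symptoms, and gathering family or medical history) and analyzes how different models distribute their inquiries across them. The authors plan to open-source the simulator weights and related code.

Link: https://arxiv.org/abs/2501.09484
Authors: Zhaocheng Liu, Quan Tu, Wen Ye, Yu Xiao, Zhishou Zhang, Hengfu Cui, Yalun Zhu, Qiang Ju, Shizheng Li, Jian Xie
Affiliations: Baichuan Inc.; Gaoling School of Artificial Intelligence, Renmin University of China
Subjects: Computation and Language (cs.CL)
Comments: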

Abstract:Online medical consultation (OMC) restricts doctors to gathering patient information solely through inquiries, making the already complex sequential decision-making process of diagnosis even more challenging. Recently, the rapid advancement of large language models has demonstrated a significant potential to transform OMC. However, most studies have primarily focused on improving diagnostic accuracy under conditions of relatively sufficient information, while paying limited attention to the “inquiry” phase of the consultation process. This lack of focus has left the relationship between “inquiry” and “diagnosis” insufficiently explored. In this paper, we first extract real patient interaction strategies from authentic doctor-patient conversations and use these strategies to guide the training of a patient simulator that closely mirrors real-world behavior. By inputting medical records into our patient simulator to simulate patient responses, we conduct extensive experiments to explore the relationship between “inquiry” and “diagnosis” in the consultation process. Experimental results demonstrate that inquiry and diagnosis adhere to the Liebig’s law: poor inquiry quality limits the effectiveness of diagnosis, regardless of diagnostic capability, and vice versa. Furthermore, the experiments reveal significant differences in the inquiry performance of various models. To investigate this phenomenon, we categorize the inquiry process into four types: (1) chief complaint inquiry; (2) specification of known symptoms; (3) inquiry about accompanying symptoms; and (4) gathering family or medical history. We analyze the distribution of inquiries across the four types for different models to explore the reasons behind their significant performance differences. We plan to open-source the weights and related code of our patient simulator at this https URL.

[NLP-16] Scaling Graph-Based Dependency Parsing with Arc Vectorization and Attention-Based Refinement

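Quick read: This paper addresses two main problems in graph-based dependency parsing. First, the conventional two-stage approach, which handles arc scoring and labeling separately, creates an information bottleneck and limits parameter sharing, which hurts scalability. Second, existing methods handle higher-order dependencies inefficiently. The proposed architecture unifies arc scoring and label scoring in a single network, reducing the information bottleneck and increasing parameter sharing. It also uses Transformer layers to simulate higher-order dependencies efficiently, overcoming the limited arc interactions of conventional methods. Experiments show that the model outperforms state-of-the-art parsers in both accuracy and efficiency on the PTB and UD datasets.

Link: https://arxiv.org/abs/2501.09451
Authors: Nicolas Floquet, Joseph Le Roux, Nadi Tomeh, Thierry Charnois
Affiliations: Université Sorbonne Paris Nord; CNRS; Laboratoire d’Informatique de Paris Nord
Categories: Computation and Language (cs.CL)
Notes: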

Abstract:We propose a novel architecture for graph-based dependency parsing that explicitly constructs vectors, from which both arcs and labels are scored. Our method addresses key limitations of the standard two-pipeline approach by unifying arc scoring and labeling into a single network, reducing scalability issues caused by the information bottleneck and lack of parameter sharing. Additionally, our architecture overcomes limited arc interactions with transformer layers to efficiently simulate higher-order dependencies. Experiments on PTB and UD show that our model outperforms state-of-the-art parsers in both accuracy and efficiency.

[NLP-17] Solving the unsolvable: Translating case law in Hong Kong

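Quick read: This paper examines the challenges of translating case law under Hong Kong's bilingual legal system. Although all statutes were successfully translated into Chinese before the 1997 handover, translating case law remains difficult, mainly because the body of judicial decisions is vast and keeps growing. The paper criticizes the sporadic, uncoordinated efforts of the government and the Judiciary to translate case law and contrasts them with the comprehensive approach taken earlier for statutes. The government acknowledges the importance of legal bilingualism but lacks a sustainable strategy for translating case law. The Judiciary holds that translating all judgments is unnecessary, unrealistic, and not cost-effective, a position that affects legal transparency and public trust. The proposed solution uses machine translation through a human-machine interactive translation platform to produce efficient, high-quality translations of case law. The platform has gone through two major transitions: first from a translation system based on a neural model to one based on a large language model for better accuracy, and then from a single-agent system to a multi-agent system with Translator, Annotator, and Proofreader agents. By combining advanced AI with continuous feedback, the multi-agent approach aims to better meet the needs of a bilingual legal system.

Link: https://arxiv.org/abs/2501.09444
Authors: King-kui Sin, Xi Xuan, Chunyu Kit, Clara Ho-yan Chan, Honic Ho-kin Ip
Affiliations: UOW College Hong Kong; City University of Hong Kong; The Chinese University of Hong Kong, Shenzhen; The University of Hong Kong SPACE
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Notes: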

Abstract:This paper addresses the challenges of translating case law under Hong Kong’s bilingual legal system. It highlights the initial success of translating all written statutes into Chinese before the 1997 handover, a task mandated by the Basic Law. The effort involved significant collaboration among legal, linguistic, and translation experts, resulting in a comprehensive and culturally appropriate bilingual legal system. However, translating case law remains a significant challenge due to the sheer volume and continuous growth of judicial decisions. The paper critiques the government’s and judiciary’s sporadic and uncoordinated efforts to translate case law, contrasting them with the thorough approach previously taken for statute translation. Although the government acknowledges the importance of legal bilingualism, it lacks a sustainable strategy for translating case law. The Judiciary’s position that translating all judgments is unnecessary, unrealistic, and not cost-effective is analyzed and critiqued for its impact on legal transparency and public trust. A proposed solution involves leveraging machine translation technology through a human-machine interactive translation platform, which undergoes two major transitions. Initially based on a neural model, the platform transitions to using a large language model for improved translation accuracy. Furthermore, it evolves from a single-agent system to a multi-agent system, incorporating Translator, Annotator, and Proofreader agents. This multi-agent approach, supported by a grant, aims to facilitate efficient, high-quality translation of judicial judgments by integrating advanced artificial intelligence and continuous feedback mechanisms, thus better meeting the needs of a bilingual legal system.

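[NLP-18] A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy

Quick read: This survey addresses the problems large language models (LLMs) face in real-world use: privacy leakage, hallucinated outputs, value misalignment, and malicious use. It covers the risks that arise in four phases: data collection and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. It reviews recent advances in privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defense. Its key contribution is to bring these separate dimensions of responsibility into one unified framework, giving a comprehensive view of how to make LLMs safer and more reliable in practice.

Link: https://arxiv.org/abs/2501.09431
Authors: Huandong Wang, Wenjie Fu, Yingzhou Tang, Zhilong Chen, Yuxi Huang, Jinghua Piao, Chen Gao, Fengli Xu, Tao Jiang, Yong Li
Affiliations: Department of Electronic Engineering, Tsinghua University; Huazhong University of Science and Technology; BNRist, Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Notes: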

Abstract:While large language models (LLMs) present significant potential for supporting numerous real-world applications and delivering positive social impacts, they still face significant challenges in terms of the inherent risk of privacy leakage, hallucinated outputs, and value misalignment, and can be maliciously used for generating toxic content and for unethical purposes after being jailbroken. Therefore, in this survey, we present a comprehensive review of recent advancements aimed at mitigating these issues, organized across the four phases of LLM development and usage: data collecting and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. We elaborate on the recent advances for enhancing the performance of LLMs in terms of privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defenses. In contrast to previous surveys that focus on a single dimension of responsible LLMs, this survey presents a unified framework that encompasses these diverse dimensions, providing a comprehensive view of enhancing LLMs to better serve real-world applications.

[NLP-19] AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling

Quick read: This paper addresses the limitations of traditional in-person psychological counseling, particularly for people who hesitate to seek help because of shame. It proposes an automated cognitive behavioral therapy (CBT) diagnosis and treatment system built on large language models (LLMs) and agent technology. Current LLM-based CBT systems use agents with a fixed structure, which limits their capacity for self-optimization or leads to hollow, unhelpful suggestions caused by redundant response patterns. To address this, the paper builds a general agent framework on Quora-like and YiXinLi single-round consultation models to generate high-quality responses for single-turn counseling. It then adds dynamic routing and supervisory mechanisms to construct a CBT-oriented autonomous multi-agent framework and demonstrates its general applicability. Experiments indicate that AutoCBT provides higher-quality automated psychological counseling.

Link: https://arxiv.org/abs/2501.09426
Authors: Ancheng Xu, Di Yang, Renhao Li, Jingwei Zhu, Minghuan Tan, Min Yang, Wanxin Qiu, Mingchen Ma, Haihong Wu, Bingyu Li, Feng Sha, Chengming Li, Xiping Hu, Qiang Qu, Derek F. Wong, Ruifeng Xu
Affiliations: Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; University of Science and Technology of China; University of Macau; Shenzhen University; Shenzhen MSU-BIT University; Harbin Institute of Technology, Shenzhen
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Traditional in-person psychological counseling remains primarily niche, often chosen by individuals with psychological issues, while online automated counseling offers a potential solution for those hesitant to seek help due to feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and widely used approach in psychological counseling. The advent of large language models (LLMs) and agent technology enables automatic CBT diagnosis and treatment. However, current LLM-based CBT systems use agents with a fixed structure, limiting their self-optimization capabilities, or providing hollow, unhelpful suggestions due to redundant response patterns. In this work, we utilize Quora-like and YiXinLi single-round consultation models to build a general agent framework that generates high-quality responses for single-turn psychological consultation scenarios. We use a bilingual dataset to evaluate the quality of single-response consultations generated by each framework. Then, we incorporate dynamic routing and supervisory mechanisms inspired by real psychological counseling to construct a CBT-oriented autonomous multi-agent framework, demonstrating its general applicability. Experimental results indicate that AutoCBT can provide higher-quality automated psychological counseling services.

[NLP-20] Vision-Language Models Do Not Understand Negation

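Quick read: This paper addresses the weakness of current vision-language models (VLMs) in understanding negation. Although large-scale training has advanced VLMs considerably, their ability to handle negation remains underexplored. The paper introduces NegBench, a benchmark that evaluates negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. NegBench has two core tasks: Retrieval with Negation and Multiple Choice Questions with Negated Captions. The evaluation shows that modern VLMs handle negation poorly, often performing near chance level. As a remedy, the paper takes a data-centric approach: fine-tuning CLIP models on large-scale synthetic datasets containing millions of negated captions can raise recall on negated queries by 10% and accuracy on multiple-choice questions with negated captions by 40%.

Link: https://arxiv.org/abs/2501.09425
Authors: Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: Project page: this https URL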

Abstract:Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.
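
The retrieval task described in the abstract above ("certain objects but not others") can be pictured with a small recall computation. This is only a sketch under stated assumptions: the annotation dictionary, the image ids, and the helper names `relevant_images` and `recall_at_k` are hypothetical and are not taken from NegBench.

```python
from typing import Dict, List, Set


def relevant_images(annotations: Dict[str, Set[str]], include: str, exclude: str) -> Set[str]:
    """Images that contain `include` and do not contain `exclude`."""
    return {
        image_id
        for image_id, objects in annotations.items()
        if include in objects and exclude not in objects
    }


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant images that appear in the top-k results."""
    if not relevant:
        raise ValueError("recall is undefined without relevant items")
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Hypothetical annotations: image id -> objects present in the image.
annotations = {
    "img1": {"dog", "ball"},
    "img2": {"dog"},
    "img3": {"cat"},
    "img4": {"dog", "frisbee"},
}

# Negated query: "a dog but no ball".
relevant = relevant_images(annotations, include="dog", exclude="ball")

# A model that ignores the negation may rank img1 (dog with ball) first.
ranking = ["img1", "img2", "img3", "img4"]
print(recall_at_k(ranking, relevant, k=2))
```

A ranking that ignores the "no ball" part places a non-relevant image at the top, so recall at small k drops even though every top result contains a dog.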

[NLP-21] mGeNTE: A Multilingual Resource for Gender-Neutral Language and Translation

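Quick read: This paper addresses gender bias in language technology, especially for grammatical gender languages, which over-rely on masculine forms for human referents even when gender is unspecified or irrelevant. The bias is especially visible in translation, where it perpetuates gender stereotypes. The paper presents the multilingual mGeNTE dataset, which extends the bilingual GeNTE dataset to the English-Italian, English-German, and English-Spanish language pairs. mGeNTE provides both gendered and neutral sentences in the target languages. It supports research on automatic Gender-Neutral Translation (GNT) and language modelling, helps avoid undue binary gender assumptions, and promotes fairer multilingual and cross-lingual technologies.

Link: https://arxiv.org/abs/2501.09409
Authors: Beatrice Savoldi, Eleonora Cupin, Manjinder Thind, Anne Lauscher, Luisa Bentivogli
Affiliations: Fondazione Bruno Kessler, Italy; DIT Forlì, University of Bologna, Italy; Data Science Group, University of Hamburg
Categories: Computation and Language (cs.CL)
Notes: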

Abstract:Gender-neutral language reflects societal and linguistic shifts towards greater inclusivity by avoiding the implication that one gender is the norm over others. This is particularly relevant for grammatical gender languages, which heavily encode the gender of terms for human referents and over-rely on masculine forms, even when gender is unspecified or irrelevant. Language technologies are known to mirror these inequalities, being affected by a male bias and perpetuating stereotypical associations when translating into languages with extensive gendered morphology. In such cases, gender-neutral language can help avoid undue binary assumptions. However, despite its importance for creating fairer multi- and cross-lingual technologies, inclusive language research remains scarce and insufficiently supported in current resources. To address this gap, we present the multilingual mGeNTE dataset. Derived from the bilingual GeNTE (Piergentili et al., 2023), mGeNTE extends the original corpus to include the English-Italian/German/Spanish language pairs. Since each language pair is English-aligned with gendered and neutral sentences in the target languages, mGeNTE enables research in both automatic Gender-Neutral Translation (GNT) and language modelling for three grammatical gender languages.

[NLP-22] Evaluating LLM Abilities to Understand Tabular Electronic Health Records: A Comprehensive Study of Patient Data Extraction and Retrieval ECIR

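Quick read: This paper addresses the hidden contextual dependencies in Electronic Health Record (EHR) tables and studies how well large language models (LLMs) can extract and retrieve patient data from high-dimensional, sparse data. Experiments explore how prompt structure, instruction, context, and demonstrations affect the task performance of two backbone LLMs, Llama2 and Meditron. Optimal feature selection and serialization methods improve task performance by up to 26.79% over naive approaches, and in-context learning with relevant example selection improves data extraction performance by 5.95%. Based on these findings, the study proposes guidelines for designing LLM-based health search models.

Link: https://arxiv.org/abs/2501.09384
Authors: Jesus Lovon (IRIT-IRIS), Martin Mouysset (IRIT-IRIS), Jo Oleiwan (IRIT-IRIS), Jose G. Moreno (IRIT-IRIS), Christine Damase-Michel, Lynda Tamine (IRIT-IRIS)
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Notes: To be published as full paper in the Proceedings of the European Conference on Information Retrieval (ECIR) 2025. Preprint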

Abstract:Electronic Health Record (EHR) tables pose unique challenges among which is the presence of hidden contextual dependencies between medical features with a high level of data dimensionality and sparsity. This study presents the first investigation into the abilities of LLMs to comprehend EHRs for patient data extraction and retrieval. We conduct extensive experiments using the MIMICSQL dataset to explore the impact of the prompt structure, instruction, context, and demonstration, of two backbone LLMs, Llama2 and Meditron, based on task performance. Through quantitative and qualitative analyses, our findings show that optimal feature selection and serialization methods can enhance task performance by up to 26.79% compared to naive approaches. Similarly, in-context learning setups with relevant example selection improve data extraction performance by 5.95%. Based on our study findings, we propose guidelines that we believe would help the design of LLM-based models to support health search.
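
Feature selection and serialization, as mentioned in the abstract above, can be sketched as turning one table row into a short text record before it is placed in a prompt. The column names, the `name: value` format, and the helper names `serialize_row` and `build_prompt` are assumptions made for illustration; the paper's actual serialization methods are not reproduced here.

```python
from typing import Mapping, Optional, Sequence


def serialize_row(row: Mapping[str, Optional[object]], features: Sequence[str]) -> str:
    """Serialize the selected, non-missing features of one row as 'name: value' pairs."""
    parts = []
    for name in features:
        value = row.get(name)
        if value is None or value == "":
            continue  # sparse tables: skip missing cells instead of printing them
        parts.append(f"{name}: {value}")
    return "; ".join(parts)


def build_prompt(row: Mapping[str, Optional[object]], features: Sequence[str], question: str) -> str:
    """Combine the serialized record and a question into a single prompt string."""
    record = serialize_row(row, features)
    return f"Patient record: {record}\nQuestion: {question}\nAnswer:"


# Hypothetical row; the fields are illustrative, not the MIMICSQL schema.
row = {"age": 64, "gender": "F", "diagnosis": "sepsis", "insurance": None}
print(build_prompt(row, ["age", "diagnosis", "insurance"], "What is the diagnosis?"))
```

Selecting only the relevant columns keeps the prompt short, which is one plausible reason serialization choices affect task performance.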

[NLP-23] ChartInsighter: An Approach for Mitigating Hallucination in Time-series Chart Summary Generation with A Benchmark Dataset

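Quick read: This paper aims to improve the accuracy and semantic richness of automatically generated summaries of time-series data charts, and in particular to reduce hallucination during generation. It introduces ChartInsighter, which assigns multiple agents to generate an initial chart summary and collaborate iteratively, invoking external data analysis modules to extract insights that are then compiled into a coherent summary. The system also applies a self-consistency test to validate and correct the generated summary, which reduces hallucination. The paper further builds a high-quality benchmark of charts and summaries with hallucination types annotated at the sentence level, used to evaluate hallucination reduction. Experiments show that the method outperforms state-of-the-art models in both reducing hallucination and improving summary quality.

Link: https://arxiv.org/abs/2501.09349
Authors: Fen Wang, Bomiao Wang, Xueli Shu, Zhen Liu, Zekai Shao, Chao Liu, Siming Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: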

Abstract:Effective chart summary can significantly reduce the time and effort decision makers spend interpreting charts, enabling precise and efficient communication of data insights. Previous studies have faced challenges in generating accurate and semantically rich summaries of time-series data charts. In this paper, we identify summary elements and common hallucination types in the generation of time-series chart summaries, which serve as our guidelines for automatic generation. We introduce ChartInsighter, which automatically generates chart summaries of time-series data, effectively reducing hallucinations in chart summary generation. Specifically, we assign multiple agents to generate the initial chart summary and collaborate iteratively, during which they invoke external data analysis modules to extract insights and compile them into a coherent summary. Additionally, we implement a self-consistency test method to validate and correct our summary. We create a high-quality benchmark of charts and summaries, with hallucination types annotated on a sentence-by-sentence basis, facilitating the evaluation of the effectiveness of reducing hallucinations. Our evaluations using our benchmark show that our method surpasses state-of-the-art models, and that our summary hallucination rate is the lowest, which effectively reduces various hallucinations and improves summary quality. The benchmark is available at this https URL.

[NLP-24] Algorithm for Semantic Network Generation from Texts of Low Resource Languages Such as Kiswahili

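Quick read: This paper addresses the difficulty of applying machine learning to low-resource languages such as Kiswahili, for which adequate training data is lacking. These languages remain important in everyday communication, and their users need practical tasks such as summarization, disambiguation, and question answering (QA). The key idea is to process such languages with semantic networks, bypassing the need for large amounts of training data. Because low-resource languages such as Kiswahili follow a subject-verb-object (SVO) structure and semantic networks consist of subject-predicate-object triples, SVO part-of-speech tags can be mapped onto semantic network triples. The paper proposes an algorithm that processes raw natural language text and maps it into a semantic network, giving structure to low-resource language text. Tested on a Kiswahili QA task, the algorithm reaches up to 78.6% exact match.

Link: https://arxiv.org/abs/2501.09326
Authors: Barack Wamkaya Wanjawa, Lawrence Muchemi, Evans Miriti
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 18 pages, 3 figures, published in Open Journal for Information Technology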

Abstract:Processing low-resource languages, such as Kiswahili, using machine learning is difficult due to the lack of adequate training data. However, such low-resource languages are still important for human communication and are already in daily use, and users need practical machine processing tasks such as summarization, disambiguation and even question answering (QA). One method of processing such languages, while bypassing the need for training data, is the use of semantic networks. Some low-resource languages, such as Kiswahili, are of the subject-verb-object (SVO) structure, and similarly semantic networks are a triple of subject-predicate-object, hence SVO part-of-speech tags can map into a semantic network triple. An algorithm to process raw natural language text and map it into a semantic network is therefore necessary and desirable in structuring low-resource language texts. This algorithm was tested on the Kiswahili QA task with up to 78.6% exact match.
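
The mapping from SVO part-of-speech tags to a subject-predicate-object triple can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the two-tag scheme (`N` for noun, `V` for verb), the first-noun and first-verb heuristic, and the function name `svo_triple` are all assumptions.

```python
from typing import List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (token, tag) pairs, tags assumed to be "N" or "V"


def svo_triple(tagged: Tagged) -> Optional[Tuple[str, str, str]]:
    """Map a tagged SVO sentence to a (subject, predicate, object) triple.

    Heuristic: the first noun before the first verb is the subject, the first
    verb is the predicate, and the first noun after that verb is the object.
    """
    subject = predicate = obj = None
    for token, tag in tagged:
        if tag == "V" and predicate is None:
            predicate = token
        elif tag == "N":
            if predicate is None:
                if subject is None:
                    subject = token
            elif obj is None:
                obj = token
    if subject and predicate and obj:
        return (subject, predicate, obj)
    return None  # the sentence does not fit the simple SVO pattern


# "Juma anasoma kitabu" (Juma is reading a book), tagged by hand for the example.
sentence = [("Juma", "N"), ("anasoma", "V"), ("kitabu", "N")]
print(svo_triple(sentence))
```

Once sentences are stored as triples, a question such as "who is reading the book?" reduces to looking up the subject slot of a triple whose predicate and object match.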

[NLP-25] Shape-Based Single Object Classification Using Ensemble Method Classifiers

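Quick read: This paper addresses the semantic gap in image classification and proposes a hierarchical classification framework for multi-category image classification. The approach applies a well-known pre-processing and post-processing method to three problems: image segmentation, object identification, and image classification. It classifies single-object images from Amazon and Google datasets and tests four classifiers: BayesNetwork (BN), Random Forest (RF), Bagging, and Vote. Bagging performs best, followed by Random Forest, with classification accuracies ranging from 20% to 99% under 10-fold cross-validation.

Link: https://arxiv.org/abs/2501.09311
Authors: Nur Shazwani Kamarudin, Mokhairi Makhtar, Syadiah Nor Wan Shamsuddin, Syed Abdullah Fadzli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: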

Abstract:Nowadays, more and more images are available. Annotation and retrieval of the images pose classification problems, where each class is defined as the group of database images labelled with a common semantic label. Various systems have been proposed for content-based retrieval, as well as for image classification and indexing. In this paper, a hierarchical classification framework has been proposed for bridging the semantic gap effectively and achieving multi-category image classification. A well-known pre-processing and post-processing method was applied to three problems: image segmentation, object identification and image classification. The method was applied to classify single-object images from Amazon and Google datasets. The classification was tested with four different classifiers: BayesNetwork (BN), Random Forest (RF), Bagging and Vote. The estimated classification accuracies ranged from 20% to 99% (using 10-fold cross-validation). The Bagging classifier presents the best performance, followed by the Random Forest classifier.

[NLP-26] A Study of In-Context-Learning-Based Text-to-SQL Errors

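Quick read: This paper addresses the errors that large language models (LLMs) make on text-to-SQL tasks. LLMs can use in-context learning (ICL) to translate natural language questions into structured query language (SQL), but the approach produces widespread errors and calls for efficient repair. The paper presents the first comprehensive study of text-to-SQL errors, covering four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. It finds that text-to-SQL errors are widespread and summarizes 29 error types in 7 categories. Existing repair attempts improve correctness only slightly, at the cost of high computational overhead and many mis-repairs. Based on these findings, the paper proposes MapleRepair, a novel text-to-SQL error detection and repairing framework. In the evaluation, MapleRepair repairs 13.8% more queries than existing solutions, with negligible mis-repairs and 67.4% less overhead.

Link: https://arxiv.org/abs/2501.09310
Authors: Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu
Affiliations: East China Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Notes: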

Abstract:Large language models (LLMs) have been adopted to perform text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into structured query language (SQL). However, such a technique faces correctness problems and requires efficient repairing solutions. In this paper, we conduct the first comprehensive study of text-to-SQL errors. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 29 error types in 7 categories. We also find that existing repairing attempts have limited correctness improvement at the cost of high computational overhead with many mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleRepair outperforms existing solutions by repairing 13.8% more queries with negligible mis-repairs and 67.4% less overhead.
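
One cheap form of error detection for generated SQL is to compile the query against the target schema without running it on real data. The sketch below does this with Python's standard `sqlite3` module. It is not MapleRepair: it only catches syntax and schema errors (such as an unknown column), not queries that run but answer the wrong question, and the schema and the function name `check_sql` are hypothetical.

```python
import sqlite3
from typing import Tuple


def check_sql(schema: str, query: str) -> Tuple[bool, str]:
    """Return (ok, message) after compiling `query` against an empty in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)        # create the tables, with no rows
        conn.execute("EXPLAIN " + query)  # compile the query without evaluating it
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Hypothetical schema and two generated queries, one with a misspelled column.
schema = "CREATE TABLE singer (name TEXT, age INTEGER);"
good_query = "SELECT name FROM singer WHERE age > 30"
bad_query = "SELECT nme FROM singer WHERE age > 30"

print(check_sql(schema, good_query))
print(check_sql(schema, bad_query))
```

The error message returned for a failing query can be fed back into a repair prompt, which is one simple way a detection step and a repair step could be connected.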

[NLP-27] Understanding Mental Health Content on Social Media and Its Effect Towards Suicidal Ideation

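Quick read: This review addresses how to identify and support individuals with suicidal ideation and how machine learning (ML) and deep learning (DL) can advance suicide prevention. The central approach is to apply these techniques to large volumes of unstructured social media data to detect linguistic patterns, keywords, phrases, tones, and contextual cues associated with suicidal ideation. The paper explores ML and DL models such as support vector machines (SVMs), convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and other neural networks, and assesses how well they interpret complex data patterns and emotional nuances in text. It also discusses the real-world effectiveness, limitations, and ethical issues of these technologies, stressing the need for responsible development and use. By analyzing recent studies, methodologies, tools, and techniques, the paper aims to fill critical knowledge gaps and guide the development of reliable, ethical early-intervention systems.

Link: https://arxiv.org/abs/2501.09309
Authors: Mohaiminul Islam Bhuiyan, Nur Shazwani Kamarudin, Nur Hafieza Ismail
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: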

Abstract:This review underscores the critical need for effective strategies to identify and support individuals with suicidal ideation, exploiting technological innovations in ML and DL to further suicide prevention efforts. The study details the application of these technologies in analyzing vast amounts of unstructured social media data to detect linguistic patterns, keywords, phrases, tones, and contextual cues associated with suicidal thoughts. It explores various ML and DL models like SVMs, CNNs, LSTM, neural networks, and their effectiveness in interpreting complex data patterns and emotional nuances within text data. The review discusses the potential of these technologies to serve as a life-saving tool by identifying at-risk individuals through their digital traces. Furthermore, it evaluates the real-world effectiveness, limitations, and ethical considerations of employing these technologies for suicide prevention, stressing the importance of responsible development and usage. The study aims to fill critical knowledge gaps by analyzing recent studies, methodologies, tools, and techniques in this field. It highlights the importance of synthesizing current literature to inform practical tools and suicide prevention efforts, guiding innovation in reliable, ethical systems for early intervention. This research synthesis evaluates the intersection of technology and mental health, advocating for the ethical and responsible application of ML, DL, and NLP to offer life-saving potential worldwide while addressing challenges like generalizability, biases, privacy, and the need for further research to ensure these technologies do not exacerbate existing inequities and harms.

[NLP-28] Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning

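Quick read: This paper addresses few-shot learning in medical image classification, which is challenging because annotated data is scarce and medical images are complex. It proposes Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), which leverages Large Vision-Language Models (LVLMs) through a two-stage fine-tuning strategy that combines domain-specific pretraining with hierarchical contrastive learning to align visual and textual representations at multiple levels. Evaluated on two benchmark datasets, Chest X-ray and Breast Ultrasound, the method achieves state-of-the-art performance in both few-shot and zero-shot settings and shows robustness, generalizability, and interpretability.

Link: https://arxiv.org/abs/2501.09294
Authors: Harrison Fuller, Fernando Gabriela Garcia, Victor Flores
Affiliations: Autonomous University of Nuevo León
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: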

Abstract:Few-shot learning in medical image classification presents a significant challenge due to the limited availability of annotated data and the complex nature of medical imagery. In this work, we propose Adaptive Vision-Language Fine-tuning with Hierarchical Contrastive Alignment (HiCA), a novel framework that leverages the capabilities of Large Vision-Language Models (LVLMs) for medical image analysis. HiCA introduces a two-stage fine-tuning strategy, combining domain-specific pretraining and hierarchical contrastive learning to align visual and textual representations at multiple levels. We evaluate our approach on two benchmark datasets, Chest X-ray and Breast Ultrasound, achieving state-of-the-art performance in both few-shot and zero-shot settings. Further analyses demonstrate the robustness, generalizability, and interpretability of our method, with substantial improvements in performance compared to existing baselines. Our work highlights the potential of hierarchical contrastive strategies in adapting LVLMs to the unique challenges of medical imaging tasks.

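[NLP-29] To Retrieve or Not to Retrieve? Uncertainty Detection for Dynamic Retrieval Augmented Generation

Quick read: This paper studies how large language models (LLMs) can use external knowledge retrieval (Retrieval-Augmented Generation) to reduce hallucinations efficiently, particularly in long-form question answering. Prior methods typically invoke retrieval deterministically, that is, on every generation, which can be inefficient. The paper examines dynamic retrieval, which invokes retrieval only when the model lacks the required knowledge. It investigates the question of when to retrieve by evaluating multiple uncertainty detection methods under dynamic retrieval. The results show that uncertainty detection metrics such as Degree Matrix Jaccard and Eccentricity can cut the number of retrieval calls almost in half with only a slight reduction in question-answering accuracy.

Link: https://arxiv.org/abs/2501.09292
Authors: Kaustubh D. Dhole
Affiliations: Department of Computer Science, Emory University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes: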

Abstract:Retrieval-Augmented Generation equips large language models with the capability to retrieve external knowledge, thereby mitigating hallucinations by incorporating information beyond the model’s intrinsic abilities. However, most prior works have focused on invoking retrieval deterministically, which makes it unsuitable for tasks such as long-form question answering. Instead, dynamically performing retrieval by invoking it only when the underlying LLM lacks the required knowledge can be more efficient. In this context, we delve deeper into the question, “To Retrieve or Not to Retrieve?” by exploring multiple uncertainty detection methods. We evaluate these methods for the task of long-form question answering, employing dynamic retrieval, and present our comparisons. Our findings suggest that uncertainty detection metrics, such as Degree Matrix Jaccard and Eccentricity, can reduce the number of retrieval calls by almost half, with only a slight reduction in question-answering accuracy.
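
The idea of retrieving only when the model is uncertain can be sketched with a degree-based score over sampled answers. The assumptions are stated up front: several answers are sampled for the same question, pairwise similarity is word-level Jaccard overlap, the score is trace(mI - D) / m^2 with D the degree matrix of the similarity graph (one common formulation of a degree-based uncertainty measure), and the threshold value and function names are hypothetical. The paper's exact setup may differ.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two responses."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def degree_uncertainty(responses: List[str]) -> float:
    """Degree-based uncertainty: 0 when all samples agree, larger when they diverge."""
    m = len(responses)
    if m == 0:
        raise ValueError("need at least one sampled response")
    # Degree of sample i = sum of its similarities to every sample, itself included.
    degrees = [sum(jaccard(r, other) for other in responses) for r in responses]
    return sum(m - d for d in degrees) / (m * m)


def should_retrieve(responses: List[str], threshold: float = 0.5) -> bool:
    """Invoke retrieval only when the sampled answers disagree enough (threshold is assumed)."""
    return degree_uncertainty(responses) > threshold


consistent = ["paris", "paris", "paris"]
divergent = ["paris", "lyon", "marseille"]
print(should_retrieve(consistent), should_retrieve(divergent))
```

The cost of this check is a few extra samples per question, so it pays off only when a retrieval call is more expensive than the additional generation.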

[NLP-30] Perspective Transition of Large Language Models for Solving Subjective Tasks

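Quick read: This paper addresses the limited performance of large language models (LLMs) on subjective tasks. Performance on such tasks often depends on the perspective taken on the problem, yet existing methods usually fix a single perspective (direct, expert, or third-person), which limits adaptability and accuracy across scenarios. The paper proposes Reasoning through Perspective Transition (RPT), a method based on in-context learning that lets LLMs dynamically select among direct, role, and third-person perspectives to solve each subjective problem. In experiments on 12 subjective tasks with closed-source and open-source LLMs (GPT-4, GPT-3.5, Llama-3, and Qwen-2), RPT outperforms widely used single-fixed-perspective methods such as chain-of-thought prompting and expert prompting, showing that LLMs can adapt their perspective to give more nuanced and contextually appropriate responses.

Link: https://arxiv.org/abs/2501.09265
Authors: Xiaolong Wang, Yuanchi Zhang, Ziyue Wang, Yuzhuang Xu, Fuwen Luo, Yile Wang, Peng Li, Yang Liu
Affiliations: Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University; Jiuquan Satellite Launch Center (JSLC); Harbin Institute of Technology; College of Computer Science and Software Engineering, Shenzhen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: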

Abstract:Large language models (LLMs) have revolutionized the field of natural language processing, enabling remarkable progress in various tasks. In contrast to objective tasks such as commonsense reasoning and arithmetic question answering, subjective tasks remain difficult for LLMs, because the perspective taken on a specific problem plays a crucial role in interpreting the context and giving a proper response. For example, in certain scenarios, LLMs may perform better when answering from an expert role perspective, potentially eliciting their relevant domain knowledge. In contrast, in some scenarios, LLMs may provide more accurate responses when answering from a third-person standpoint, enabling a more comprehensive understanding of the problem and potentially mitigating inherent biases. In this paper, we propose Reasoning through Perspective Transition (RPT), a method based on in-context learning that enables LLMs to dynamically select among direct, role, and third-person perspectives for the best way to solve the corresponding subjective problem. Through extensive experiments on a total of 12 subjective tasks using both closed-source and open-source LLMs, including GPT-4, GPT-3.5, Llama-3, and Qwen-2, our method outperforms widely used single-fixed-perspective methods such as chain-of-thought prompting and expert prompting, highlighting the intricate ways that LLMs can adapt their perspectives to provide nuanced and contextually appropriate responses for different problems.

[NLP-31] Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition ICASSP2025

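Quick read: This paper addresses two main problems in using large language models (LLMs) for end-to-end automatic speech recognition (E2E-ASR): (1) LLM inference is computationally costly; (2) the ASR model and the LLM may have a vocabulary mismatch. It proposes a decoding method called "delayed fusion," which applies LLM scores to ASR hypotheses with a delay during decoding. This reduces the number of LLM inference calls and allows ASR hypotheses to be re-tokenized when the ASR model and the LLM use different tokenizations. The paper shows that delayed fusion improves decoding speed and accuracy over shallow fusion and N-best rescoring on the LibriHeavy ASR corpus with three public LLMs (OpenLLaMA 3B, OpenLLaMA 7B, and Mistral 7B).

Link: https://arxiv.org/abs/2501.09258
Authors: Takaaki Hori, Martin Kocour, Adnan Haider, Erik McDermott, Xiaodan Zhuang
Affiliations: Apple; Brno University of Technology
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Accepted to ICASSP2025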

Abstract:This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). Although shallow fusion is the most common approach to incorporate language models into E2E-ASR decoding, we face two practical problems with LLMs. (1) LLM inference is computationally costly. (2) There may be a vocabulary mismatch between the ASR model and the LLM. To resolve this mismatch, we need to retrain the ASR model and/or the LLM, which is at best time-consuming and in many cases not feasible. We propose “delayed fusion,” which applies LLM scores to ASR hypotheses with a delay during decoding and enables easier use of pre-trained LLMs in ASR tasks. This method can reduce not only the number of hypotheses scored by the LLM but also the number of LLM inference calls. It also allows re-tokenization of ASR hypotheses during decoding if ASR and LLM employ different tokenizations. We demonstrate that delayed fusion provides improved decoding speed and accuracy compared to shallow fusion and N-best rescoring using the LibriHeavy ASR corpus and three public LLMs: OpenLLaMA 3B, OpenLLaMA 7B, and Mistral 7B.
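
The saving in LLM calls can be pictured with a deliberately simplified step: prune hypotheses with the cheap ASR score first, then apply the LLM score only to the survivors. This is not the paper's decoder, which applies the delay inside beam search. The log-linear score combination, the beam size, the weight, and the toy scorer below are assumptions for illustration only.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (text, ASR log-probability)


def rescore_survivors(
    hyps: List[Hypothesis],
    llm_logprob: Callable[[str], float],
    beam: int = 2,
    lm_weight: float = 0.5,
) -> List[Hypothesis]:
    """Prune by ASR score, then add a weighted LLM score to the surviving hypotheses."""
    # Step 1: keep only the `beam` best hypotheses according to the ASR model.
    survivors = sorted(hyps, key=lambda h: h[1], reverse=True)[:beam]
    # Step 2: call the (expensive) LLM scorer once per survivor, on plain text,
    # so the LLM is free to apply its own tokenization.
    fused = [(text, asr + lm_weight * llm_logprob(text)) for text, asr in survivors]
    return sorted(fused, key=lambda h: h[1], reverse=True)


llm_calls: List[str] = []


def toy_llm_logprob(text: str) -> float:
    """Stand-in for an LLM scorer; it records each call so the saving is visible."""
    llm_calls.append(text)
    return -1.0 if text == "the cat sat" else -5.0


hypotheses = [("the cat sat", -2.0), ("the cat sad", -1.5), ("a cat sat", -4.0)]
best = rescore_survivors(hypotheses, toy_llm_logprob)
print(best[0][0], len(llm_calls))
```

Because the scorer receives text rather than ASR token ids, the two models do not need to share a vocabulary, which mirrors the re-tokenization point in the abstract.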

[NLP-32] Foundations of Large Language Models

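[Quick read]: This book introduces foundational concepts for students, professionals, and practitioners interested in large language models (LLMs), rather than comprehensively covering all cutting-edge technologies. It is organized into four main chapters, each exploring a key area: pre-training, generative models, prompting techniques, and alignment methods. Together these chapters systematically introduce the basic theory and practice of LLMs, helping readers understand how they work and where they apply.

Link: https://arxiv.org/abs/2501.09223
Authors: Tong Xiao,Jingbo Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: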

Abstract:This is a book about large language models. As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into four main chapters, each exploring a key area: pre-training, generative models, prompting techniques, and alignment methods. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.

[NLP-33] A Simple Graph Contrastive Learning Framework for Short Text Classification AAAI2025

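[Quick read]: This paper targets two main problems in short text classification: semantic sparsity and limited labeled data. Existing models typically rely on explicit data augmentation to generate contrastive views, which can corrupt semantics and introduce noise. They also focus only on the intrinsic consistency between the generated views and ignore valuable discriminative information in other potential views. To address this, the paper proposes a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Its key idea is to perform graph learning on multiple text-related component graphs to obtain multi-view text embeddings and then apply contrastive learning directly to those embeddings. The method generates contrastive views without any data augmentation while still benefiting from multi-view contrastive learning, and it surpasses large language models on multiple datasets.

Link: https://arxiv.org/abs/2501.09219
Authors: Yonghao Liu,Fausto Giunchiglia,Lan Huang,Ximing Li,Xiaoyue Feng,Renchu Guan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI2025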

Abstract:Short text classification has gained significant attention in the information age due to its prevalence and real-world applications. Recent advancements in graph learning combined with contrastive learning have shown promising results in addressing the challenges of semantic sparsity and limited labeled data in short text classification. However, existing models have certain limitations. They rely on explicit data augmentation techniques to generate contrastive views, resulting in semantic corruption and noise. Additionally, these models only focus on learning the intrinsic consistency between the generated views, neglecting valuable discriminative information from other potential views. To address these issues, we propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Our approach involves performing graph learning on multiple text-related component graphs to obtain multi-view text embeddings. Subsequently, we directly apply contrastive learning on these embeddings. Notably, our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning. Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.
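
Contrastive learning applied directly to multi-view embeddings can be sketched as follows. The abstract does not state which loss SimSTC uses, so the InfoNCE objective, the temperature value, and all names below are assumptions. Row `i` of each view is taken to be the embedding of the same short text from a different component graph, so the two rows form a positive pair without any data augmentation.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a: List[Vector], view_b: List[Vector], temperature: float = 0.5) -> float:
    """Mean InfoNCE loss; row i of view_a and row i of view_b are the positive pair."""
    n = len(view_a)
    loss = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss += log_denominator - logits[i]
    return loss / n


# Two toy "views" of the same two texts (for example, from two component graphs).
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_view = [[0.9, 0.1], [0.1, 0.9]]
misaligned_view = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce(view_a, aligned_view) < info_nce(view_a, misaligned_view))
```

The loss is lower when the two views agree on which texts belong together, which is the signal a multi-view contrastive objective trains on.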

[NLP-34] Boosting Short Text Classification with Multi-Source Information Exploration and Dual-Level Contrastive Learning AAAI2025

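[Quick read]: This paper addresses semantic sparsity and insufficient labeled samples in short text classification. The authors propose a new model named MI-DELIGHT. Its key components are: 1) multi-source information exploration (statistical, linguistic, and factual information) to alleviate semantic sparsity; 2) a graph learning approach to learn representations of short texts presented in graph form; 3) a dual-level contrastive learning auxiliary task (instance-level and cluster-level) to capture contrastive information at different granularities in large amounts of unlabeled data; 4) a hierarchical architecture that explicitly models the correlations between tasks, whereas previous models merely run the main and auxiliary tasks in parallel without considering how they relate. Experiments show that MI-DELIGHT significantly outperforms previous competitive models on multiple benchmark datasets and even surpasses popular large language models on some of them.

Link: https://arxiv.org/abs/2501.09214
Authors: Yonghao Liu,Mengyu Li,Wei Pang,Fausto Giunchiglia,Lan Huang,Xiaoyue Feng,Renchu Guan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI2025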

Abstract:Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short text classification in this work. Specifically, it first performs multi-source information (i.e., statistical information, linguistic information, and factual information) exploration to alleviate the sparsity issues. Then, the graph learning approach is adopted to learn the representation of short texts, which are presented in graph forms. Moreover, we introduce a dual-level (i.e., instance-level and cluster-level) contrastive learning auxiliary task to effectively capture different-grained contrastive information within massive unlabeled data. Meanwhile, previous models merely perform the main task and auxiliary tasks in parallel, without considering the relationship among tasks. Therefore, we introduce a hierarchical architecture to explicitly model the correlations between tasks. We conduct extensive experiments across various benchmark datasets, demonstrating that MI-DELIGHT significantly surpasses previous competitive models. It even outperforms popular large language models on several datasets.

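[NLP-35] FineMedLM-o1: Enhancing the Medical Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training

[Quick read]: This paper addresses the limited reasoning ability of existing large language models (LLMs) in complex clinical scenarios, particularly differential diagnosis and personalized treatment suggestions. The key to the solution is FineMedLM-o1, a model trained with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) on high-quality synthetic medical data and long-form reasoning data, which strengthens its dialogue and deep reasoning capabilities. The paper also introduces Test-Time Training (TTT) to the medical domain for the first time, to facilitate domain adaptation and make reasoning more reliable and accurate. Experiments show that FineMedLM-o1 improves average performance on key medical benchmarks by 23%, and that TTT adds a further 14%. To support this process, the paper also proposes a new method for synthesizing medical dialogue, whose dataset exceeds other open-source datasets in both quality and complexity.

Link: https://arxiv.org/abs/2501.09213
Authors: Hongzhou Yu,Tianhao Cheng,Ying Cheng,Rui Feng
Affiliations: School of Computer Science, Fudan University
Categories: Computation and Language (cs.CL)
Comments: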

Abstract:Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the advanced reasoning required for complex clinical scenarios, such as differential diagnosis or personalized treatment suggestions. We proposed FineMedLM-o1, which leverages high-quality synthetic medical data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduced Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also proposed a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.

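[NLP-36] The Veln(ia)s is in the Details: Evaluating LLM Judgment on Latvian and Lithuanian Short Answer Matching

[Quick read]: This paper addresses how to evaluate large language models (LLMs) on short answer matching for Latvian and Lithuanian. The authors introduce new datasets of 502 Latvian and 690 Lithuanian question-answer pairs. The key step is a set of purpose-built text alteration rules that generate matched and non-matched answers, which serve as test cases for whether LLMs can detect subtle differences between the original and the altered answers. The results show that larger LLMs (such as QWEN2.5 72b and LLaMa3.1 70b) are near-perfect at distinguishing matched from non-matched answers, while smaller models show more variance.

Link: https://arxiv.org/abs/2501.09164
Authors: Yevhen Kostiuk,Oxana Vitman,Łukasz Gagała,Artur Kiulian
Affiliations: ARG-tech, University of Dundee; OpenBabylon; University of Bremen; Georg-August Universität Göttingen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: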

Abstract:In this work, we address the challenge of evaluating large language models (LLMs) on the short answer matching task for Latvian and Lithuanian languages. We introduce novel datasets consisting of 502 Latvian and 690 Lithuanian question-answer pairs. For each question-answer pair, we generated matched and non-matched answers using a set of alteration rules specifically designed to introduce small but meaningful changes in the text. These generated answers serve as test cases to assess the ability of LLMs to detect subtle differences in matching of the original answers. A subset of the datasets was manually verified for quality and accuracy. Our results show that while larger LLMs, such as QWEN2.5 72b and LLaMa3.1 70b, demonstrate near-perfect performance in distinguishing matched and non-matched answers, smaller models show more variance. For instance, LLaMa3.1 8b and EuroLLM 9b benefited from few-shot examples, while Mistral Nemo 12b underperformed on detection of subtle text alteration, particularly in Lithuanian, even with additional examples. QWEN2.5 7b and Mistral 7b were able to obtain a strong and comparable performance to the larger 70b models in zero and few shot experiments. Moreover, the performance of Mistral 7b was weaker in few shot experiments.

[NLP-37] Evaluating GenAI for Simplifying Texts for Education: Improving Accuracy and Consistency for Enhanced Readability

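[Quick read]: This paper examines how generative AI can support personalized learning, specifically by lowering the reading difficulty of educational texts to match different students' reading levels while retaining key details. It proposes a systematic method for evaluating the accuracy and consistency of large language models (LLMs), prompting techniques, and a novel multi-agent architecture when simplifying informational reading passages from the twelfth grade level to the eighth, sixth, and fourth grade levels. The key is a set of general evaluation metrics: accuracy in reaching the target grade level, percentage change in word count, and consistency in keeping keywords and key phrases (semantic similarity). One-sample t-tests and multiple regression models reveal significant differences between LLMs and prompting techniques, especially when simplifying content to the fourth grade level. The results indicate that LLMs show promise for automated text simplification, that current models and prompting methods still fall short of balancing the evaluation criteria, and that the proposed evaluation method offers a generalizable framework for improving future systems.

Link: https://arxiv.org/abs/2501.09158
Authors: Stephanie L. Day,Jacapo Cirica,Steven R. Clapp,Veronika Penkova,Amy E. Giroux,Abbey Banta,Catherine Bordeau,Poojitha Mutteneni,Ben D. Sawyer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 64 pages, 9 tables, 6 figures, and supplemental materials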

Abstract:Generative artificial intelligence (GenAI) holds great promise as a tool to support personalized learning. Teachers need tools to efficiently and effectively enhance content readability of educational texts so that they are matched to individual students’ reading levels, while retaining key details. Large Language Models (LLMs) show potential to fill this need, but previous research notes multiple shortcomings in current approaches. In this study, we introduced a generalized approach and metrics for the systematic evaluation of the accuracy and consistency with which LLMs, prompting techniques, and a novel multi-agent architecture simplify sixty informational reading passages, reducing each from the twelfth grade level down to the eighth, sixth, and fourth grade levels. We calculated the degree to which each LLM and prompting technique accurately achieved the targeted grade level for each passage, percentage change in word count, and consistency in maintaining keywords and key phrases (semantic similarity). One-sample t-tests and multiple regression models revealed significant differences in the best performing LLM and prompt technique for each of the four metrics. Both LLMs and prompting techniques demonstrated variable utility in grade level accuracy and consistency of keywords and key phrases when attempting to level content down to the fourth grade reading level. These results demonstrate the promise of the application of LLMs for efficient and precise automated text simplification, the shortcomings of current models and prompting methods in attaining an ideal balance across various evaluation criteria, and a generalizable method to evaluate future systems.
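
Two of the metrics named in the abstract can be sketched in a few lines. This is a simplified illustration, not the study's code: the word tokenizer is a plain regular expression, and exact substring matching stands in for the semantic-similarity measure the authors used for keywords and key phrases. The function names and the example passage are made up for illustration.

```python
import re
from typing import Iterable


def word_count(text: str) -> int:
    """Count word-like tokens (letters, digits, apostrophes)."""
    return len(re.findall(r"[A-Za-z0-9']+", text))


def pct_change_word_count(original: str, simplified: str) -> float:
    """Percentage change in word count from the original to the simplified passage."""
    before = word_count(original)
    return 100.0 * (word_count(simplified) - before) / before


def keyword_retention(keywords: Iterable[str], simplified: str) -> float:
    """Share of keywords or key phrases that still appear verbatim (case-insensitive)."""
    lowered = simplified.lower()
    items = list(keywords)
    kept = sum(1 for keyword in items if keyword.lower() in lowered)
    return kept / len(items)


original = "Photosynthesis converts light energy into chemical energy stored in glucose."
simplified = "Plants use light to make glucose."
print(pct_change_word_count(original, simplified))  # -40.0
print(keyword_retention(["light", "glucose", "chemical energy"], simplified))
```

Here the simplified passage is 40% shorter but drops one of three key phrases, which is the kind of trade-off the evaluation is designed to expose.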

[NLP-38] VCRScore: Image captioning metric based on VL Transformers CLIP and precision-recall

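[Quick read]: This paper addresses the limitations of evaluation metrics in image captioning. Although captioning models have advanced considerably in recent years, the metrics used to measure them (such as BLEU, METEOR, CIDEr, and ROUGE) have barely changed, and these traditional metrics may not adequately reflect the quality of generated captions. The paper therefore proposes a new metric intended to measure more accurately how well a generated caption relates to the image content. The key step is to first build a human-labeled dataset that rates how well captions match image content, and then to use those human scores as the ground truth for the new metric. Compared with existing classical and newer metrics, the new metric performs well and yields useful insights.

Link: https://arxiv.org/abs/2501.09155
Authors: Guillermo Ruiz,Tania Ramírez,Daniela Moctezuma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 28 pages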

Abstract:Image captioning has become an essential Vision Language research task. It is about predicting the most accurate caption given a specific image or video. The research community has achieved impressive results by continuously proposing new models and approaches to improve the overall model’s performance. Nevertheless, despite increasing proposals, the performance metrics used to measure their advances have remained practically untouched through the years. As proof of that, metrics like BLEU, METEOR, CIDEr, and ROUGE are still widely used today, alongside more sophisticated metrics such as BertScore and ClipScore. Hence, it is essential to adjust how we measure the advances, limitations, and scope of new image captioning proposals, as well as to adapt new metrics to these advanced image captioning approaches. This work proposes a new evaluation metric for the image captioning problem. To do that, a human-labeled dataset was first generated to assess the degree to which the captions correlate with the image’s content. Taking these human scores as ground truth, we propose a new metric and compare it with several well-known metrics, from classical to newer ones. Improved results were also found, and interesting insights are presented and discussed.

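[NLP-39] Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History

[Quick read]: This paper evaluates multilingual large language models (LLMs) on Lithuanian and general history knowledge in a multiple-choice question-answering task. Lithuanian national and general history questions are translated into Baltic, Nordic, and other languages (English, Ukrainian, Arabic) to assess knowledge sharing across culturally and historically connected groups. The models tested include GPT-4o, LLaMa3.1 8b and 70b, QWEN2.5 7b and 72b, Mistral Nemo 12b, LLaMa3 8b, Mistral 7b, LLaMa3.2 3b, and Nordic fine-tuned models (GPT-SW3 and LLaMa3 8b). GPT-4o performs best across all language groups, with slightly better results on Baltic and Nordic languages. Larger open-source models such as QWEN2.5 72b and LLaMa3.1 70b perform well but show weaker alignment with Baltic languages. Smaller models show clear gaps in alignment with Baltic languages while doing better on Nordic and other languages. The Nordic fine-tuned models do not surpass the multilingual models, indicating that shared cultural or historical context alone does not guarantee better performance.

Link: https://arxiv.org/abs/2501.09154
Authors: Yevhen Kostiuk,Oxana Vitman,Łukasz Gagała,Artur Kiulian
Affiliations: ARG-tech, University of Dundee; OpenBabylon; University of Bremen; Georg-August Universität Göttingen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: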

Abstract:In this work, we evaluated Lithuanian and general history knowledge of multilingual Large Language Models (LLMs) on a multiple-choice question-answering task. The models were tested on a dataset of Lithuanian national and general history questions translated into Baltic, Nordic, and other languages (English, Ukrainian, Arabic) to assess the knowledge sharing from culturally and historically connected groups. We evaluated GPT-4o, LLaMa3.1 8b and 70b, QWEN2.5 7b and 72b, Mistral Nemo 12b, LLaMa3 8b, Mistral 7b, LLaMa3.2 3b, and Nordic fine-tuned models (GPT-SW3 and LLaMa3 8b). Our results show that GPT-4o consistently outperformed all other models across language groups, with slightly better results for Baltic and Nordic languages. Larger open-source models like QWEN2.5 72b and LLaMa3.1 70b performed well but showed weaker alignment with Baltic languages. Smaller models (Mistral Nemo 12b, LLaMa3.2 3b, QWEN 7B, LLaMa3.1 8B, and LLaMa3 8b) demonstrated gaps with LT-related alignment with Baltic languages while performing better on Nordic and other languages. The Nordic fine-tuned models did not surpass multilingual models, indicating that shared cultural or historical context alone does not guarantee better performance.

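[NLP-40] Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

[Quick read]: This paper addresses the outdated or inaccurate outputs that large language models (LLMs) produce on dynamic, real-time queries because they rely on static training data. Traditional Retrieval-Augmented Generation (RAG) systems improve contextual relevance and timeliness by integrating real-time data retrieval, but they have static workflows and adapt poorly to multistep reasoning and complex task management. The proposed answer is Agentic Retrieval-Augmented Generation (Agentic RAG), which embeds autonomous AI agents in the RAG pipeline and uses agentic design patterns (such as reflection, planning, tool use, and multi-agent collaboration) to manage retrieval strategies dynamically, refine contextual understanding iteratively, and adapt workflows to complex task requirements. This integration gives Agentic RAG systems clear advantages in flexibility, scalability, and context awareness across diverse applications.

Link: https://arxiv.org/abs/2501.09136
Authors: Aditi Singh,Abul Ehtesham,Saket Kumar,Tala Talaei Khoei
Affiliations: Cleveland State University; The Davey Tree Expert Company; The MathWorks Inc; Khoury College of Computer Science, Roux Institute at Northeastern University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: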

Abstract:Large Language Models (LLMs) have revolutionized artificial intelligence (AI) by enabling human-like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real-time queries, resulting in outdated or inaccurate outputs. Retrieval-Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real-time data retrieval to provide contextually relevant and up-to-date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multistep reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns (reflection, planning, tool use, and multiagent collaboration) to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows to meet complex task requirements. This integration enables Agentic RAG systems to deliver unparalleled flexibility, scalability, and context awareness across diverse applications. This survey provides a comprehensive exploration of Agentic RAG, beginning with its foundational principles and the evolution of RAG paradigms. It presents a detailed taxonomy of Agentic RAG architectures, highlights key applications in industries such as healthcare, finance, and education, and examines practical implementation strategies. Additionally, it addresses challenges in scaling these systems, ensuring ethical decision making, and optimizing performance for real-world applications, while providing detailed insights into frameworks and tools for implementing Agentic RAG.

[NLP-41] Benchmarking Robustness of Contrastive Learning Models for Medical Image-Report Retrieval ALT AAAI2025

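[Quick read]: This paper addresses cross-domain retrieval between medical images and clinical reports, in particular how to associate an image with its corresponding report when the data are highly heterogeneous and complex. The approach uses contrastive learning models for cross-domain retrieval and introduces an occlusion retrieval task to evaluate model performance under varying levels of image corruption. The study compares four state-of-the-art contrastive learning models (CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP) and finds that all are sensitive to out-of-distribution data, with performance dropping as occlusion increases. MedCLIP is slightly more robust, but its overall performance remains significantly behind CXR-CLIP and CXR-RePaiR. The results indicate that domain-specific training data is essential for performance and that robustness needs further improvement before medical cross-domain retrieval models can be considered reliable.

Link: https://arxiv.org/abs/2501.09134
Authors: Demetrio Deanda,Yuktha Priya Masupalli,Jeong Yang,Young Lee,Zechun Cao,Gongbo Liang
Affiliations: Texas A&M University-San Antonio
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: This work is accepted to AAAI 2025 Workshop – the 9th International Workshop on Health Intelligence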

Abstract:Medical images and reports offer invaluable insights into patient health. The heterogeneity and complexity of these data hinder effective analysis. To bridge this gap, we investigate contrastive learning models for cross-domain retrieval, which associates medical images with their corresponding clinical reports. This study benchmarks the robustness of four state-of-the-art contrastive learning models: CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP. We introduce an occlusion retrieval task to evaluate model performance under varying levels of image corruption. Our findings reveal that all evaluated models are highly sensitive to out-of-distribution data, as evidenced by the proportional decrease in performance with increasing occlusion levels. While MedCLIP exhibits slightly more robustness, its overall performance remains significantly behind CXR-CLIP and CXR-RePaiR. CLIP, trained on a general-purpose dataset, struggles with medical image-report retrieval, highlighting the importance of domain-specific training data. The evaluation of this work suggests that more effort needs to be spent on improving the robustness of these models. By addressing these limitations, we can develop more reliable cross-domain retrieval models for medical applications.
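
An occlusion retrieval evaluation of this kind can be sketched with two small helpers. The abstract does not say how the occlusion is applied, so the centred rectangular patch, the function names, and the toy similarity matrix below are assumptions. In a real evaluation the similarity scores would come from a model such as CLIP after encoding the occluded images and the reports.

```python
import math
from typing import List

Image = List[List[float]]


def occlude(image: Image, fraction: float) -> Image:
    """Return a copy with a centred patch covering roughly `fraction` of the area set to 0."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    height, width = len(image), len(image[0])
    patch_h = round(height * math.sqrt(fraction))
    patch_w = round(width * math.sqrt(fraction))
    top = (height - patch_h) // 2
    left = (width - patch_w) // 2
    occluded = [row[:] for row in image]  # copy so the input stays unchanged
    for r in range(top, top + patch_h):
        for c in range(left, left + patch_w):
            occluded[r][c] = 0.0
    return occluded


def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """Share of images whose own report (same index) ranks in the top k by similarity."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


image = [[1.0] * 4 for _ in range(4)]
occluded = occlude(image, 0.25)
# Toy image-to-report similarity scores; row i is image i, column j is report j.
toy_similarity = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.8], [0.1, 0.7, 0.6]]
print(recall_at_k(toy_similarity, 1), recall_at_k(toy_similarity, 2))
```

Repeating the retrieval at several occlusion fractions and plotting recall against the fraction gives the robustness curve the benchmark describes.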

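[NLP-42] Multilingual LLMs Struggle to Link Orthography and Semantics in Bilingual Word Processing

[Quick read]: This paper investigates how multilingual large language models (LLMs) process bilingual vocabulary, focusing on semantic understanding and disambiguation of cognates, non-cognates, and interlingual homographs. The key is to evaluate the models on these word types both in isolation and within sentence contexts. The study finds that although some LLMs recognize cognates and non-cognates well in isolation, they perform poorly on interlingual homographs and tend to rely on orthographic similarity rather than semantic understanding. In addition, performance on isolated disambiguation tasks shows no correlation with semantic understanding, and the models adopt different strategies for interlingual homographs in incongruent sentences, lacking a unified mechanism for handling cross-lingual ambiguity.

Link: https://arxiv.org/abs/2501.09127
Authors: Eshaan Tanwar,Gayatri Oke,Tanmoy Chakraborty
Affiliations: Indian Institute of Technology Delhi, Department of Electrical Engineering
Categories: Computation and Language (cs.CL)
Comments: Code available at: this https URL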

Abstract:Bilingual lexical processing is shaped by the complex interplay of phonological, orthographic, and semantic features of two languages within an integrated mental lexicon. In humans, this is evident in the ease with which cognate words - words similar in both orthographic form and meaning (e.g., blind, meaning “sightless” in both English and German) - are processed, compared to the challenges posed by interlingual homographs, which share orthographic form but differ in meaning (e.g., gift, meaning “present” in English but “poison” in German). We investigate how multilingual Large Language Models (LLMs) handle such phenomena, focusing on English-Spanish, English-French, and English-German cognates, non-cognates, and interlingual homographs. Specifically, we evaluate their ability to disambiguate meanings and make semantic judgments, both when these word types are presented in isolation and when they appear within sentence contexts. Our findings reveal that while certain LLMs demonstrate strong performance in recognizing cognates and non-cognates in isolation, they exhibit significant difficulty in disambiguating interlingual homographs, often performing below random baselines. This suggests LLMs tend to rely heavily on orthographic similarities rather than semantic understanding when interpreting interlingual homographs. Further, we find LLMs exhibit difficulty in retrieving word meanings, with performance in isolative disambiguation tasks having no correlation with semantic understanding. Finally, we study how the LLM processes interlingual homographs in incongruent sentences. We find that models opt for different strategies in understanding English and non-English homographs, highlighting a lack of a unified approach to handling cross-lingual ambiguities.
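
The three word types can be told apart by two signals, form and meaning, as the small sketch below shows. It is only an illustration of the definitions in the abstract: the 0.8 similarity threshold, the use of `difflib.SequenceMatcher` as the orthographic measure, the meaning glosses, and the function name are assumptions, and the paper's own stimuli selection may differ.

```python
from difflib import SequenceMatcher


def classify_pair(
    form_a: str,
    form_b: str,
    meaning_a: str,
    meaning_b: str,
    form_threshold: float = 0.8,  # assumed cut-off for "similar orthographic form"
) -> str:
    """Label a cross-language word pair from its orthographic and semantic overlap."""
    similarity = SequenceMatcher(None, form_a.lower(), form_b.lower()).ratio()
    similar_form = similarity >= form_threshold
    same_meaning = meaning_a == meaning_b
    if similar_form and same_meaning:
        return "cognate"
    if similar_form:
        return "interlingual homograph"
    if same_meaning:
        return "non-cognate"
    return "unrelated"


# English vs. German examples; the first two come from the abstract.
print(classify_pair("blind", "blind", "sightless", "sightless"))
print(classify_pair("gift", "gift", "present", "poison"))
print(classify_pair("dog", "Hund", "dog", "dog"))
```

A model that relies on form alone cannot separate the first two cases, which is the failure mode the paper reports for interlingual homographs.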

[NLP-43] Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment

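[Quick read]: This paper addresses how to combine human-coded data with synthetic data generated by large language models (LLMs) in text classification, in order to improve classifier performance and reduce the cost of large-scale annotation. The key is a hybrid approach that combines human-coded data with LLM-generated synthetic data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. The study systematically varies the size, variety, and consistency of the LLM-generated samples, including the effect of different temperature settings. Classifier performance is best at a ratio of 80% synthetic to 20% human-coded data. A lower temperature (0.3) gives more stable improvements but may limit what the model learns from synthetic samples, while higher temperatures (0.7 and above) can make performance more variable or even lower. The paper concludes that combining human-coded and LLM-generated data improves text classification models in assessment, drawing on both the accuracy of human coding and the variety of LLM outputs.

Link: https://arxiv.org/abs/2501.09126
Authors: Conrad Borchers,Danielle R. Thomas,Jionghao Lin,Ralph Abboud,Kenneth R. Koedinger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Manuscript accepted to the Second Workshop on Generative AI for Learning Analytics (GenAI-LA) at LAK25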

Abstract:Large Language Models (LLMs) like GPT-4o can help automate text classification tasks at low cost and scale. However, there are major concerns about the validity and reliability of LLM outputs. By contrast, human coding is generally more reliable but expensive to procure at scale. In this study, we propose a hybrid solution to leverage the strengths of both. We combine human-coded data and synthetic LLM-produced data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. We evaluate our method on a human-coded test set as a validity measure for LLM output quality. In three experiments, we systematically vary LLM-generated samples’ size, variety, and consistency, informed by best practices in LLM tuning. Our findings indicate that augmenting datasets with synthetic samples improves classifier performance, with optimal results achieved at an 80% synthetic to 20% human-coded data ratio. Lower temperature settings of 0.3, corresponding to less variability in LLM generations, produced more stable improvements but also limited model learning from augmented samples. In contrast, higher temperature settings (0.7 and above) introduced greater variability in performance estimates and, at times, lower performance. Hence, LLMs may produce more uniform output that classifiers overfit to earlier or produce more diverse output that runs the risk of deteriorating model performance through information irrelevant to the prediction task. Filtering out inconsistent synthetic samples did not enhance performance. We conclude that integrating human and LLM-generated data to improve text classification models in assessment offers a scalable solution that leverages both the accuracy of human coding and the variety of LLM outputs.
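
Building a training set at a given synthetic-to-human ratio can be sketched as below. This is a minimal illustration, not the authors' pipeline: it assumes all human-coded examples are kept and synthetic ones are sampled to reach the target share, and the function name, the `(text, label)` example format, and the toy data are hypothetical.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (response text, label)


def mix_training_data(
    human: List[Example],
    synthetic: List[Example],
    synthetic_share: float = 0.8,
    seed: int = 0,
) -> List[Example]:
    """Keep every human-coded example and add synthetic ones up to `synthetic_share`."""
    if not 0.0 <= synthetic_share < 1.0:
        raise ValueError("synthetic_share must be in [0, 1)")
    n_synthetic = round(len(human) * synthetic_share / (1.0 - synthetic_share))
    if n_synthetic > len(synthetic):
        raise ValueError("not enough synthetic examples for the requested share")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = list(human) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed


human_data = [(f"human response {i}", i % 2) for i in range(20)]
synthetic_data = [(f"synthetic response {i}", i % 2) for i in range(200)]
training_set = mix_training_data(human_data, synthetic_data)
n_synth = sum(1 for text, _ in training_set if text.startswith("synthetic"))
print(len(training_set), n_synth)  # 100 80
```

With 20 human-coded examples, the 80:20 ratio reported as optimal corresponds to 80 synthetic examples; the sampling temperature used to generate them is a separate knob that this sketch does not model.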

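[NLP-44] SteLLA: A Structured Grading System Using LLMs with RAG

[Quick read]: This paper addresses how to make large language models (LLMs) reliable tools for specific tasks such as automated short answer grading (ASAG). The key is the SteLLA system, which uses Retrieval-Augmented Generation (RAG) to extract structured information from the instructor-provided reference answer and rubric, strengthening LLMs on the ASAG task. Specifically, SteLLA has an LLM perform a structured, question-answering-based evaluation of student answers to produce analytical grades and feedback. Experiments show substantial agreement with the human grader, along with detailed grades and feedback on every knowledge point examined in the question. A qualitative and error analysis of feedback generated by GPT4 finds that GPT4 captures facts well but may infer too much implied meaning from the given text when grading, which offers insight into the use of LLMs in ASAG systems.

Link: https://arxiv.org/abs/2501.09092
Authors: Hefei Qiu,Brian White,Ashley Ding,Reinaldo Costa,Ali Hachem,Wei Ding,Ping Chen
Affiliations: Fitchburg State University; University of Massachusetts Boston; Chantilly High School
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: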

Abstract:Large Language Models (LLMs) have shown strong general capabilities in many applications. However, how to make them reliable tools for some specific tasks such as automated short answer grading (ASAG) remains a challenge. We present SteLLA (Structured Grading System Using LLMs with RAG) in which a) a Retrieval-Augmented Generation (RAG) approach is used to empower LLMs specifically on the ASAG task by extracting structured information from the highly relevant and reliable external knowledge based on the instructor-provided reference answer and rubric, b) an LLM performs a structured and question-answering-based evaluation of student answers to provide analytical grades and feedback. A real-world dataset that contains students’ answers in an exam was collected from a college-level Biology course. Experiments show that our proposed system can achieve substantial agreement with the human grader while providing break-down grades and feedback on all the knowledge points examined in the problem. A qualitative and error analysis of the feedback generated by GPT4 shows that GPT4 is good at capturing facts but may be prone to inferring too much implication from the given text in the grading task, which provides insights into the usage of LLMs in the ASAG system.

[NLP-45] Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language Models through Simulation and Task Decomposition COLING2025

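[Quick read]: This paper addresses the poor performance of large language models (LLMs) on complex Theory of Mind (ToM) tasks. Theory of Mind is the ability to understand and reflect on the mental states of others. Although existing closed-source LLMs approach human performance on some ToM tasks, they still do poorly on complex tasks that involve more structured reasoning. The paper proposes Decompose-ToM, an inference algorithm based on "pretend-play", or Simulation Theory, which recursively simulates user perspectives and decomposes the ToM task into simpler functions (subject identification, question reframing, world model updating, and knowledge availability). The key is that this combination of decomposition and simulation significantly improves performance on higher-order ToM tasks and in conversational settings, without additional model training or complex prompt tuning.

Link: https://arxiv.org/abs/2501.09056
Authors: Sneheel Sarangi,Maha Elgarf,Hanan Salam
Affiliations: NYU Abu Dhabi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to COLING 2025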

Abstract:Theory of Mind (ToM) is the ability to understand and reflect on the mental states of others. Although this capability is crucial for human interaction, testing on Large Language Models (LLMs) reveals that they possess only a rudimentary understanding of it. Although the most capable closed-source LLMs have come close to human performance on some ToM tasks, they still perform poorly on complex variations of the task that involve more structured reasoning. In this work, we utilize the concept of “pretend-play”, or “Simulation Theory”, from cognitive psychology to propose “Decompose-ToM”: an LLM-based inference algorithm that improves model performance on complex ToM tasks. We recursively simulate user perspectives and decompose the ToM task into a simpler set of functions: subject identification, question-reframing, world model updation, and knowledge availability. We test the algorithm on higher-order ToM tasks and a task testing for ToM capabilities in a conversational setting, demonstrating that our approach shows significant improvement across models compared to baseline methods while requiring minimal prompt tuning across tasks and no additional model training.

[NLP-46] Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing

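[Quick read]: This paper addresses the tendency of existing Visual Commonsense Reasoning methods to rely too heavily on knowledge from training memory when interpreting complex visual scenes, rather than using the real object relationships present in the scene. The authors propose G2, a new scene-graph-enhanced generation method for visual commonsense reasoning. The key is to first use image patches and large language models (LLMs) to construct a location-free scene graph, and then to generate answers and explanations based on the scene graph's information. The authors also propose automatic scene graph filtering and selection strategies to absorb valuable scene graph information during training. Extensive experiments show the method to be effective on scene graph construction and on visual commonsense answering and explaining.

Link: https://arxiv.org/abs/2501.09041
Authors: Fan Yuan,Xiaoyuan Fang,Rong Quan,Jing Li,Wei Bi,Xiaogang Xu,Piji Li
Affiliations: College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, 211106, China; College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, the Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education, Nanjing, 211106, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; The Chinese University of Hong Kong, Hong Kong, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: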

Abstract:Visual Commonsense Reasoning, which is regarded as one challenging task to pursue advanced visual scene comprehension, has been used to diagnose the reasoning ability of AI systems. However, reliable reasoning requires a good grasp of the scene’s details. Existing work fails to effectively exploit the real-world object relationship information present within the scene, and instead overly relies on knowledge from training memory. Based on these observations, we propose a novel scene-graph-enhanced visual commonsense reasoning generation method named \textit\textbfG2, which first utilizes the image patches and LLMs to construct a location-free scene graph, and then answer and explain based on the scene graph’s information. We also propose automatic scene graph filtering and selection strategies to absorb valuable scene graph information during training. Extensive experiments are conducted on the tasks and datasets of scene graph constructing and visual commonsense answering and explaining, respectively. Experimental results and ablation analysis demonstrate the effectiveness of our proposed framework.

[NLP-47] Characterization of Political Polarized Users Attacked by Language Toxicity on Twitter

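[Quick Read]: This paper studies the dynamics of language toxicity on social media, in particular how toxic language targets Left, Right, and Center users in political settings such as U.S. presidential elections. By analyzing more than 500 million Twitter posts, it offers a first exploration of toxicity flow among users of different political leanings. The study finds that Left users received far more toxic replies than Right and Center users, which gives a new angle on political division and the formation of echo chambers on social media.

Link: https://arxiv.org/abs/2407.12471
Authors: Wentao Xu
Affiliations: University of Science and Technology of China
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: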

Abstract:Understanding the dynamics of language toxicity on social media is important for us to investigate the propagation of misinformation and the development of echo chambers for political scenarios such as U.S. presidential elections. Recent research has used large-scale data to investigate the dynamics across social media platforms. However, research on the toxicity dynamics is not enough. This study aims to provide a first exploration of the potential language toxicity flow among Left, Right and Center users. Specifically, we aim to examine whether Left users were easier to be attacked by language toxicity. In this study, more than 500M Twitter posts were examined. It was discovered that Left users received much more toxic replies than Right and Center users.

Computer Vision

[CV-0] Distilling Multi-modal Large Language Models for Autonomous Driving

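[Quick Read]: This paper targets safe motion planning for autonomous driving in critical "long-tail" scenarios. Recent end-to-end driving systems use large language models (LLMs) as planners to generalize better to rare events, but running an LLM at test time is computationally expensive. The paper proposes DiMA, which distills information from a multi-modal LLM into a vision-based end-to-end planner, keeping the planner's efficiency while drawing on the LLM's world knowledge. Through a set of specially designed surrogate tasks and a joint training strategy, the scene encoder learns structured representations that are semantically grounded and aligned with the final planning objective. Crucially, the LLM is optional at inference, so planning stays robust without sacrificing efficiency. In experiments, DiMA reduces the vision-based planner's L2 trajectory error by 37% and its collision rate by 80%, reduces trajectory error in long-tail scenarios by 44%, and achieves state-of-the-art performance on the nuScenes planning benchmark.

Link: https://arxiv.org/abs/2501.09757
Authors: Deepti Hegde,Rajeev Yasarla,Hong Cai,Shizhong Han,Apratim Bhattacharyya,Shweta Mahajan,Litian Liu,Risheek Garrepalli,Vishal M. Patel,Fatih Porikli
Affiliations: Johns Hopkins University; Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: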

Abstract:Autonomous driving demands safe motion planning, especially in critical “long-tail” scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in longtail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
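
To make the joint training idea concrete, here is a minimal sketch of a combined objective: a planning loss plus a weighted distillation term that pulls the vision planner's features toward the LLM branch's features. The abstract does not specify DiMA's surrogate tasks or losses, so the mean-squared-error terms, the `distill_weight` parameter, and all function names below are illustrative assumptions, not the authors' formulation.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(pred_traj, gt_traj, student_feat, teacher_feat, distill_weight=0.5):
    """Planning loss plus a weighted feature-distillation term.

    Hypothetical stand-in for a joint training objective: the student
    (vision planner) is trained to predict the trajectory and to match
    the teacher (multi-modal LLM branch) features.
    """
    planning = mse(pred_traj, gt_traj)         # trajectory regression term
    distill = mse(student_feat, teacher_feat)  # teacher-matching term
    return planning + distill_weight * distill
```

At inference only the student path is needed, which is the efficiency argument the abstract makes: the distillation term influences training but adds no test-time cost.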

[CV-1] SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces

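[Quick Read]: This paper tackles portrait relighting, that is, re-rendering the pixels of an image in response to changes in environmental lighting. The key is SynthLight, a diffusion model that frames relighting as a re-rendering problem: a physically-based rendering engine synthesizes a dataset of 3D head assets under varying lighting to simulate the lighting-conditioned transformation. To bridge the gap between synthetic and real images, the paper proposes two strategies: (1) multi-task training that uses real portraits without lighting labels; (2) an inference-time diffusion sampling procedure based on classifier-free guidance that uses the input portrait to better preserve details. The method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject's identity.

Link: https://arxiv.org/abs/2501.09756
Authors: Sumit Chaturvedi,Mengwei Ren,Yannick Hold-Geoffroy,Jingyuan Liu,Julie Dorsey,Zhixin Shu
Affiliations: Yale University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 27 pages, 25 figures, Project Page this https URL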

Abstract:We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels are transformed in response to changes in environmental lighting conditions. Using a physically-based rendering engine, we synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject’s identity. Our quantitative experiments on Light Stage data demonstrate results comparable to state-of-the-art relighting methods. Our qualitative results on in-the-wild images showcase rich and unprecedented illumination effects. Project Page: this https URL
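
For readers unfamiliar with classifier-free guidance, the sketch below shows the standard combination rule: the guided noise prediction extrapolates from the unconditional prediction toward the conditional one. This is the generic textbook form only; how SynthLight conditions on the input portrait is not described in the abstract, and the function name and list-based vectors are assumptions for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Generic classifier-free guidance on per-element noise predictions.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    larger values push further in the conditional direction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

In a real sampler this rule is applied to the denoiser's output tensors at every step; plain lists are used here only to keep the example dependency-free.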

[CV-2] Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

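[Quick Read]: This paper explores scaling of the auto-encoder (visual tokenizer) used in image and video generative models, asking how auto-encoder design choices affect both reconstruction and downstream generation. Scaling Transformer-based generators has driven recent progress, but the tokenizer itself is rarely scaled. The authors replace the usual convolutional backbone with an enhanced Vision Transformer architecture for tokenization (ViTok) and train it on large-scale image and video datasets far exceeding ImageNet-1K, which removes data constraints on tokenizer scaling. Main findings: (1) scaling the auto-encoder bottleneck is highly correlated with reconstruction quality, but its relationship with generation is more complex; (2) scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction with mixed benefits for generation. The resulting lightweight ViTok is competitive with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction (256p and 512p) and outperforms existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. Combined with Diffusion Transformers, it is competitive on ImageNet-1K image generation and sets a new state of the art for class-conditional video generation on UCF-101.

Link: https://arxiv.org/abs/2501.09755
Authors: Philippe Hansen-Estruch,David Yan,Ching-Yao Chung,Orr Zohar,Jialiang Wang,Tingbo Hou,Tao Xu,Sriram Vishwanath,Peter Vajda,Xinlei Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 25 figures, 7 Tables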

Abstract:Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work aims to conduct an exploration of scaling in auto-encoders to fill in this blank. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation – and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explored the effect of separately scaling the auto-encoders’ encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.

[CV-3] Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

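[Quick Read]: This paper addresses translating continuous sign language into spoken language text. The key is to add contextual cues alongside the signing video. Besides visual sign recognition features extracted from the input video, the framework integrates complementary textual information: (i) captions describing the background show, (ii) translations of previous sentences, and (iii) pseudo-glosses transcribing the signing. These cues are extracted automatically and fed with the visual features into a pre-trained large language model (LLM), which is fine-tuned to generate the spoken language translation as text. Extensive ablations show that each input cue contributes positively to translation performance. Trained and evaluated on BOBSL, the largest British Sign Language dataset currently available, the contextual approach significantly improves translation quality over previously reported results and over state-of-the-art methods implemented as baselines. Applying it to How2Sign, an American Sign Language dataset, shows that the approach generalizes and achieves competitive results.

Link: https://arxiv.org/abs/2501.09754
Authors: Youngjoon Jang,Haran Raajesh,Liliane Momeni,Gül Varol,Andrew Zisserman
Affiliations: KAIST, Daejeon, Republic of Korea; CVIT, IIIT Hyderabad, India; Visual Geometry Group, University of Oxford, UK; LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: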

Abstract:Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and inputted along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL – the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.

[CV-4] SRE-Conv: Symmetric Rotation Equivariant Convolution for Biomedical Image Classification

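[Quick Read]: This paper addresses the lack of rotational equivariance in convolutional neural networks (CNNs), a property that matters especially for biomedical images, which often have no explicit orientation. Existing approaches rely mainly on data augmentation or explicit modules to capture orientation information, which raises training cost or only poorly approximates the desired equivariance. The paper proposes an efficient Symmetric Rotation-Equivariant Convolution (SRE-Conv) kernel that learns rotation-invariant features while compressing model size, and that can be incorporated into any CNN backbone. Validated on the public MedMNISTv2 dataset (16 tasks), SRE-Conv-CNN improves rotated image classification accuracy on 2D and 3D images with fewer parameters and a smaller memory footprint.

Link: https://arxiv.org/abs/2501.09753
Authors: Yuexi Du,Jiazhen Zhang,Tal Zeevi,Nicha C. Dvornek,John A. Onofrey
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted by IEEE ISBI 2025 4-page paper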

Abstract:Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, e.g., rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 total tasks). SRE-Conv-CNN demonstrated improved rotated image classification performance accuracy on all 16 test datasets in both 2D and 3D images, all while increasing efficiency with fewer parameters and reduced memory footprint. The code is available at this https URL.
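
One simple way to see how a symmetric kernel can be both rotation-friendly and smaller is to tie together all weights that lie at the same distance from the kernel center. The sketch below builds such a kernel from a short list of per-ring weights. It is an illustration of the general idea under my own assumptions (the ring binning by rounded Euclidean distance and the names `radial_kernel` and `rotate90` are hypothetical); the paper's actual parameterization may differ.

```python
import math


def radial_kernel(ring_weights, size):
    """Build a size x size kernel whose entries depend only on the
    (rounded) distance from the center, so weights are shared per ring."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center")
    center = size // 2
    kernel = []
    for i in range(size):
        row = []
        for j in range(size):
            ring = round(math.hypot(i - center, j - center))
            ring = min(ring, len(ring_weights) - 1)  # clamp the corners
            row.append(ring_weights[ring])
        kernel.append(row)
    return kernel


def rotate90(matrix):
    """Rotate a square matrix by 90 degrees clockwise."""
    return [list(row) for row in zip(*matrix[::-1])]
```

A 5x5 kernel built this way has 3 free parameters instead of 25 and is unchanged by 90-degree rotations, which matches the abstract's pairing of rotation-invariant features with a compressed model size.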

[CV-5] ComplexVAD: Detecting Interaction Anomalies in Video WACV

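[Quick Read]: This paper addresses the absence of complex anomalies, those caused by interactions between objects, in existing video anomaly detection datasets. That gap has pushed research toward simple anomalies. The authors introduce a new large-scale dataset, ComplexVAD, and propose a method that detects complex anomalies by modeling interactions between objects with a scene graph carrying spatio-temporal attributes. Baseline results on ComplexVAD show that the proposed method outperforms two other state-of-the-art video anomaly detection methods. The key to the solution is modeling spatio-temporal interactions between objects through the scene graph.

Link: https://arxiv.org/abs/2501.09733
Authors: Furkan Mumcu,Michael J. Jones,Yasin Yilmaz,Anoop Cherian
Affiliations: University of South Florida; Mitsubishi Electric Research Labs (MERL)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 11 figures, to appear in WACV Workshop ASTAD 2025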

Abstract:Existing video anomaly detection datasets are inadequate for representing complex anomalies that occur due to the interactions between objects. The absence of complex anomalies in previous video anomaly detection datasets affects research by shifting the focus onto simple anomalies. To address this problem, we introduce a new large-scale dataset: ComplexVAD. In addition, we propose a novel method to detect complex anomalies via modeling the interactions between objects using a scene graph with spatio-temporal attributes. With our proposed method and two other state-of-the-art video anomaly detection methods, we obtain baseline scores on ComplexVAD and demonstrate that our new method outperforms existing works.

[CV-6] Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

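[Quick Read]: This paper studies the inference-time scaling behavior of diffusion models, asking how additional inference compute can further improve sample quality. Diffusion models can already adjust inference-time computation through the number of denoising steps, but gains typically flatten after a few dozen steps. The paper instead frames the problem as a search for better noises for the diffusion sampling process. The design space has two axes: the verifiers that provide feedback, and the algorithms that find better noise candidates. Extensive experiments on class-conditioned and text-conditioned image generation benchmarks show that increasing inference-time compute substantially improves sample quality, and that the combination of components can be chosen to suit different application scenarios.

Link: https://arxiv.org/abs/2501.09732
Authors: Nanye Ma,Shangyuan Tong,Haolin Jia,Hexiang Hu,Yu-Chuan Su,Mingda Zhang,Xuan Yang,Yandong Li,Tommi Jaakkola,Xuhui Jia,Saining Xie
Affiliations: Google; MIT; NYU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: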

Abstract:Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models, and with the complicated nature of images, combinations of the components in the framework can be specifically chosen to conform with different application scenario.
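
The two axes in the abstract, a verifier and a search algorithm, can be illustrated with the simplest possible pairing: draw several candidate noises, run the sampler on each, and keep the sample the verifier scores highest. The sketch below is a best-of-N random search under assumed interfaces; `sample_fn` and `verifier` are hypothetical callables standing in for a diffusion sampler and a scoring model, and the paper explores more options than this one.

```python
import random


def best_of_n_noise(sample_fn, verifier, n_candidates, dim, seed=0):
    """Random search over initial noises.

    sample_fn: maps a noise vector to a generated sample (placeholder
               for a full diffusion sampling run).
    verifier:  maps a sample to a score, higher is better.
    Returns the best sample found and its score.
    """
    if n_candidates < 1:
        raise ValueError("need at least one candidate")
    rng = random.Random(seed)
    best_sample, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sample = sample_fn(noise)
        score = verifier(sample)
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score
```

Compute scales linearly with `n_candidates`, since each candidate costs one full sampling run plus one verifier call. With a fixed seed, the best score can only stay the same or improve as more candidates are tried.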

[CV-7] A Simple Aerial Detection Baseline of Multimodal Language Models

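[Quick Read]: This paper addresses applying multimodal language models (MLMs) based on the generative pre-trained Transformer to aerial detection in remote sensing (RS). Aerial detection is a challenging task that requires detecting all objects of multiple categories in an image. Existing RS MLMs have not explored it, mainly because the autoregressive prediction mechanism of MLMs differs significantly from detection outputs.

The key to the solution is a normalization method that converts detection outputs into textual outputs compatible with the MLM framework. The paper also proposes an evaluation method that ensures a fair comparison between MLMs and conventional object detection models. By fine-tuning open-source general-purpose MLMs, the authors build a baseline named LMMRotate and show detection performance comparable to conventional detectors. The baseline is intended as a reference for future MLM development toward more comprehensive understanding of RS images.

Link: https://arxiv.org/abs/2501.09720
Authors: Qingyun Li,Yushi Chen,Xinya Shu,Dong Chen,Xin He,Yi Yu,Xue Yang
Affiliations: Harbin Institute of Technology; Southeast University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4 pages, 1 table, 4 figures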

Abstract:The multimodal language models (MLMs) based on generative pre-trained Transformer are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding that detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at this https URL.
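
To show what turning detection outputs into text can look like, the sketch below serializes one rotated bounding box as a short string with coordinates normalized by image size and quantized to integer bins. The abstract does not give LMMRotate's actual format, so the bin count, the angle convention (degrees modulo 180), the angle-bracket syntax, and the function name are all hypothetical choices made for illustration.

```python
def box_to_text(label, cx, cy, w, h, angle_deg, img_w, img_h, bins=1000):
    """Serialize a rotated box as text with resolution-independent integers.

    Coordinates are divided by the image size, scaled to [0, bins), and
    floored so that the same object yields the same tokens at any resolution.
    """
    def quantize(value, scale):
        return min(bins - 1, max(0, int(value / scale * bins)))

    coords = [
        quantize(cx, img_w),
        quantize(cy, img_h),
        quantize(w, img_w),
        quantize(h, img_h),
    ]
    angle = int(round(angle_deg)) % 180  # assumed angle convention
    fields = "".join(f"<{v}>" for v in coords + [angle])
    return f"{label} {fields}"
```

For example, `box_to_text("plane", 512, 256, 128, 64, 30, 1024, 1024)` gives `plane <500><250><125><62><30>`. Emitting integers rather than raw floats keeps the target sequence short, which suits an autoregressive text decoder.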

[CV-8] FLOL: Fast Baselines for Real-World Low-Light Enhancement

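[Quick Read]: This paper addresses efficiency and robustness in Low-Light Image Enhancement (LLIE). Current deep learning methods struggle in real-world scenarios, such as scenes with noise, saturated pixels, or bad illumination. The paper proposes FLOL+, a lightweight neural network that combines image processing in the frequency and spatial domains. It is one of the fastest models for this task, achieves state-of-the-art results on popular real-scene datasets such as LOL and LSRW, and processes 1080p images in under 12 ms.

Link: https://arxiv.org/abs/2501.09718
Authors: Juan C. Benito,Daniel Feijoo,Alvaro Garcia,Marcos V. Conde
Affiliations: Cidaut AI, Spain; Computer Vision Lab, University of Würzburg
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Technical Report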

Abstract:Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured during night or in dark environments has been well-studied in the image signal processing literature. However, current deep learning-based solutions struggle with efficiency and robustness in real-world scenarios (e.g. scenes with noise, saturated pixels, bad illumination). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our method, FLOL+, is one of the fastest models for this task, achieving state-of-the-art results on popular real scenes datasets such as LOL and LSRW. Moreover, we are able to process 1080p images under 12ms. Code and models at this https URL

[CV-9] Practical Continual Forgetting for Pre-trained Vision Models

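[Quick Read]: This paper addresses continual forgetting in pre-trained vision models: when users or model owners issue erasure requests, often as a sequence over time, unwanted knowledge should be removed while the rest is preserved. It identifies three challenges: (i) unwanted knowledge must be deleted efficiently and effectively; (ii) the forgetting procedure should have minimal impact on remaining knowledge; (iii) in real-world scenarios, training samples may be scarce or partially missing. The paper first proposes Group Sparse LoRA (GS-LoRA), which uses LoRA modules to fine-tune the FFN layers in Transformer blocks independently for each forgetting task, and applies a group sparse regularization that automatically selects specific LoRA groups and zeroes out the others. To handle more practical scenarios, it then proposes GS-LoRA++, which adds prototype information as extra supervision: logits of forgotten classes are moved away from their original prototypes, while logits of remaining classes are pulled toward their respective prototypes. Experiments on face recognition, object detection, and image classification show that the method forgets specific classes with minimal impact on other classes.

Link: https://arxiv.org/abs/2501.09705
Authors: Hongbo Zhao,Fei Zhu,Bolin Ni,Feng Zhu,Gaofeng Meng,Zhaoxiang Zhang
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences; SenseTime Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: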

Abstract:For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Codes have been released on this https URL.
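
The group sparse regularization mentioned above is commonly realized as a group-lasso penalty: the sum of the L2 norms of each parameter group. Because the penalty is non-smooth at zero for a whole group, optimization tends to switch entire groups off. The sketch below shows that penalty on plain lists, with one list per LoRA group. Treating each group as a flat list and the helper names are my assumptions; the abstract does not spell out the exact regularizer.

```python
import math


def group_norm(group):
    """L2 norm of one parameter group (for example, one LoRA module)."""
    return math.sqrt(sum(w * w for w in group))


def group_sparse_penalty(groups):
    """Group-lasso penalty: sum of per-group L2 norms."""
    return sum(group_norm(g) for g in groups)


def active_groups(groups, tol=1e-8):
    """Indices of groups that have not been zeroed out."""
    return [i for i, g in enumerate(groups) if group_norm(g) > tol]
```

In training, this penalty would be added to the forgetting loss with a weighting coefficient, and `active_groups` indicates which modules were actually modified for a given forgetting task.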

[CV-10] Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key

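[Quick Read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), where generated responses do not match the input image or prompt. Direct Preference Optimization (DPO) mitigates the problem by learning from preference pairs that reflect hallucination severity, but its performance varies considerably with how the data is constructed. The paper identifies the crucial factor: whether the constructed data is on-policy with respect to DPO's initial (reference) policy, because learning from off-policy data is impeded by the KL-divergence between the updated policy and the reference policy. It proposes the On-Policy Alignment (OPA)-DPO framework, which uses expert feedback to correct hallucinated responses and aligns both the original and the expert-revised responses in an on-policy manner. With only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B of 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared with the previous state-of-the-art algorithm trained with 16k samples.

Link: https://arxiv.org/abs/2501.09695
Authors: Zhihe Yang,Xufang Luo,Dongqi Han,Yunjian Xu,Dongsheng Li
Affiliations: The Chinese University of Hong Kong; Microsoft Research Asia; The Chinese University of Hong Kong, Shenzhen Research Institute (SZRI)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 15 figures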

Abstract:Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
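
As background for the on-policy argument, here is the standard DPO loss for a single preference pair, written from the usual published formulation rather than from this paper's code. Each log-probability is the summed token log-likelihood of a response under the trained policy or the frozen reference policy; `beta` controls how strongly the policy is kept close to the reference. The function name and default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin compares how much
    the policy has raised the chosen response versus the rejected one,
    both measured relative to the reference policy.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference, the margin is zero and the loss is log 2. If a response has very low probability under the reference policy (off-policy data), raising it requires a large shift away from the reference, which is the obstacle the abstract attributes to the KL term.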

[CV-11] Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

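[Quick Read]: This paper addresses two main challenges in Open-Vocabulary Part Segmentation (OVPS): part-level image-text correspondence is hard to align, and object parts are segmented without structural understanding. The authors propose PartCATSeg, whose key ideas are: (1) an object-aware part-level cost aggregation strategy that handles object-level and part-level costs separately, improving the precision of part segmentation; (2) a compositional loss that better captures part-object relationships and compensates for limited part annotations; (3) structural guidance from DINO features, which improves boundary delineation and inter-part understanding. Experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet show that the method significantly outperforms state-of-the-art approaches and sets a new baseline for robust generalization to unseen part categories.

Link: https://arxiv.org/abs/2501.09688
Authors: Jiho Choi,Seonho Lee,Minhyun Lee,Seungho Lee,Hyunjung Shim
Affiliations: KAIST; Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: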

Abstract:Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
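
Cost aggregation methods typically start from a raw cost volume: the similarity between every image location's embedding and every text embedding. The sketch below computes that starting point with cosine similarity. It covers only the input to aggregation; the disentangled object-level and part-level aggregation itself is not shown, and the function names and toy two-dimensional embeddings are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cost_volume(pixel_feats, text_feats):
    """Matrix of similarities: one row per pixel, one column per text class."""
    return [[cosine(p, t) for t in text_feats] for p in pixel_feats]
```

Under the separation the abstract describes, one such volume would be built against object names and another against part names, with each refined on its own before being combined.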

[CV-12] Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark

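[Quick Read]: This paper addresses shortcomings in how Vision-Language Models (VLMs) are currently evaluated and proposes a more complete and robust alternative. It analyzes existing evaluation techniques, including automated metrics, AI-based assessments, and human evaluations, across diverse tasks. The authors first build Robin, a suite of VLMs that combines Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use it to expose weaknesses of current evaluation approaches across scales. To overcome those weaknesses, they introduce CHIRP, a new long-form response benchmark for more robust and complete VLM evaluation. The Robin training code, model suite, and CHIRP benchmark are openly released to promote reproducibility and advance VLM research.

Link: https://arxiv.org/abs/2501.09672
Authors: Alexis Roger,Prateek Humane,Daniel Z. Kaplan,Kshitij Gupta,Qi Sun,George Adamopoulos,Jonathan Siu Chi Lim,Quentin Anthony,Edwin Fennell,Irina Rish
Affiliations: Mila - Quebec AI Institute; Université de Montréal; realiz.ai; Tokyo Institute of Technology; McGill University; EleutherAI; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: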

Abstract:The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

[CV-13] Unified Face Matching and Physical-Digital Spoofing Attack Detection

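[Quick Read]: This paper addresses the security and efficiency of face recognition systems under physical and digital spoofing attacks. Current research usually treats face recognition and attack detection as separate classification tasks, which requires a separate model for each and adds considerable computational complexity, particularly on resource-limited devices, hurting scalability and performance. The paper proposes a unified model that combines a Swin Transformer backbone with HiLo attention in a convolutional neural network framework to perform face recognition and physical and digital attack detection together. It also introduces augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly improving robustness. Experiments across various datasets show the model is effective for unified face recognition and spoof detection and resilient to unseen physical and digital spoofing attacks.

Link: https://arxiv.org/abs/2501.09635
Authors: Arun Kunwar,Ajita Rattani
Affiliations: Dept. of Computer Science and Engineering, University of North Texas at Denton
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: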

Abstract:Face recognition technology has dramatically transformed the landscape of security, surveillance, and authentication systems, offering a user-friendly and non-invasive biometric solution. However, despite its significant advantages, face recognition systems face increasing threats from physical and digital spoofing attacks. Current research typically treats face recognition and attack detection as distinct classification challenges. This approach necessitates the implementation of separate models for each task, leading to considerable computational complexity, particularly on devices with limited resources. Such inefficiencies can stifle scalability and hinder performance. In response to these challenges, this paper introduces an innovative unified model designed for face recognition and detection of physical and digital attacks. By leveraging the advanced Swin Transformer backbone and incorporating HiLo attention in a convolutional neural network framework, we address unified face recognition and spoof attack detection more effectively. Moreover, we introduce augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly enhancing our model robustness. Through comprehensive experimental evaluation across various datasets, we showcase the effectiveness of our model in unified face recognition and spoof detection. Additionally, we confirm its resilience against unseen physical and digital spoofing attacks, underscoring its potential for real-world applications.

[CV-14] WMamba: Wavelet-based Mamba for Face Forgery Detection

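[Quick Read]: This paper addresses the limited feature extraction and generalization of existing wavelet-based face forgery detectors as deepfake generation advances rapidly. Wavelet analysis reveals facial contours that are slender, fine-grained, and global in nature, but existing methods do not fully exploit these characteristics. The paper proposes WMamba, a wavelet-based feature extractor built on the Mamba architecture, with two key innovations: (1) Dynamic Contour Convolution (DCConv), which uses specially crafted deformable kernels to adaptively model slender facial contours; (2) use of the Mamba architecture to capture long-range spatial relationships with linear computational complexity, which allows fine-grained, global forgery artifacts to be extracted from small image patches. Experiments show that WMamba achieves state-of-the-art performance in face forgery detection.

Link: https://arxiv.org/abs/2501.09617
Authors: Siran Peng,Tianshuo Zhang,Li Gao,Xiangyu Zhu,Haoyuan Zhang,Kai Pang,Zhen Lei
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; China Mobile Communications Company Limited Research Institute; Guangzhou Pixel Solutions Co., Ltd.; CAIR, HKISI, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: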

Abstract:With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.
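
To give a feel for why wavelets highlight contours, the sketch below applies one level of the Haar transform to a 1D signal: the approximation coefficients summarize smooth content, while the detail coefficients are non-zero only where neighboring values differ, that is, at edges. This is the textbook Haar transform on a single row, chosen for brevity; the abstract does not say which wavelet WMamba uses, and the function name is hypothetical.

```python
def haar_1d(signal):
    """One level of the orthonormal Haar wavelet transform.

    Returns (approximation, detail): scaled pairwise sums and pairwise
    differences of adjacent samples.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 2 ** 0.5
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail
```

On a flat region the detail band is exactly zero, and it responds only where the signal jumps. Applying the transform along rows and then columns of an image yields the usual 2D sub-bands, in which thin contours show up as sparse, elongated responses.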

[CV-15] Metric Learning with Progressive Self-Distillation for Audio-Visual Embedding Learning ICASSP2025

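[Quick read]: This paper addresses a weakness of existing metric learning methods for audio-visual embedding learning: they rely heavily on label-guided representation learning, so they underuse the latent complex features and relationships in the distributions of audio and visual data, which hurts embedding performance. The paper proposes a new architecture that combines a cross-modal triplet loss with progressive self-distillation. The key idea is to exploit the inherent data distribution to dynamically refine soft audio-visual alignments, i.e., probabilistic alignments that capture relationships between audio and visual data beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from a subset of each batch and uses this self-distilled knowledge to strengthen representation learning.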

Link: https://arxiv.org/abs/2501.09608
Authors: Donghuo Zeng, Kazushi Ikeda
Affiliations: KDDI Research, Inc.
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 3 figures, 2 tables. Accepted by ICASSP 2025

Abstract:Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments – probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used t
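
As a rough illustration of the two ingredients named above, the sketch below implements a standard margin-based triplet loss with an audio anchor and visual positive/negative, plus a softmax over negative distances as one possible form of "soft alignment". The function names, the Euclidean distance, the `margin` and `temperature` parameters, and the softmax formulation are assumptions for illustration; the paper's actual loss and distillation schedule are not specified in this listing.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cross_modal_triplet_loss(anchor_audio, positive_visual, negative_visual, margin=0.2):
    """Hinge loss that pushes the matching visual embedding closer to the
    audio anchor than the non-matching one by at least `margin`."""
    d_pos = euclidean(anchor_audio, positive_visual)
    d_neg = euclidean(anchor_audio, negative_visual)
    return max(0.0, d_pos - d_neg + margin)


def soft_alignment(audio, visuals, temperature=1.0):
    """Probabilistic alignment of one audio embedding to several visual
    embeddings: a softmax over negative distances (hypothetical choice)."""
    logits = [-euclidean(audio, v) / temperature for v in visuals]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

In a self-distillation setup, probabilities like those from `soft_alignment` could serve as soft targets in place of hard label matches; how the authors schedule this progressively is not described here.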

[CV-16] Mesh2SLAM in VR: A Fast Geometry-Based SLAM Framework for Rapid Prototyping in Virtual Reality Applications

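[Quick read]: This paper addresses two main challenges of running SLAM (Simultaneous Localization and Mapping) simulations on resource-constrained devices such as VR head-mounted displays (VR HMDs): high computational cost and restricted access to sensor data. It proposes a sparse framework that uses mesh geometry projections as features, which improves computational efficiency and removes the need for direct sensor data access. The approach is validated through VR experiments and numerical evaluation.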

Link: https://arxiv.org/abs/2501.09600
Authors: Carlos Augusto Pinheiro de Sousa, Heiko Hamann, Oliver Deussen
Affiliations: University of Konstanz
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:SLAM is a foundational technique with broad applications in robotics and AR/VR. SLAM simulations evaluate new concepts, but testing on resource-constrained devices, such as VR HMDs, faces challenges: high computational cost and restricted sensor data access. This work proposes a sparse framework using mesh geometry projections as features, which improves efficiency and circumvents direct sensor data access, advancing SLAM research as we demonstrate in VR and through numerical evaluation.

[CV-17] Sequential PatchCore: Anomaly Detection for Surface Inspection using Synthetic Impurities

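[Quick read]: This paper addresses the negative effect of surface impurities (e.g., water stains, fingerprints, stickers) on automated visual inspection systems. Existing synthetic data generation techniques focus on perfect samples and defects and ignore impurities, so models perform poorly in practice. The study proposes a procedural method that adds photorealistic water stains to synthetic data, generates synthetic datasets that correspond to real datasets, and uses them to train an anomaly detection model and study the influence of water stains. To handle the memory bottleneck that high-resolution images cause during anomaly detection training, the paper proposes Sequential PatchCore, which builds the coreset sequentially so that training on large images becomes feasible on consumer-grade hardware and transfer learning across dataset versions is possible. The results show clear benefits from pre-training an explicit coreset anomaly model on synthetic data, with further gains from fine-tuning on real data. The study also shows how impurities and labelling ambiguity lower model performance, and it reports defect-wise recall to give an industrially relevant view of performance.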

Link: https://arxiv.org/abs/2501.09579
Authors: Runzhou Mao, Juraj Fulir, Christoph Garth, Petra Gospodnetić
Affiliations: Fraunhofer ITWM; RPTU Kaiserslautern-Landau
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:The appearance of surface impurities (e.g., water stains, fingerprints, stickers) is an often-mentioned issue that causes degradation of automated visual inspection systems. At the same time, synthetic data generation techniques for visual surface inspection have focused primarily on generating perfect examples and defects, disregarding impurities. This study highlights the importance of considering impurities when generating synthetic data. We introduce a procedural method to include photorealistic water stains in synthetic data. The synthetic datasets are generated to correspond to real datasets and are further used to train an anomaly detection model and investigate the influence of water stains. The high-resolution images used for surface inspection lead to memory bottlenecks during anomaly detection training. To address this, we introduce Sequential PatchCore - a method to build coresets sequentially and make training on large images using consumer-grade hardware tractable. This allows us to perform transfer learning using coresets pre-trained on different dataset versions. Our results show the benefits of using synthetic data for pre-training an explicit coreset anomaly model and the extended performance benefits of finetuning the coreset using real data. We observed how the impurities and labelling ambiguity lower the model performance and have additionally reported the defect-wise recall to provide an industrially relevant perspective on model performance.
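
One way to picture "building coresets sequentially" is to process patch features chunk by chunk and re-run a greedy farthest-point (k-center) selection on the current coreset plus the new chunk, so that only `budget` features are ever kept in memory between chunks. The sketch below follows that reading; it is an assumption about the general idea, not the authors' algorithm, and `greedy_coreset`, `sequential_coreset`, and `budget` are hypothetical names.

```python
import math


def greedy_coreset(features, budget):
    """Greedy k-center selection: repeatedly add the feature that is
    farthest from everything selected so far, up to `budget` items."""
    if not features:
        return []
    selected = [features[0]]
    # Distance from every feature to its nearest selected feature.
    min_dist = [math.dist(f, selected[0]) for f in features]
    while len(selected) < min(budget, len(features)):
        farthest = max(range(len(features)), key=lambda i: min_dist[i])
        if min_dist[farthest] == 0.0:
            break  # everything left duplicates an already selected feature
        selected.append(features[farthest])
        min_dist = [
            min(d, math.dist(f, features[farthest]))
            for d, f in zip(min_dist, features)
        ]
    return selected


def sequential_coreset(chunks, budget):
    """Fold feature chunks into a bounded-size coreset one chunk at a time."""
    coreset = []
    for chunk in chunks:
        coreset = greedy_coreset(coreset + list(chunk), budget)
    return coreset
```

Because the result of one run is itself a valid starting coreset, the same function could be seeded with a coreset built on synthetic data and then continued on real data, which matches the pre-train then fine-tune workflow the abstract describes.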

[CV-18] A New Teacher-Reviewer-Student Framework for Semi-supervised 2D Human Pose Estimation

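[Quick read]: This paper addresses the heavy annotation requirement of conventional 2D human pose estimation, where labels are time-consuming and expensive to obtain. It proposes a semi-supervised 2D human pose estimation method that uses a large amount of unlabeled data together with a small amount of labeled data. The core of the solution is a new Teacher-Reviewer-Student framework that mimics how people consolidate learning by reviewing earlier knowledge: the teacher predicts results to guide the student, and the reviewer stores important historical parameters to provide additional supervision signals. The paper also introduces a Multi-level Feature Learning strategy, which uses outputs from different stages of the backbone to estimate heatmaps, enriching supervision and capturing relationships between keypoints. Finally, it designs a data augmentation strategy, Keypoint-Mix, which perturbs pose information by mixing different keypoints and thereby improves the network's ability to discern keypoints. Experiments on public datasets show significant improvements over existing methods.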

Link: https://arxiv.org/abs/2501.09565
Authors: Wulian Yun, Mengshi Qi, Fei Peng, Huadong Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conventional 2D human pose estimation methods typically require extensive labeled annotations, which are both labor-intensive and expensive. In contrast, semi-supervised 2D human pose estimation can alleviate the above problems by leveraging a large amount of unlabeled data along with a small portion of labeled data. Existing semi-supervised 2D human pose estimation methods update the network through backpropagation, ignoring crucial historical information from the previous training process. Therefore, we propose a novel semi-supervised 2D human pose estimation method by utilizing a newly designed Teacher-Reviewer-Student framework. Specifically, we first mimic the phenomenon that human beings constantly review previous knowledge for consolidation to design our framework, in which the teacher predicts results to guide the student’s learning and the reviewer stores important historical parameters to provide additional supervision signals. Secondly, we introduce a Multi-level Feature Learning strategy, which utilizes the outputs from different stages of the backbone to estimate the heatmap to guide network training, enriching the supervisory information while effectively capturing keypoint relationships. Finally, we design a data augmentation strategy, i.e., Keypoint-Mix, to perturb pose information by mixing different keypoints, thus enhancing the network’s ability to discern keypoints. Extensive experiments on publicly available datasets, demonstrate our method achieves significant improvements compared to the existing methods.

[CV-19] Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis

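[Quick read]: This paper addresses the dependence of surgical workflow analysis on large annotated datasets, which brings high cost, poor scalability, and reliance on expert annotation. The authors propose Surg-FTDA (Few-shot Text-driven Adaptation), which handles a range of surgical workflow analysis tasks with only a small amount of paired image-label data. The solution has two key parts. First, few-shot selection-based modality alignment selects a small set of images and aligns their embeddings with the text embeddings of the downstream task, bridging the modality gap. Second, text-driven adaptation trains a decoder on text data only, which avoids the need for paired image-text data; the decoder is then applied to the aligned image embeddings to perform image-related tasks without explicit image-text pairs. Experiments show that Surg-FTDA outperforms baselines on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition) and generalizes well.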

Link: https://arxiv.org/abs/2501.09555
Authors: Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Affiliations: ICube, University of Strasbourg, CNRS, Strasbourg, France; IHU Strasbourg, Strasbourg, France; CAMP, Technische Universität München, Munich, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, Few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, Text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach to generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released in this https URL.

[CV-20] Exploring AI-based System Design for Pixel-level Protected Health Information Detection in Medical Images

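[Quick read]: This paper addresses the challenge of detecting Protected Health Information (PHI) during de-identification of medical images. PHI can appear in image metadata or be imprinted in the image pixels, and detecting it accurately is a key step in protecting privacy when data is shared. Existing AI-based solutions have not been evaluated thoroughly, which hinders the development of reliable and robust tools. The paper presents an AI-based PHI detection pipeline with three key components: text detection, text extraction, and PHI content analysis. By swapping the roles of vision and language models within the pipeline, the authors evaluate different configurations and recommend the model combination best suited to PHI detection.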

Link: https://arxiv.org/abs/2501.09552
Authors: Tuan Truong, Ivo M. Baltruschat, Mark Klemens, Grit Werner, Matthias Lenga
Affiliations: Bayer AG, Berlin, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In progress

Abstract:De-identification of medical images is a critical step to ensure privacy during data sharing in research and clinical settings. The initial step in this process involves detecting Protected Health Information (PHI), which can be found in image metadata or imprinted within image pixels. Despite the importance of such systems, there has been limited evaluation of existing AI-based solutions, creating barriers to the development of reliable and robust tools. In this study, we present an AI-based pipeline for PHI detection, comprising three key components: text detection, text extraction, and analysis of PHI content in medical images. By experimenting with exchanging roles of vision and language models within the pipeline, we evaluate the performance and recommend the best setup for the PHI detection task.

[CV-21] AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture

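[Quick read]: This paper addresses the redundant visual tokens that vision-language models (VLMs) produce when processing high-resolution images, which significantly reduce efficiency. It proposes a self-adaptive cross-modality attention mixture mechanism that, in the pre-LLM layers, dynamically combines visual saliency with text-to-image similarity to select the most informative visual tokens. This improves VLM efficiency without any extra training cost and achieves state-of-the-art training-free acceleration, especially at high reduction rates.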

Link: https://arxiv.org/abs/2501.09532
Authors: Jiayi Han, Liang Du, Yiwen Wu, Xiangguo Zhou, Hongwei Du, Weibo Zheng
Affiliations: Inspur Genersoft Co. Ltd., Inspur Group Co. Ltd.; Shandong Key Laboratory of Automated Complex Network Software Construction; Interactive Entertainment Group, Tencent Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures

Abstract:The success of VLMs often relies on the dynamic high-resolution schema that adaptively augments the input images to multiple crops, so that the details of the images can be retained. However, such approaches result in a large number of redundant visual tokens, thus significantly reducing the efficiency of the VLMs. To improve the VLMs’ efficiency without introducing extra training costs, many research works are proposed to reduce the visual tokens by filtering the uninformative visual tokens or aggregating their information. Some approaches propose to reduce the visual tokens according to the self-attention of VLMs, which are biased, to result in inaccurate responses. The token reduction approaches solely rely on visual cues are text-agnostic, and fail to focus on the areas that are most relevant to the question, especially when the queried objects are non-salient to the image. In this work, we first conduct experiments to show that the original text embeddings are aligned with the visual tokens, without bias on the tailed visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages the effectiveness of visual saliency and text-to-image similarity in the pre-LLM layers to select the visual tokens that are informative. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
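
The token-selection step can be sketched as scoring each visual token by a mix of a saliency score and a text-similarity score and keeping the top `keep` tokens. In the sketch below the mixing weight `alpha` is a fixed, hypothetical parameter; the paper describes the mixture as self-adaptive, and how it adapts is not given in this listing. The min-max normalisation is likewise an assumption made so the two score types are comparable.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_visual_tokens(saliency, text_similarity, keep, alpha=0.5):
    """Return the indices (in original order) of the `keep` tokens with the
    highest mixed score alpha * saliency + (1 - alpha) * text similarity."""
    s = min_max_normalize(saliency)
    t = min_max_normalize(text_similarity)
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(s, t)]
    ranked = sorted(range(len(mixed)), key=lambda i: mixed[i], reverse=True)
    return sorted(ranked[:keep])
```

Setting `alpha=1.0` reduces this to a purely visual, text-agnostic filter, and `alpha=0.0` to a purely text-driven one, which are the two extremes the abstract criticises when used alone.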

[CV-22] HydraMix: Multi-Image Feature Mixing for Small Data Image Classification

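[Quick read]: This paper addresses the fact that training deep neural networks requires large annotated datasets, which are costly to collect and label and raise legal and privacy problems. These factors are a significant limitation in many real-world applications. The paper proposes HydraMix, a new architecture that generates new image compositions by mixing several different images from the same class. The key to HydraMix is a segmentation-based mixing mask that guides the fusion of content from multiple images in feature space, optimized with a combination of unsupervised and adversarial training. This data augmentation scheme allows models to be trained from scratch on very small datasets. Experiments show that HydraMix outperforms existing state-of-the-art methods for image classification on small datasets.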

Link: https://arxiv.org/abs/2501.09504
Authors: Christoph Reinders, Frederik Schubert, Bodo Rosenhahn
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Training deep neural networks requires datasets with a large number of annotated examples. The collection and annotation of these datasets is not only extremely expensive but also faces legal and privacy problems. These factors are a significant limitation for many real-world applications. To address this, we introduce HydraMix, a novel architecture that generates new image compositions by mixing multiple different images from the same class. HydraMix learns the fusion of the content of various images guided by a segmentation-based mixing mask in feature space and is optimized via a combination of unsupervised and adversarial training. Our data augmentation scheme allows the creation of models trained from scratch on very small datasets. We conduct extensive experiments on ciFAIR-10, STL-10, and ciFAIR-100. Additionally, we introduce a novel text-image metric to assess the generality of the augmented datasets. Our results show that HydraMix outperforms existing state-of-the-art methods for image classification on small datasets.

[CV-23] AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation

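[Quick read]: This paper addresses the difficulty of generating high-fidelity personalized images with large-scale generative models, especially when multiple specific subjects are involved. Existing methods often fail to keep every subject faithful at the same time. The proposed solution, AnyStory, handles this in an "encode-then-route" manner. First, it uses a general and powerful image encoder, ReferenceNet, together with the CLIP vision encoder to encode subject features with high fidelity. Second, it uses a decoupled instance-aware subject router that accurately perceives and predicts the location of each subject in the latent space and guides the injection of subject conditions. The method performs well at retaining subject details, aligning with text descriptions, and personalizing multiple subjects.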

Link: https://arxiv.org/abs/2501.09503
Authors: Junjie He, Yuxiang Tuo, Binghui Chen, Chongyang Zhong, Yifeng Geng, Liefeng Bo
Affiliations: Institute for Intelligent Computing, Alibaba Tongyi Lab; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report; Project page: this https URL

Abstract:Recently, large-scale generative models have demonstrated outstanding text-to-image generation capabilities. However, generating high-fidelity personalized images with specific subjects still presents challenges, especially in cases involving multiple subjects. In this paper, we propose AnyStory, a unified approach for personalized subject generation. AnyStory not only achieves high-fidelity personalization for single subjects, but also for multiple subjects, without sacrificing subject fidelity. Specifically, AnyStory models the subject personalization problem in an “encode-then-route” manner. In the encoding step, AnyStory utilizes a universal and powerful image encoder, i.e., ReferenceNet, in conjunction with CLIP vision encoder to achieve high-fidelity encoding of subject features. In the routing step, AnyStory utilizes a decoupled instance-aware subject router to accurately perceive and predict the potential location of the corresponding subject in the latent space, and guide the injection of subject conditions. Detailed experimental results demonstrate the excellent performance of our method in retaining subject details, aligning text descriptions, and personalizing for multiple subjects. The project page is at this https URL .

[CV-24] Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis

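[Quick read]: This paper addresses two main problems in multimodal emotion analysis: existing video multimodal large language models (MLLMs) struggle to integrate audio effectively and to recognize subtle facial micro-expressions, and the lack of detailed emotion analysis datasets limits progress in the field. The paper offers two key contributions. First, it introduces a self-reviewed dataset and a human-reviewed dataset, containing 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively, to help models learn from diverse scenarios and generalize better to real applications. Second, it explicitly integrates facial encoding models into an existing advanced video MLLM, so that the MLLM can unify audio and subtle facial cues for emotion understanding. By aligning these features in a unified space and applying instruction tuning on the proposed datasets, Omni-Emotion achieves state-of-the-art performance on emotion recognition and reasoning tasks.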

Link: https://arxiv.org/abs/2501.09502
Authors: Qize Yang, Detao Bai, Yi-Xing Peng, Xihan Wei
Affiliations: Tongyi Lab, Alibaba Group; Sun Yat-sen University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding emotions accurately is essential for fields like human-computer interaction. Due to the complexity of emotions and their multi-modal nature (e.g., emotions are influenced by facial expressions and audio), researchers have turned to using multi-modal models to understand human emotions rather than single-modality. However, current video multi-modal large language models (MLLMs) encounter difficulties in effectively integrating audio and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion analysis datasets also limits the development of multimodal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios and better generalize to real-world applications. Moreover, in addition to the audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM, enabling the MLLM to effectively unify audio and the subtle facial cues for emotion understanding. By aligning these features within a unified space and employing instruction tuning in our proposed datasets, our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.

[CV-25] VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization

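[Quick read]: This paper addresses two main problems in video colorization: color bleeding and the lack of comprehensive control, especially under complex motion or diverse semantic cues. Existing video colorization methods often struggle to maintain temporal consistency and structural integrity. The authors propose VanGogh, a unified multimodal diffusion-based framework. Its key innovations are: 1) a Dual Qformer that aligns and fuses multimodal features; 2) a depth-guided generation process and an optical flow loss that reduce color bleeding; 3) a color injection strategy and luma channel replacement that improve generalization and reduce flickering artifacts. This design gives users both global and local control over generation and yields higher-quality colorized videos.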

Link: https://arxiv.org/abs/2501.09499
Authors: Zixun Fang, Zhiheng Liu, Kai Zhu, Yu Liu, Ka Leong Cheng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Affiliations: USTC; HKU; HKUST; Independent Researcher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations, and user studies, demonstrate that VanGogh achieves superior temporal consistency and color this http URL page: this https URL.
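
Luma channel replacement can be illustrated as follows: convert each generated color pixel to a luma/chroma representation, swap its luma for the original grayscale value, and convert back, so brightness comes from the input frame and only chroma comes from the model. The sketch assumes RGB values in [0, 1] and the BT.601 luma weights; the abstract does not say which color space VanGogh uses, and the function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601-style luma and (unoffset) chroma differences for RGB in [0, 1]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, (b - y) / 1.772, (r - y) / 1.402


def ycbcr_to_rgb(y, cb, cr):
    """Inverse of rgb_to_ycbcr, clamped to the valid [0, 1] range."""
    r = y + 1.402 * cr
    b = y + 1.772 * cb
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return tuple(min(1.0, max(0.0, c)) for c in (r, g, b))


def replace_luma_pixel(colorized_rgb, gray_value):
    """Keep the generated chroma but take luma from the grayscale input."""
    _, cb, cr = rgb_to_ycbcr(*colorized_rgb)
    return ycbcr_to_rgb(gray_value, cb, cr)


def replace_luma_frame(colorized_frame, gray_frame):
    """Apply luma replacement to every pixel of a frame (lists of rows)."""
    return [
        [replace_luma_pixel(pixel, gray) for pixel, gray in zip(color_row, gray_row)]
        for color_row, gray_row in zip(colorized_frame, gray_frame)
    ]
```

Because the luma of consecutive output frames then follows the input video exactly (up to clamping), frame-to-frame brightness flicker introduced by the generator is suppressed, which is consistent with the flicker-reduction motivation stated in the abstract.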

[CV-26] Comparison of Various SLAM Systems for Mobile Robot in an Indoor Environment

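[Quick read]: This paper compares and analyzes the trajectories that several ROS (Robot Operating System) based SLAM (Simultaneous Localization and Mapping) systems compute for a mobile robot using different sensors: a 2D lidar, a monocular camera, and a ZED stereo camera. The authors built a mobile robot prototype with these common sensors, ran experiments in a typical office environment, collected data from all sensors, and ran each SLAM system on the recorded dataset. The systems compared are: (a) 2D lidar-based: GMapping, Hector SLAM, and Cartographer; (b) monocular camera-based: LSD SLAM, ORB SLAM, and DSO; (c) stereo camera-based: ZEDfu, RTAB Map, ORB SLAM, and S-PTAM. Because all methods were tested on the same dataset and compared with appropriate metrics, the study can report encouraging results for lidar-based Cartographer, monocular ORB SLAM, and stereo RTAB Map.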

Link: https://arxiv.org/abs/2501.09490
Authors: Maksim Filipenko, Ilya Afanasyev
Affiliations: Institute of Robotics, Innopolis University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 6 figures

Abstract:This article presents a comparative analysis of mobile robot trajectories computed by various ROS-based SLAM systems. For this purpose we developed a prototype of a mobile robot with common sensors: 2D lidar, a monocular and ZED stereo cameras. Then we conducted experiments in a typical office environment and collected data from all sensors, running all tested SLAM systems based on the acquired dataset. We studied the following SLAM systems: (a) 2D lidar-based: GMapping, Hector SLAM, Cartographer; (b) monocular camera-based: Large Scale Direct monocular SLAM (LSD SLAM), ORB SLAM, Direct Sparse Odometry (DSO); and (c) stereo camera-based: ZEDfu, Real-Time Appearance-Based Mapping (RTAB map), ORB SLAM, Stereo Parallel Tracking and Mapping (S-PTAM). Since all SLAM methods were tested on the same dataset we compared results for different SLAM systems with appropriate metrics, demonstrating encouraging results for lidar-based Cartographer SLAM, Monocular ORB SLAM and Stereo RTAB Map methods.

[CV-27] The Devil is in the Details: Simple Remedies for Image-to-LiDAR Representation Learning ACCV2024

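[Quick read]: This paper addresses design shortcomings in existing image-to-LiDAR distillation methods, which exploit the camera and LiDAR sensors used in autonomous driving. Prior work has concentrated on designing loss functions to distill pre-trained 2D image representations into a 3D model, while neglecting other key design elements such as the LiDAR coordinate system, quantization, and data utilization. These neglected elements have a significant effect on downstream tasks such as 3D semantic segmentation and 3D object detection.

The core of the solution is to improve design choices along the spatial and temporal axes. Spatially, existing methods use cylindrical coordinates and voxel sizes without considering their side effects on the input interface of sparse convolution layers, which causes spatial quantization errors in 3D models. Temporally, existing methods simplify data handling by discarding unsynchronized data, which restricts training to the small portion of data that is time-synchronized across sensors. The paper analyzes these effects and proposes a simple remedy for each overlooked aspect, which notably improves 3D semantic segmentation and 3D object detection performance.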

Link: https://arxiv.org/abs/2501.09485
Authors: Wonjun Jo, Kwon Byung-Ki, Kim Ji-Yeon, Hawook Jeong, Kyungdon Joo, Tae-Hyun Oh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACCV2024

Abstract:LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior arts focus on the designs of their own losses to effectively distill the pre-trained 2D image representations into a 3D model. However, the other parts of the designs have been surprisingly unexplored. We find that fundamental design elements, e.g., the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than developing loss functions, which have been overlooked in prior works. In this work, we show that simple fixes to these designs notably outperform existing methods by 16% in 3D semantic segmentation on the nuScenes dataset and 13% in 3D object detection on the KITTI dataset in downstream task performance. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinate and voxel sizes without considering their side effects yielded with a commonly deployed sparse convolution layer input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting the use to only the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.
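
To see where spatial quantization enters, the sketch below converts a LiDAR point to cylindrical coordinates and maps it to integer voxel indices, which is the kind of discrete input a sparse convolution layer consumes. It only illustrates the general mechanism; the exact side effect the authors analyze and their fix are not described in this listing. The function names, the choice to shift the angle by pi so indices are non-negative, and the voxel sizes are assumptions.

```python
import math


def to_cylindrical(x, y, z):
    """Cartesian (x, y, z) -> cylindrical (rho, phi, z), phi in (-pi, pi]."""
    return math.hypot(x, y), math.atan2(y, x), z


def cylindrical_voxel_index(x, y, z, voxel_size):
    """Integer voxel index of a point on a cylindrical grid.

    voxel_size is (d_rho, d_phi, d_z): bin widths for radius (metres),
    azimuth (radians) and height (metres).
    """
    rho, phi, height = to_cylindrical(x, y, z)
    d_rho, d_phi, d_z = voxel_size
    return (
        math.floor(rho / d_rho),
        math.floor((phi + math.pi) / d_phi),  # shift so indices start at 0
        math.floor(height / d_z),
    )


def angular_cell_width(rho, d_phi):
    """Arc length (metres) spanned by one azimuth bin at radius rho."""
    return rho * d_phi
```

`angular_cell_width` makes one consequence of a cylindrical grid explicit: with a fixed azimuth bin, cells are narrow near the sensor and wide far away, so the same voxel size implies very different metric resolution at different ranges.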

[CV-28] MonoSOWA: Scalable monocular 3D Object detector Without human Annotations

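[Quick read]: This paper addresses the reliance of monocular RGB camera 3D object detection on large amounts of human annotation. Conventional methods need a fully supervised training setup, which is laborious and costly and scales poorly as data volumes grow. The paper proposes a method that requires no domain-specific human annotations. By introducing a "Canonical Object Space", the method can train across multiple datasets and camera setups and can be applied directly to previously unseen camera setups. The key benefit is that it greatly expands the amount of usable training data while improving the detector's generalization across heterogeneous environments. Experiments on two standard autonomous driving datasets show that it outperforms existing methods that rely on 2D human annotations.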

Link: https://arxiv.org/abs/2501.09481
Authors: Jan Skvrna, Lukas Neumann
Affiliations: Czech Technical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Detecting the three-dimensional position and orientation of objects using a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. In this paper, we present the first method to train 3D object detectors for monocular RGB cameras without domain-specific human annotations, thus making orders of magnitude more data available for training. Thanks to newly proposed Canonical Object Space, the method can not only exploit data across a variety of datasets and camera setups to train a single 3D detector, but unlike previous work it also works out of the box in previously unseen camera setups. All this is crucial for practical applications, where the data and cameras are extremely heterogeneous. The method is evaluated on two standard autonomous driving datasets, where it outperforms previous works, which, unlike our method, still rely on 2D human annotations.

[CV-29] DEFOM-Stereo: Depth Foundation Model Based Stereo Matching

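[Quick read]: This paper addresses depth estimation in stereo matching, in particular real-world cases such as occlusion and textureless regions where conventional binocular matching cues cannot estimate disparity accurately. It proposes a new framework, DEFOM-Stereo, whose key idea is to integrate a monocular relative depth estimation model into a recurrent stereo matching framework. In the feature extraction stage, context and matching feature encoders are built by combining features from conventional convolutional neural networks (CNNs) and from DEFOM. In the update stage, the depth predicted by DEFOM initializes the recurrent disparity, and a scale update module refines the disparity at the correct scale. Experiments show that DEFOM-Stereo performs comparably to state-of-the-art (SOTA) methods on the Scene Flow dataset and shows much stronger zero-shot generalization. The model also reaches SOTA performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D benchmarks, ranking first on many metrics.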

Link: https://arxiv.org/abs/2501.09466
Authors: Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, Rui Huang
Affiliations: Insta360 Research; The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges like occlusion and non-texture hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. Thus, to facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo-matching framework, building a new framework for depth foundation model-based stereo-matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity at the correct scale. DEFOM-Stereo is verified to have comparable performance on the Scene Flow dataset with state-of-the-art (SOTA) methods and notably shows much stronger zero-shot generalization. Moreover, DEFOM-Stereo achieves SOTA performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D benchmarks, ranking 1st on many metrics. In the joint evaluation under the robust vision challenge, our model simultaneously outperforms previous models on the individual benchmarks. Both results demonstrate the outstanding capabilities of the proposed model.

[CV-30] RE-POSE: Synergizing Reinforcement Learning-Based Partitioning and Offloading for Edge Object Detection

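[Quick read]: This paper addresses the challenge of real-time object detection on edge devices, especially when computational resources are limited and high-resolution video must be processed. Conventional strategies such as input down-sampling and network up-scaling struggle to balance detection accuracy against inference latency: they either sacrifice accuracy for speed or increase latency. The paper proposes RE-POSE, a framework based on Reinforcement Learning (RL) driven partitioning and edge offloading that optimizes the accuracy-latency trade-off in resource-constrained edge environments. Its key components are: 1) an RL-Based Dynamic Clustering Algorithm (RL-DCA) that partitions video frames into non-uniform blocks according to object distribution and the computational characteristics of deep neural networks (DNNs); 2) a parallel edge offloading scheme that distributes these blocks across multiple edge servers for concurrent processing. Experiments show that RE-POSE significantly improves on existing methods in both detection accuracy and inference latency.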

Link: https://arxiv.org/abs/2501.09465
Authors: Jianrui Shi, Yong Zhao, Zeyang Cui, Xiaoming Shen, Minhang Zeng, Xiaojie Liu
Affiliations: Department of Computing, The Hong Kong Polytechnic University, Hong Kong; Pengcheng Laboratory, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Object detection plays a crucial role in smart video analysis, with applications ranging from autonomous driving and security to smart cities. However, achieving real-time object detection on edge devices presents significant challenges due to their limited computational resources and the high demands of deep neural network (DNN)-based detection models, particularly when processing high-resolution video. Conventional strategies, such as input down-sampling and network up-scaling, often compromise detection accuracy for faster performance or lead to higher inference latency. To address these issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-Driven Partitioning and Edge Offloading framework designed to optimize the accuracy-latency trade-off in resource-constrained edge environments. Our approach features an RL-Based Dynamic Clustering Algorithm (RL-DCA) that partitions video frames into non-uniform blocks based on object distribution and the computational characteristics of DNNs. Furthermore, a parallel edge offloading scheme is implemented to distribute these blocks across multiple edge servers for concurrent processing. Experimental evaluations show that RE-POSE significantly enhances detection accuracy and reduces inference latency, surpassing existing methods.

[CV-31] Normal-NeRF: Ambiguity-Robust Normal Estimation for Highly Reflective Scenes AAAI2025

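Quick read: This paper addresses the difficulty Neural Radiance Fields (NeRF) have in reconstructing and rendering highly reflective scenes, in particular the lack of robustness caused by the inherent shape ambiguity of specular surfaces. Existing methods usually rely on additional geometry priors to regularize shape prediction, which can oversmooth geometry in complex scenes. The paper proposes a transmittance-gradient-based normal estimation technique that stays robust under ambiguous shape conditions, and introduces a dual activated densities module that bridges the gap between smooth surface normals and sharp object boundaries. Combined with a reflection-aware appearance model, the method achieves robust reconstruction and high-fidelity rendering of scenes with strong specular reflections and intricate geometry. Experiments show that it outperforms existing state-of-the-art methods on several datasets.

Link: https://arxiv.org/abs/2501.09460
Authors: Ji Shi,Xianghua Ying,Ruohao Guo,Bowei Xing,Wenzhen Yue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025, code available at this https URL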

Abstract:Neural Radiance Fields (NeRF) often struggle with reconstructing and rendering highly reflective scenes. Recent advancements have developed various reflection-aware appearance models to enhance NeRF’s capability to render specular reflections. However, the robust reconstruction of highly reflective scenes is still hindered by the inherent shape ambiguity on specular surfaces. Existing methods typically rely on additional geometry priors to regularize the shape prediction, but this can lead to oversmoothed geometry in complex scenes. Observing the critical role of surface normals in parameterizing reflections, we introduce a transmittance-gradient-based normal estimation technique that remains robust even under ambiguous shape conditions. Furthermore, we propose a dual activated densities module that effectively bridges the gap between smooth surface normals and sharp object boundaries. Combined with a reflection-aware appearance model, our proposed method achieves robust reconstruction and high-fidelity rendering of scenes featuring both highly specular reflections and intricate geometric structures. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on various datasets.

[CV-32] On the Relation between Optical Aperture and Automotive Object Detection

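Quick read: This paper examines how aperture size and shape affect deep-learning-based automotive camera systems for tasks such as traffic sign recognition and light state detection. To narrow the domain gap between synthetic and real-world images, it proposes simulating optical effects with the point spread function (PSF). The key idea is to use the PSF to model optical distortions, which makes computer-generated scenes more realistic and improves simulation accuracy.

Link: https://arxiv.org/abs/2501.09456
Authors: Ofer Bar-Shalom,Tzvi Philipp,Eran Kishon
Affiliations: General Motors R&D Division, Software-Defined Vehicle Research (SDVR), General Motors Technical Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: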

Abstract:We explore the impact of aperture size and shape on automotive camera systems for deep-learning-based tasks like traffic sign recognition and light state detection. A method is proposed to simulate optical effects using the point spread function (PSF), enhancing realism and reducing the domain gap between synthetic and real-world images. Computer-generated scenes are refined with this technique to model optical distortions and improve simulation accuracy.
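
As a rough illustration of PSF-based simulation, the sketch below blurs a grayscale image by convolving it with a PSF. It assumes a uniform disk kernel as a toy stand-in for a circular aperture; the paper's actual PSF model is not described in the abstract, and a realistic PSF would come from optical design data or measurement. The names `disk_psf` and `apply_psf` are hypothetical.

```python
def disk_psf(radius):
    """Uniform disk kernel normalized to sum to 1 (toy stand-in for a measured PSF)."""
    size = 2 * radius + 1
    kernel = [[1.0 if (x - radius) ** 2 + (y - radius) ** 2 <= radius ** 2 else 0.0
               for x in range(size)] for y in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def apply_psf(image, psf):
    """Convolve a 2D grayscale image (list of rows) with a PSF, replicating border pixels."""
    h, w = len(image), len(image[0])
    r = len(psf) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky, row in enumerate(psf):
                for kx, weight in enumerate(row):
                    # Convolution flips the kernel; out-of-range indices are clamped.
                    sy = min(max(y - (ky - r), 0), h - 1)
                    sx = min(max(x - (kx - r), 0), w - 1)
                    acc += weight * image[sy][sx]
            out[y][x] = acc
    return out

# A single bright pixel spreads into the shape of the PSF.
point = [[0.0] * 5 for _ in range(5)]
point[2][2] = 1.0
blurred = apply_psf(point, disk_psf(1))
```

Normalizing the kernel to sum to 1 keeps overall brightness unchanged, so only the spatial distribution of light is altered.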

[CV-33] Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness

Quick read: This paper addresses the robustness of vision-language models to adversarial visual perturbations with a new approach called "double visual defense". Unlike earlier work that applies only lightweight adversarial fine-tuning to a pre-trained CLIP model, the authors perform large-scale adversarial vision-language pre-training from scratch on web-scale data, then strengthen the defense with adversarial visual instruction tuning. The two resulting models, ΔCLIP and Δ²LLaVA, show substantially better zero-shot robustness and set a new state of the art in adversarial defense for vision-language models. For example, ΔCLIP improves adversarial robustness on ImageNet-1k by about 20% over the previous best models, and Δ²LLaVA improves robustness by about 30% on image captioning and about 20% on visual question answering. The models also show stronger zero-shot recognition, fewer hallucinations, and better reasoning performance than the baselines.

Link: https://arxiv.org/abs/2501.09446
Authors: Zeyu Wang,Cihang Xie,Brian Bartoldson,Bhavya Kailkhura
Affiliations: UC Santa Cruz; Lawrence Livermore National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance this robustness. Unlike previous approaches that resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch using web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from each stage, ΔCLIP and Δ²LLaVA, show substantially enhanced zero-shot robustness and set a new state-of-the-art in adversarial defense for vision-language models. For example, the adversarial robustness of ΔCLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, Δ²LLaVA brings a ~30% robustness improvement on the image captioning task and a ~20% robustness improvement on the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines. Our project page is this https URL.

[CV-34] Scaling up self-supervised learning for improved surgical foundation models

Quick read: This paper addresses the limited use of foundation models in surgical computer vision. Although foundation models perform well across many computer vision tasks, their application to surgery remains underexplored. The study introduces SurgeNetXL, a surgical foundation model that relies on large-scale pretraining. Its key elements are: 1) training on the largest reported surgical dataset to date, with over 4.7 million video frames; 2) significant performance gains across four surgical procedures and three tasks (semantic segmentation, phase recognition, and critical view of safety classification); and 3) better generalizability and robustness in data-scarce scenarios, obtained by optimizing the model architecture, extending training duration, and scaling the pretraining dataset. Together these contributions offer a comprehensive framework for future research in surgical computer vision.

Link: https://arxiv.org/abs/2501.09436
Authors: Tim J.M. Jaspers,Ronald L.P.D. de Jong,Yiping Li,Carolus H.J. Kusters,Franciscus H.A. Bakker,Romy C. van Jaarsveld,Gino M. Kuiper,Richard van Hillegersberg,Jelle P. Ruurda,Willem M. Brinkman,Josien P.W. Pluim,Peter H.N. de With,Marcel Breeuwer,Yasmina Al Khalil,Fons van der Sommen
Affiliations: Department of Electrical Engineering, Video Coding & Architectures, Eindhoven University of Technology, Eindhoven, The Netherlands; Department of Biomedical Engineering, Medical Image Analysis, Eindhoven University of Technology, Eindhoven, The Netherlands; Department of Surgery, University Medical Center Utrecht, Utrecht, The Netherlands; Department of Oncological Urology, University Medical Center Utrecht, Utrecht, The Netherlands; Department of Urology, Catharina Hospital, Eindhoven, The Netherlands
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: this https URL.

[CV-35] CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation

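Quick read: This paper addresses the problems current 3D generative models face when producing high-quality 3D assets: multi-view inconsistency, slow generation, low fidelity, and surface reconstruction issues. It proposes CaPa, a framework built around a two-stage process. First, a 3D latent diffusion model generates geometry guided by multi-view inputs, which keeps the structure consistent across views. Second, a novel, model-agnostic Spatially Decoupled Attention synthesizes high-resolution textures (up to 4K). CaPa also introduces a 3D-aware occlusion inpainting algorithm that fills untextured regions so that results are consistent across the whole model. The framework generates high-quality 3D assets in under 30 seconds, is suited to commercial applications, and performs well in both texture fidelity and geometric stability.

Link: https://arxiv.org/abs/2501.09433
Authors: Hwan Heo,Jangyeong Kim,Seongyeong Lee,Jeong A Wi,Junyoung Choi,Sangjun Ahn
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: project page: this https URL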

Abstract:The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce CaPa, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, resulting in cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.

[CV-36] AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring AAAI2025

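Quick read: This paper addresses two main problems in 3D visual grounding (3DVG): the limited amount and diversity of text-3D pairs available for training, and the failure of existing methods to exploit rich contextual clues in the 3D visual space, such as spatial relations. It proposes AugRefer, whose key contributions are: 1) cross-modal augmentation, which places objects into 3D scenes and uses foundation models to create accurate, semantically rich descriptions, producing diverse text-3D pairs that can enrich the training data of existing 3DVG methods; and 2) a language-spatial adaptive decoder that adapts the candidate referring objects based on the language description and various 3D spatial relations. Experiments show that AugRefer significantly improves 3D visual grounding performance on three benchmark datasets.

Link: https://arxiv.org/abs/2501.09428
Authors: Xinyi Wang,Na Zhao,Zhiyuan Han,Dan Guo,Xun Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025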

Abstract:3D visual grounding (3DVG), which aims to correlate a natural language description with the target object within a 3D scene, is a significant yet challenging task. Despite recent advancements in this domain, existing approaches commonly encounter a shortage: a limited amount and diversity of text-3D pairs available for training. Moreover, they fall short in effectively leveraging different contextual clues (e.g., rich spatial relations within the 3D visual space) for grounding. To address these limitations, we propose AugRefer, a novel approach for advancing 3D visual grounding. AugRefer introduces cross-modal augmentation designed to extensively generate diverse text-3D pairs by placing objects into 3D scenes and creating accurate and semantically rich descriptions using foundation models. Notably, the resulting pairs can be utilized by any existing 3DVG method to enrich its training data. Additionally, AugRefer presents a language-spatial adaptive decoder that effectively adapts the potential referring objects based on the language description and various 3D spatial relations. Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer.

[CV-37] Dynamic Neural Style Transfer for Artistic Image Generation using VGG19

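Quick read: This paper addresses several limitations of current neural style transfer techniques: long processing times, a restricted choice of style images, and the inability to adjust style weight ratios. The authors propose a neural style transfer system based on the VGG19 model. Its key feature is flexible adjustment of style weight ratios; using VGG19 for feature extraction, it aims for high-quality, flexible stylization without compromising content integrity. The system is also reported to reduce processing time significantly, improving overall efficiency.

Link: https://arxiv.org/abs/2501.09420
Authors: Kapil Kashyap,Mehak Garg,Sean Fargose,Sindhu Nair
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: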

Abstract:Throughout history, humans have created remarkable works of art, but artificial intelligence has only recently started to make strides in generating visually compelling art. Breakthroughs in the past few years have focused on using convolutional neural networks (CNNs) to separate and manipulate the content and style of images, applying texture synthesis techniques. Nevertheless, a number of current techniques continue to encounter obstacles, including lengthy processing times, restricted choices of style images, and the inability to modify the weight ratio of styles. To address these constraints, we propose a neural style transfer system that can add various artistic styles to a desired image, allowing flexible adjustments to style weight ratios and reducing processing time. The system uses the VGG19 model for feature extraction, ensuring high-quality, flexible stylization without compromising content integrity.
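
To make the idea of adjustable style weight ratios concrete, here is a minimal sketch of the Gram-matrix style loss commonly used in CNN-based style transfer, blended across several style images. It is an assumption about how such a system could weight styles, not the authors' implementation: feature maps are taken as given (in practice they would come from VGG19 layers), and the function names and the division by the number of positions are illustrative choices.

```python
def gram_matrix(features):
    """features: C channels, each a flat list of N activations. Returns the C x C Gram matrix / N."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features] for fi in features]

def style_loss(generated, style):
    """Mean squared difference between the Gram matrices of two feature maps."""
    g, s = gram_matrix(generated), gram_matrix(style)
    c = len(g)
    return sum((g[i][j] - s[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)

def blended_style_loss(generated, styles, ratios):
    """Weighted sum of per-style losses; ratios are normalized so they sum to 1."""
    total = sum(ratios)
    return sum((r / total) * style_loss(generated, s) for s, r in zip(styles, ratios))

# Toy 2-channel feature maps with 2 spatial positions each (hypothetical values).
generated = [[1.0, 0.0], [0.0, 1.0]]
style_a = [[1.0, 0.0], [0.0, 1.0]]
style_b = [[1.0, 1.0], [1.0, 1.0]]
loss = blended_style_loss(generated, [style_a, style_b], ratios=[3, 1])
```

Raising the ratio for one style increases its share of the loss, so an optimizer minimizing this quantity would pull the output towards that style's texture statistics.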

[CV-38] Towards Robust and Realistic Human Pose Estimation via WiFi Signals

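Quick read: This paper addresses two key problems in WiFi-based human pose estimation: the cross-domain gap and the structural fidelity gap. The cross-domain gap refers to significant differences between the pose distributions of the source and target domains; the structural fidelity gap shows up as distorted topology in predicted skeletal poses, such as misplaced joints and disproportionate bone lengths. The paper proposes a two-phase framework called DT-Pose. The first phase learns domain-consistent and motion-discriminative WiFi representations through a temporal-consistent contrastive learning strategy and self-supervised masking-reconstruction operations. The second phase introduces a pose decoder that combines Graph Convolution Network (GCN) and Transformer layers and constrains the topology of the generated skeleton by exploring local-global relationships among human joints. Experiments show strong performance on 2D/3D human pose estimation tasks.

Link: https://arxiv.org/abs/2501.09411
Authors: Yang Chen,Jingcai Guo,Song Guo,Jingren Zhou,Dacheng Tao
Affiliations: The Hong Kong Polytechnic University; Hong Kong University of Science and Technology; Alibaba Group; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures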

Abstract:Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) cross-domain gap, i.e., due to significant variations between source-target domain pose distributions; and 2) structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolution Network (GCN) and Transformer layers to constrain the topology structure of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D/3D human pose estimation tasks.

[CV-39] SVIA: A Street View Image Anonymization Framework for Self-Driving Applications ITSC2024

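Quick read: This paper addresses privacy protection for street view images in self-driving applications. Existing image anonymization techniques focus on de-identifying faces and individuals, but in street view images that may not be enough, because street view elements such as vehicles and buildings can still reveal locations, trajectories, and other sensitive information. The paper therefore proposes a Street View Image Anonymization (SVIA) framework that aims to protect the privacy of users, pedestrians, and vehicles. It has three key components: a semantic segmenter that divides the input image into functional regions, an inpainter that generates replacement content for privacy-sensitive regions, and a harmonizer that seamlessly stitches the modified regions to keep the image visually coherent. Compared with existing methods, SVIA achieves a better trade-off between image generation quality and privacy protection, as shown by results on five common metrics over two widely used public datasets.

Link: https://arxiv.org/abs/2501.09393
Authors: Dongyu Liu,Xuhong Wang,Cen Chen,Yanhao Wang,Shengyue Yao,Yilun Lin
Affiliations: School of Data Science and Engineering, East China Normal University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures, 3 tables. Accepted by IEEE ITSC 2024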

Abstract:In recent years, there has been an increasing interest in image anonymization, particularly focusing on the de-identification of faces and individuals. However, for self-driving applications, merely de-identifying faces and individuals might not provide sufficient privacy protection since street views like vehicles and buildings can still disclose locations, trajectories, and other sensitive information. Therefore, it remains crucial to extend anonymization techniques to street view images to fully preserve the privacy of users, pedestrians, and vehicles. In this paper, we propose a Street View Image Anonymization (SVIA) framework for self-driving applications. The SVIA framework consists of three integral components: a semantic segmenter to segment an input image into functional regions, an inpainter to generate alternatives to privacy-sensitive regions, and a harmonizer to seamlessly stitch modified regions to guarantee visual coherence. Compared to existing methods, SVIA achieves a much better trade-off between image generation quality and privacy protection, as evidenced by experimental results for five common metrics on two widely used public datasets.

[CV-40] Image Segmentation with transformers: An Overview Challenges and Future

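Quick read: This paper discusses the limitations of traditional convolutional neural networks (CNNs) in image segmentation, including difficulty capturing complex spatial dependencies, handling objects at multiple scales, reliance on manually designed architecture components, and lack of contextual information. In response, it examines the shift towards transformer-based models. Through self-attention, transformers capture global context and long-range dependencies better than CNNs. The paper reviews state-of-the-art transformer-based segmentation models and discusses how they address segmentation-specific challenges. It also outlines future research directions, such as lightweight architectures and improved data efficiency.

Link: https://arxiv.org/abs/2501.09372
Authors: Deepjyoti Chetia,Debasish Dutta,Sanjib Kr Kalita
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: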

Abstract:Image segmentation, a key task in computer vision, has traditionally relied on convolutional neural networks (CNNs), yet these models struggle to capture complex spatial dependencies and contextual information, handle objects at varying scales poorly, and rely on manually crafted architecture components. This paper explores the shortcomings of CNN-based models and the shift towards transformer architectures to overcome those limitations. This work reviews state-of-the-art transformer-based segmentation models, addressing segmentation-specific challenges and their solutions. The paper discusses current challenges in transformer-based segmentation and outlines promising future trends, such as lightweight architectures and enhanced data efficiency. This survey serves as a guide for understanding the impact of transformers in advancing segmentation capabilities and overcoming the limitations of traditional models.

[CV-41] Identification of Traditional Medicinal Plant Leaves Using an effective Deep Learning model and Self-Curated Dataset

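Quick read: This paper addresses accurate identification of medicinal plants used in traditional and modern medicine production, particularly in Ayurveda, an ancient Indian medical system, where some plants look so similar that collection and extraction depend on human experts. To reduce that dependence and improve identification accuracy, the paper proposes a computer vision solution. Its core is a custom convolutional neural network (CNN) architecture with 6 convolution layers, max-pooling layers, and dense layers. The model was tested on three datasets, the Indian Medicinal Leaves Image Dataset, the MED117 Medicinal Plant Leaf Dataset, and a dataset curated by the authors, reaching accuracies of 99.5%, 98.4%, and 99.7% respectively, using several optimizers including Adam, RMSprop, and SGD with momentum.

Link: https://arxiv.org/abs/2501.09363
Authors: Deepjyoti Chetia,Sanjib Kr Kalita,Prof Partha Pratim Baruah,Debasish Dutta,Tanaz Akhter
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: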

Abstract:Medicinal plants have been a key component in producing traditional and modern medicines, especially in the field of Ayurveda, an ancient Indian medical system. In producing these medicines, collecting and extracting the right plant is a crucial step because some plants are visually similar. Distinguishing these plants from nonmedicinal plants requires human expert intervention. To identify plants accurately and reduce the need for a human expert in the collection process, computer vision methods can be efficient and beneficial. In this paper, we propose a model that addresses these issues. The proposed model is a custom convolutional neural network (CNN) architecture with 6 convolution layers, max-pooling layers, and dense layers. The model was tested on three datasets: the Indian Medicinal Leaves Image Dataset, the MED117 Medicinal Plant Leaf Dataset, and a dataset curated by the authors. The proposed model achieved respective accuracies of 99.5%, 98.4%, and 99.7% using various optimizers including Adam, RMSprop, and SGD with momentum.

[CV-42] Strategic Base Representation Learning via Feature Augmentations for Few-Shot Class Incremental Learning WACV2025

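Quick read: This paper addresses Few-shot Class Incremental Learning (FSCIL), in which a model learns new classes from only a few training samples while retaining knowledge of previously learned classes. Existing methods typically freeze the parameters of previously learned classes while learning new ones, which leads to poor separation between old and new classes and degrades performance on the old classes. The paper proposes a feature augmentation driven contrastive learning framework. The key idea is to augment feature vectors and assign them proxy labels, expanding the feature space so that new classes can be integrated seamlessly into it. A self-supervised contrastive loss further improves the separation between previously learned classes. Experiments show that the framework significantly outperforms other methods on three FSCIL benchmarks (CIFAR100, miniImageNet, and CUB200), achieving state-of-the-art performance.

Link: https://arxiv.org/abs/2501.09361
Authors: Parinita Nema,Vinod K Kurmi
Affiliations: Indian Institute of Science Education and Research Bhopal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025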

Abstract:Few-shot class incremental learning requires the model to learn new classes from a small number of training instances while retaining knowledge of previously learned classes. Existing frameworks typically freeze the parameters of the previously learned classes during the incorporation of new classes. However, this approach often results in suboptimal class separation of previously learned classes, leading to overlap between old and new classes. Consequently, performance on old classes degrades as new classes are added. To address these challenges, we propose a novel feature augmentation driven contrastive learning framework designed to enhance the separation of previously learned classes to accommodate new classes. Our approach involves augmenting feature vectors and assigning proxy labels to these vectors. This strategy expands the feature space, ensuring seamless integration of new classes within the expanded space. Additionally, we employ a self-supervised contrastive loss to improve the separation between previous classes. We validate our framework through experiments on three FSCIL benchmark datasets: CIFAR100, miniImageNet, and CUB200. The results demonstrate that our Feature Augmentation driven Contrastive Learning framework significantly outperforms other approaches, achieving state-of-the-art performance.

[CV-43] YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks

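Quick read: This paper addresses how multimodal AI agents can intervene proactively, stepping in under specific circumstances to help users correct mistakes in a task or to offer guidance. Existing AI agents, whether Large Language Models (LLMs) or multimodal Vision-Language Models (VLMs), are usually reactive: they act only after receiving a user prompt. The proposed YETI (YET to Intervene) multimodal agent uses the egocentric multimodal (audio and video) observations provided by Augmented Reality (AR) devices, together with scene understanding signals and an alignment signal, to identify when proactive intervention is needed. Specifically, YETI analyzes the Structural Similarity (SSIM) of consecutive video frames to understand the scene and learns to judge whether the user's actions are consistent with the expected task, then decides when to step in. The approach is validated on the HoloAssist multimodal benchmark, showing its effectiveness in guiding users through procedural tasks.

Link: https://arxiv.org/abs/2501.09355
Authors: Saptarashmi Bandyopadhyay,Vikas Bahirwani,Lavisha Aggarwal,Bhanu Guda,Lin Li,Andrea Colaco
Affiliations: University of Maryland, College Park; Google
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments: Preprint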

Abstract:Multimodal AI Agents are AI models that have the capability of interactively and cooperatively assisting human users to solve day-to-day tasks. Augmented Reality (AR) head worn devices can uniquely improve the user experience of solving procedural day-to-day tasks by providing egocentric multimodal (audio and video) observational capabilities to AI Agents. Such AR capabilities can help AI Agents see and listen to actions that users take which can relate to multimodal capabilities of human users. Existing AI Agents, either Large Language Models (LLMs) or Multimodal Vision-Language Models (VLMs) are reactive in nature, which means that models cannot take an action without reading or listening to the human user’s prompts. Proactivity of AI Agents on the other hand can help the human user detect and correct any mistakes in agent observed tasks, encourage users when they do tasks correctly or simply engage in conversation with the user - akin to a human teaching or assisting a user. Our proposed YET to Intervene (YETI) multimodal agent focuses on the research question of identifying circumstances that may require the agent to intervene proactively. This allows the agent to understand when it can intervene in a conversation with human users that can help the user correct mistakes on tasks, like cooking, using AR. Our YETI Agent learns scene understanding signals based on interpretable notions of Structural Similarity (SSIM) on consecutive video frames. We also define the alignment signal which the AI Agent can learn to identify if the video frames corresponding to the user’s actions on the task are consistent with expected actions. These signals are used by our AI Agent to determine when it should proactively intervene. We compare our results on the instances of proactive intervention in the HoloAssist multimodal benchmark for an expert agent guiding a user to complete procedural tasks.
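
The SSIM-based scene signal can be sketched as follows. This is a simplified, single-window (global) form of SSIM computed over whole frames; standard implementations average SSIM over local windows, and the abstract does not say how YETI turns SSIM into a signal. The `scene_changed` helper and its threshold of 0.8 are hypothetical.

```python
def global_ssim(frame_a, frame_b, dynamic_range=255.0):
    """Single-window SSIM between two frames given as flat lists of pixel intensities."""
    n = len(frame_a)
    mu_a = sum(frame_a) / n
    mu_b = sum(frame_b) / n
    var_a = sum((a - mu_a) ** 2 for a in frame_a) / n
    var_b = sum((b - mu_b) ** 2 for b in frame_b) / n
    cov = sum((a - mu_a) * (b - mu_b) for a, b in zip(frame_a, frame_b)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def scene_changed(frame_a, frame_b, threshold=0.8):
    """Flag a scene change when consecutive frames are dissimilar (threshold is an assumption)."""
    return global_ssim(frame_a, frame_b) < threshold

previous = [0.0, 255.0, 0.0, 255.0]
current = [255.0, 0.0, 255.0, 0.0]  # inverted pattern, so the structure is anti-correlated
changed = scene_changed(previous, current)
```

A low SSIM between consecutive frames suggests the scene has changed, which is one plausible cue for checking whether the user has moved on to a new step.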

[CV-44] Making Your Dreams A Reality: Decoding the Dreams into a Coherent Video Story from fMRI Signals

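Quick read: This paper addresses how to turn dreams into coherent video narratives and explores the visual experience of dreaming. The key is to combine functional magnetic resonance imaging (fMRI) data with subjective dream reports, in three main steps: reconstructing visual perception, decoding dream imagery, and integrating dream stories. Using new fMRI analysis techniques and language modeling, the study aims to push the boundaries of dream research, deepen understanding of visual experience during sleep, and generate complete dream video narratives.

Link: https://arxiv.org/abs/2501.09350
Authors: Yanwei Fu,Jianxiong Gao,Baofeng Yang,Jianfeng Feng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress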

Abstract:This paper explores a brave new idea for the multimedia community and proposes a novel framework to convert dreams into coherent video narratives using fMRI data. Dreams have intrigued humanity for centuries, offering glimpses into our subconscious minds. Recent advancements in brain imaging, particularly functional magnetic resonance imaging (fMRI), have provided new ways to explore the neural basis of dreaming. By combining subjective dream experiences with objective neurophysiological data, we aim to understand the visual aspects of dreams and create complete video narratives. Our process involves three main steps: reconstructing visual perception, decoding dream imagery, and integrating dream stories. Using innovative techniques in fMRI analysis and language modeling, we seek to push the boundaries of dream research and gain deeper insights into visual experiences during sleep. This technical report introduces a novel approach to visually decoding dreams using fMRI signals and weaving dream visuals into narratives using language models. We gather a dataset of dreams along with descriptions to assess the effectiveness of our framework.

[CV-45] UVRM: A Scalable 3D Reconstruction Model from Unposed Videos

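Quick read: This paper addresses the dependence of 3D reconstruction on known camera poses. Obtaining poses is time-consuming and error-prone in traditional pipelines, which has confined the training of 3D reconstruction models to synthetic 3D datasets or small-scale datasets with annotated poses. The paper proposes UVRM, a 3D reconstruction model that can be trained and evaluated on monocular videos without any pose information. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is decoded into a tri-plane 3D representation. It also combines Score Distillation Sampling (SDS) with an analysis-by-synthesis approach, progressively synthesizing pseudo novel views with a pre-trained diffusion model, which removes the need for ground-truth pose annotations during training. Experiments show that UVRM can reconstruct a wide range of 3D objects from unposed videos effectively and efficiently.

Link: https://arxiv.org/abs/2501.09347
Authors: Shiu-hong Kao,Xiao Li,Jinglu Wang,Chi-Keung Tang,Yu-Wing Tai,Yan Lu
Affiliations: Microsoft Research Asia; Hong Kong University of Science and Technology; Dartmouth College
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: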

Abstract:Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundational models. Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples, a process that is both time-consuming and prone to errors. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction using unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM employs a combination of the score distillation sampling (SDS) method and an analysis-by-synthesis approach, progressively synthesizing pseudo novel-views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM’s performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM is capable of effectively and efficiently reconstructing a wide range of 3D objects from unposed videos.

[CV-46] SE-BSFV: Online Subspace Learning based Shadow Enhancement and Background Suppression for ViSAR under Complex Background

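Quick read: This paper addresses the difficulty of distinguishing target shadows from low-scattering background regions in moving target detection (MTD) with video synthetic aperture radar (ViSAR), a problem that raises both missed-detection and false-alarm rates. It proposes the Shadow Enhancement and Background Suppression for ViSAR (SE-BSFV) algorithm, based on low-rank representation (LRR) theory and online subspace learning. The key steps are: first, register the ViSAR images with a registration algorithm and model the ViSAR data with a Gaussian mixture distribution (GMD); second, use knowledge from previous frames to estimate the GMD parameters of the current frame and estimate the subspace parameters with the Expectation-Maximization (EM) algorithm, which yields the foreground matrix of the current frame; finally, use the alternating direction method of multipliers (ADMM) to remove strong scattering objects from the foreground matrix and obtain the final result. Experiments show that SE-BSFV significantly enhances shadow saliency and greatly improves detection performance while remaining efficient.

Link: https://arxiv.org/abs/2501.09341
Authors: Shangqu Yan,Chenyang Luo,Yaowen Fu,Wenpeng Zhang,Wei Yang,Ruofeng Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: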

Abstract:Video synthetic aperture radar (ViSAR) has attracted substantial attention in the moving target detection (MTD) field due to its ability to continuously monitor changes in the target area. In ViSAR, the shadows of moving targets are neither displaced nor defocused, which makes them a widely used feature for MTD. However, the shadows are difficult to distinguish from the low scattering region in the background, which causes more missed detections and false alarms. Therefore, it is worth investigating how to enhance the distinction between the shadows and background. In this study, we propose the Shadow Enhancement and Background Suppression for ViSAR (SE-BSFV) algorithm. The SE-BSFV algorithm is based on the low-rank representation (LRR) theory and adopts an online subspace learning technique to enhance shadows and suppress background for ViSAR images. Firstly, we use a registration algorithm to register the ViSAR images and utilize Gaussian mixture distribution (GMD) to model the ViSAR data. Secondly, the knowledge learned from the previous frames is leveraged to estimate the GMD parameters of the current frame, and the Expectation-maximization (EM) algorithm is used to estimate the subspace parameters. Then, the foreground matrix of the current frame can be obtained. Finally, the alternating direction method of multipliers (ADMM) is used to eliminate strong scattering objects in the foreground matrix to obtain the final results. The experimental results indicate that the SE-BSFV algorithm significantly enhances the shadows’ saliency and greatly improves the detection performance while ensuring efficiency compared with several other advanced pre-processing algorithms.

[CV-47] Prompt-CAM: A Simpler Interpretable Transformer for Fine-Grained Analysis

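Quick read: This paper addresses fine-grained analysis: accurately identifying and localizing the traits that distinguish visually similar categories, such as different bird species or dog breeds. Saliency maps from existing methods such as Grad-CAM usually localize the whole object only coarsely and cannot pinpoint the distinguishing traits. The authors propose Prompt Class Attention Map (Prompt-CAM). Its key idea is to learn class-specific prompts for a pre-trained Vision Transformer (ViT) and to classify using the outputs of those prompts. To classify an image correctly, the true-class prompt must attend to the unique image patches that do not appear in images of other classes (that is, the traits), so the multi-head attention maps reveal the traits and their locations. Prompt-CAM is simple to implement: it only modifies the prediction head of Visual Prompt Tuning (VPT), which makes it easy to train and apply, in contrast to other interpretable methods that require purpose-built models and training procedures.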

Link: https://arxiv.org/abs/2501.09333
Authors: Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao
Affiliations: The Ohio State University; University of Tsukuba; Virginia Tech; Princeton University; Rensselaer Polytechnic Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as different bird species or dog breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, using saliency maps like Grad-CAM can hardly point out the traits: they often locate the whole object by a blurred, coarse heatmap, not traits. We propose a novel approach Prompt Class Attention Map (Prompt-CAM) to the rescue. Prompt-CAM learns class-specific prompts to a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes’ images, i.e., traits. As such, the true class’s multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch by simply modifying the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, sharply contrasting other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM superior interpretation capability.
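
The mechanism described above can be sketched in a few lines. This is a minimal toy, not the authors' implementation: it assumes single-head dot-product attention, hand-picked 2-D vectors, and a shared scoring vector applied to each prompt's output. The names `prompt_attention`, `classify`, and `score_vector` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prompt_attention(prompts, patches):
    """For each class prompt, return (attention over patches, attended output)."""
    dim = len(patches[0])
    scale = math.sqrt(dim)
    results = []
    for query in prompts:
        attn = softmax([dot(query, p) / scale for p in patches])
        out = [sum(a * p[d] for a, p in zip(attn, patches)) for d in range(dim)]
        results.append((attn, out))
    return results


def classify(prompts, patches, score_vector):
    """Score every class from its own prompt output (assumed shared scorer).

    Returns the predicted class and that class's attention map, which
    doubles as the explanation of where the class evidence was found.
    """
    results = prompt_attention(prompts, patches)
    logits = [dot(out, score_vector) for _, out in results]
    pred = max(range(len(logits)), key=logits.__getitem__)
    return pred, results[pred][0]
```

In this toy, the attention map returned for the predicted class is the quantity one would visualize over image patches; a real ViT would use multiple heads and learned projections.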

[CV-48] Soft Knowledge Distillation with Multi-Dimensional Cross-Net Attention for Image Restoration Models Compression ICASSP2025

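Quick read: This paper tackles the high computational cost of Transformer-based encoder-decoder models for image restoration, which shows up as high FLOPs and parameter counts and limits real-world deployment. Existing knowledge distillation methods typically have a lightweight student directly mimic the teacher's intermediate features and reconstruction results, overlooking the implicit attention relationships between the two. The authors propose a Soft Knowledge Distillation (SKD) strategy with a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. MCA lets the student and teacher interact across both channel and spatial dimensions so that the student implicitly learns the attention matrices. A Gaussian kernel function measures the distance between student and teacher features in kernel space for stable and efficient feature learning. To further improve reconstruction quality, a contrastive learning loss replaces the commonly used L1 or KL divergence loss. Experiments show that SKD significantly reduces computational complexity while maintaining strong restoration performance.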

Link: https://arxiv.org/abs/2501.09321
Authors: Yongheng Zhang, Danfeng Yan
Affiliations: State Key Laboratory of Networking and Switching Technology, BUPT (Beijing University of Posts and Telecommunications)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP2025

Abstract:Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
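
To make the "distance in kernel space" idea concrete, here is a small sketch. It uses the standard kernel-trick identity ||phi(s) - phi(t)||^2 = k(s,s) - 2k(s,t) + k(t,t) with a Gaussian kernel; whether the paper uses exactly this form, and the bandwidth `sigma`, are assumptions. The function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kernel_space_distance(student, teacher, sigma=1.0):
    """Squared distance between the two feature vectors after the implicit
    Gaussian feature map, computed without ever forming that map."""
    return (
        gaussian_kernel(student, student, sigma)
        - 2.0 * gaussian_kernel(student, teacher, sigma)
        + gaussian_kernel(teacher, teacher, sigma)
    )
```

Because k(x, x) = 1 for a Gaussian kernel, this distance equals 2 - 2k(s, t) and never exceeds 2, so a single badly mismatched feature cannot dominate the loss the way it can with an unbounded L2 penalty.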

[CV-49] Finding the Trigger: Causal Abductive Reasoning on Video Events

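Quick read: This paper introduces a new problem, Causal Abductive Reasoning on Video Events (CARVE): identifying causal relationships between events in a video and generating hypotheses about the causal chains that explain a target event. To support research on it, the authors create two new benchmark datasets covering synthetic and realistic videos, with trigger-target labels generated by a novel counterfactual synthesis approach. Their proposed solution is the Causal Event Relation Network (CERN), which examines relationships between video events in temporal and semantic spaces to efficiently determine the root-cause trigger events. Extensive experiments demonstrate the critical roles of event relational representation learning and interaction modeling in video causal reasoning. The authors expect the CARVE task, its datasets, and the CERN framework to advance research on video causal reasoning and to benefit applications such as video surveillance, root-cause analysis, and movie content management.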

Link: https://arxiv.org/abs/2501.09304
Authors: Thao Minh Le, Vuong Le, Kien Do, Sunil Gupta, Svetha Venkatesh, Truyen Tran
Affiliations: Applied Artificial Intelligence Institute, Deakin University, Australia; Amazon, Melbourne, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces a new problem, Causal Abductive Reasoning on Video Events (CARVE), which involves identifying causal relationships between events in a video and generating hypotheses about causal chains that account for the occurrence of a target event. To facilitate research in this direction, we create two new benchmark datasets with both synthetic and realistic videos, accompanied by trigger-target labels generated through a novel counterfactual synthesis approach. To explore the challenge of solving CARVE, we present a Causal Event Relation Network (CERN) that examines the relationships between video events in temporal and semantic spaces to efficiently determine the root-cause trigger events. Through extensive experiments, we demonstrate the critical roles of event relational representation learning and interaction modeling in solving video causal reasoning challenges. The introduction of the CARVE task, along with the accompanying datasets and the CERN framework, will advance future research on video causal reasoning and significantly facilitate various applications, including video surveillance, root-cause analysis and movie content management.

[CV-50] Creating Virtual Environments with 3D Gaussian Splatting: A Comparative Study

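Quick read: This paper examines how well 3D Gaussian Splatting (3DGS) works in practice for creating virtual environments (VEs), particularly for extended reality (XR). Although 3DGS is regarded as an innovative and efficient 3D representation technique, its practical effectiveness remains underexplored. The authors compare three distinct 3DGS-based approaches to evaluate the feasibility of creating immersive VEs, identify limitations in XR applications, and discuss future research and development directions. The approach leverages the strengths of 3DGS for efficient, visually compelling scene representation, while the comparison exposes the practical challenges and room for improvement.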

Link: https://arxiv.org/abs/2501.09302
Authors: Shi Qiu, Binzhu Xie, Qixuan Liu, Pheng-Ann Heng
Affiliations: The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments: IEEE VR 2025 Posters

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as an innovative and efficient 3D representation technique. While its potential for extended reality (XR) applications is frequently highlighted, its practical effectiveness remains underexplored. In this work, we examine three distinct 3DGS-based approaches for virtual environment (VE) creation, leveraging their unique strengths for efficient and visually compelling scene representation. By conducting a comparable study, we evaluate the feasibility of 3DGS in creating immersive VEs, identify its limitations in XR applications, and discuss future research and development opportunities.

[CV-51] SoccerSynth-Detection: A Synthetic Dataset for Soccer Player Detection

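Quick read: This paper addresses the limited diversity and availability of datasets for player detection in soccer video analysis. Existing datasets such as SoccerNet-Tracking and SportsMOT are constrained by copyright restrictions and lack diversity, which hinders algorithms from adapting to varied soccer video contexts. The authors present SoccerSynth-Detection, described as the first synthetic dataset designed for detecting synthetic soccer players. It includes randomized lighting and textures as well as simulated camera motion blur, which improves generalization. In the experiments, SoccerSynth-Detection performs well in both transfer and pre-training tests, and it significantly outperforms real datasets on motion-blurred images. The key to the solution is using the diversity and controllability of synthetic data to compensate for the shortcomings of real datasets, showing the potential of synthetic datasets for training soccer video analysis algorithms.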

Link: https://arxiv.org/abs/2501.09281
Authors: Haobin Qin, Calvin Yeung, Rikuhei Umemoto, Keisuke Fujii
Affiliations: Nagoya University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In soccer video analysis, player detection is essential for identifying key events and reconstructing tactical positions. The presence of numerous players and frequent occlusions, combined with copyright restrictions, severely restricts the availability of datasets, leaving limited options such as SoccerNet-Tracking and SportsMOT. These datasets suffer from a lack of diversity, which hinders algorithms from adapting effectively to varied soccer video contexts. To address these challenges, we developed SoccerSynth-Detection, the first synthetic dataset designed for the detection of synthetic soccer players. It includes a broad range of random lighting and textures, as well as simulated camera motion blur. We validated its efficacy using the object detection model (YOLOv8n) against real-world datasets (SoccerNet-Tracking and SportsMOT). In transfer tests, it matched the performance of real datasets and significantly outperformed them in images with motion blur; in pre-training tests, it demonstrated its efficacy as a pre-training dataset, significantly enhancing the algorithm’s overall performance. Our work demonstrates the potential of synthetic datasets to replace real datasets for algorithm training in the field of soccer video analysis.

[CV-52] Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding CVPR

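Quick read: This paper addresses the limited datasets available for zero-shot 3D classification, where collecting 3D data and annotations is costly and labor-intensive. The authors propose Text-guided Geometric Augmentation (TeGA). Its core is to use a generative text-to-3D model to automatically generate text-guided synthetic 3D data, and to apply a consistency filtering strategy that discards noisy samples whose semantics and geometric shapes do not match the text. Experiments show that TeGA effectively expands limited 3D datasets, yielding zero-shot classification gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40, and thus offers a way to achieve robust zero-shot 3D classification with limited real training data.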

Link: https://arxiv.org/abs/2501.09278
Authors: Kohei Torimi, Ryosuke Yamada, Daichi Otsuka, Kensho Hara, Yuki M. Asano, Hirokatsu Kataoka, Yoshimitsu Aoki
Affiliations: Keio University; National Institute of Advanced Industrial Science and Technology (AIST); University of Tsukuba; TICO-AIST Cooperative Research Laboratory for Advanced Logistics (ALlab); University of Technology Nuremberg; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures, this paper is submitted to CVPR

Abstract:Zero-shot recognition models require extensive training data for generalization. However, in zero-shot 3D classification, collecting 3D data and captions is costly and labor-intensive, posing a significant barrier compared to 2D vision. Recent advances in generative models have achieved unprecedented realism in synthetic data production, and recent research shows the potential for using generated data as training data. This naturally raises the question: can synthetic 3D data generated by generative models be used to expand limited 3D datasets? In response, we present a synthetic 3D dataset expansion method, Text-guided Geometric Augmentation (TeGA). TeGA is tailored for language-image-3D pretraining, which achieves SoTA in zero-shot 3D classification, and uses a generative text-to-3D model to enhance and extend limited 3D datasets. Specifically, we automatically generate text-guided synthetic 3D data and introduce a consistency filtering strategy to discard noisy samples where semantics and geometric shapes do not match with text. In the experiment to double the original dataset size using TeGA, our approach demonstrates improvements over the baselines, achieving zero-shot performance gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40. These results demonstrate that TeGA effectively bridges the 3D data gap, enabling robust zero-shot 3D classification even with limited real training data and paving the way for zero-shot 3D vision application.
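
A consistency filter of the kind described can be sketched as a similarity threshold. This is only an illustration under stated assumptions: the abstract does not specify the scoring function, so cosine similarity between a text embedding and a shape embedding in a shared space is assumed here, and the `threshold` value, the dictionary keys, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def consistency_filter(samples, threshold=0.5):
    """Keep generated samples whose shape embedding agrees with the text
    embedding of the prompt that produced them.

    Each sample is a dict with 'text_emb' and 'shape_emb' vectors
    (hypothetical field names for this sketch).
    """
    return [
        s for s in samples
        if cosine(s["text_emb"], s["shape_emb"]) >= threshold
    ]
```

The threshold trades dataset size against label noise: a higher value keeps fewer but cleaner synthetic shapes.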

[CV-53] Bias for Action: Video Implicit Neural Representations with Bias Modulation

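Quick read: This paper addresses continuous modeling of video sequences and proposes ActINR, a new framework based on implicit neural representations (INRs). The core observation is that an INR can be viewed as a learnable dictionary in which the weights govern the shapes of the basis functions and the biases govern their locations. The authors hypothesize that, with compact non-linear activation functions, an INR's biases are well suited to capturing motion across images, enabling compact representations of video sequences. Concretely, ActINR shares the INR weights across the frames of a video while giving each frame its own biases, and models those biases as the output of a separate INR conditioned on the time index to promote smoothness. By jointly training the video INR and the bias INR, ActINR demonstrates capabilities in video slow motion, spatial super-resolution, denoising, and video inpainting, often with improvements of more than 6 dB, which the authors describe as setting a new standard for continuous video modeling.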

Link: https://arxiv.org/abs/2501.09277
Authors: Alper Kayabasi, Anil Kumar Vadathya, Guha Balakrishnan, Vishwanath Saragadam
Affiliations: University of California Riverside; Rice University; Neal Cancer Center, Houston Methodist Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a new continuous video modeling framework based on implicit neural representations (INRs) called ActINR. At the core of our approach is the observation that INRs can be considered as a learnable dictionary, with the shapes of the basis functions governed by the weights of the INR, and their locations governed by the biases. Given compact non-linear activation functions, we hypothesize that an INR’s biases are suitable to capture motion across images, and facilitate compact representations for video sequences. Using these observations, we design ActINR to share INR weights across frames of a video sequence, while using unique biases for each frame. We further model the biases as the output of a separate INR conditioned on time index to promote smoothness. By training the video INR and this bias INR together, we demonstrate unique capabilities, including 10× video slow motion, 4× spatial super resolution along with 2× slow motion, denoising, and video inpainting. ActINR performs remarkably well across numerous video processing tasks (often achieving more than 6 dB improvement), setting a new standard for continuous modeling of videos.
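
The "weights set the shape, biases set the location" observation can be checked on a one-layer toy. The sketch below is an illustration under assumptions, not ActINR itself: it uses a 1-D signal, a Gaussian bump as the compact activation, and a hand-written bias rule in place of the learned bias INR. All names are hypothetical.

```python
import math


def gaussian_act(z):
    """A compact (localized) activation: a bump centered at z = 0."""
    return math.exp(-z * z)


def inr(x, weights, biases, coeffs):
    """One-layer INR: a weighted sum of bumps. Unit j is centered where
    weights[j] * x + biases[j] == 0, i.e. at x = -biases[j] / weights[j]."""
    return sum(
        c * gaussian_act(w * x + b)
        for w, b, c in zip(weights, biases, coeffs)
    )


def frame_biases(base_biases, weights, shift):
    """Per-frame biases that move every basis function by `shift` along x
    while the weights (the bump shapes) stay shared across frames."""
    return [b - w * shift for b, w in zip(base_biases, weights)]
```

With these definitions, evaluating a frame with shifted biases at `x` gives the same value as evaluating the base frame at `x - shift`, so changing only the biases translates the signal. In ActINR the per-frame biases come from a second, time-conditioned INR rather than a closed-form rule like `frame_biases`.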

[CV-54] Knowledge Distillation for Image Restoration: Simultaneous Learning from Degraded and Clean Images ICASSP2025

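Quick read: This paper addresses the underexplored potential of model compression for image restoration. Knowledge distillation has been applied extensively to classification and segmentation, but its use in image-to-image translation, and in image restoration in particular, remains limited. The authors propose Simultaneous Learning Knowledge Distillation (SLKD), a framework tailored to compressing image restoration models.

The key to the solution is a dual-teacher, single-student architecture that combines two learning strategies: Degradation Removal Learning (DRL) and Image Reconstruction Learning (IRL). In DRL, the student encoder learns from Teacher A to focus on removing degradation factors, guided by a novel BRISQUE extractor. In IRL, the student decoder learns from Teacher B to reconstruct clean images, assisted by a proposed PIQE extractor. Together these strategies let the student learn from degraded and clean images simultaneously, reducing FLOPs and parameters by more than 80% while maintaining strong restoration performance.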

Link: https://arxiv.org/abs/2501.09268
Authors: Yongheng Zhang, Danfeng Yan
Affiliations: State Key Laboratory of Networking and Switching Technology, BUPT (Beijing University of Posts and Telecommunications)
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by ICASSP2025

Abstract:Model compression through knowledge distillation has seen extensive application in classification and segmentation tasks. However, its potential in image-to-image translation, particularly in image restoration, remains underexplored. To address this gap, we propose a Simultaneous Learning Knowledge Distillation (SLKD) framework tailored for model compression in image restoration tasks. SLKD employs a dual-teacher, single-student architecture with two distinct learning strategies: Degradation Removal Learning (DRL) and Image Reconstruction Learning (IRL), simultaneously. In DRL, the student encoder learns from Teacher A to focus on removing degradation factors, guided by a novel BRISQUE extractor. In IRL, the student decoder learns from Teacher B to reconstruct clean images, with the assistance of a proposed PIQE extractor. These strategies enable the student to learn from degraded and clean images simultaneously, ensuring high-quality compression of image restoration models. Experimental results across five datasets and three tasks demonstrate that SLKD achieves substantial reductions in FLOPs and parameters, exceeding 80%, while maintaining strong image restoration performance.

[CV-55] Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites

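Quick read: This paper examines how suitable robotics and computer vision are for monitoring mechanical, electrical, and plumbing (MEP) systems on construction sites. Specifically, it compares open-vocabulary vision-language models with fine-tuned, lightweight, closed-set object detectors for detecting MEP components from a mobile ground robot. The evaluation uses a manually annotated dataset collected with cameras mounted on the robot. The results show that, despite the versatility of vision-language models, fine-tuned lightweight models still largely outperform them in specialized environments and on domain-specific tasks.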

Link: https://arxiv.org/abs/2501.09267
Authors: Abdalwhab Abdalwhab, Ali Imran, Sina Heydarian, Ivanka Iordanova, David St-Onge
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 4 pages, 3 figures

Abstract:The construction industry has long explored robotics and computer vision, yet their deployment on construction sites remains very limited. These technologies have the potential to revolutionize traditional workflows by enhancing accuracy, efficiency, and safety in construction management. Ground robots equipped with advanced vision systems could automate tasks such as monitoring mechanical, electrical, and plumbing (MEP) systems. The present research evaluates the applicability of open-vocabulary vision-language models compared to fine-tuned, lightweight, closed-set object detectors for detecting MEP components using a mobile ground robotic platform. A dataset collected with cameras mounted on a ground robot was manually annotated and analyzed to compare model performance. The results demonstrate that, despite the versatility of vision-language models, fine-tuned lightweight models still largely outperform them in specialized environments and for domain-specific tasks.

[CV-56] OpticFusion: Multi-Modal Neural Implicit 3D Reconstruction of Microstructures by Fusing White Light Interferometry and Optical Microscopy

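Quick read: This paper addresses the inability of conventional White Light Interferometry (WLI) to capture the natural color of a sample's surface, which matters for many microscale research applications that need both 3D geometry and color. The authors propose OpticFusion, which approaches the problem from a computer vision multi-modal reconstruction perspective for the first time, adding a digital optical microscope (OM) to obtain 3D reconstructions with natural color textures. The key steps are a two-step data association process that recovers the poses of the WLI and OM data, and a neural implicit representation that fuses the multi-modal data, combined with color decomposition to extract the sample's natural color. Tested on a multi-modal dataset, OpticFusion produces detailed 3D reconstructions with color textures, offering an effective tool for microscale research.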

Link: https://arxiv.org/abs/2501.09259
Authors: Shuo Chen, Yijin Li, Guofeng Zhang
Affiliations: State Key Lab of CAD&CG, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph); Instrumentation and Detectors (physics.ins-det); Optics (physics.optics)
Comments: 3DV 2025

Abstract:White Light Interferometry (WLI) is a precise optical tool for measuring the 3D topography of microstructures. However, conventional WLI cannot capture the natural color of a sample’s surface, which is essential for many microscale research applications that require both 3D geometry and color information. Previous methods have attempted to overcome this limitation by modifying WLI hardware and analysis software, but these solutions are often costly. In this work, we address this challenge from a computer vision multi-modal reconstruction perspective for the first time. We introduce OpticFusion, a novel approach that uses an additional digital optical microscope (OM) to achieve 3D reconstruction with natural color textures using multi-view WLI and OM images. Our method employs a two-step data association process to obtain the poses of WLI and OM data. By leveraging the neural implicit representation, we fuse multi-modal data and apply color decomposition technology to extract the sample’s natural color. Tested on our multi-modal dataset of various microscale samples, OpticFusion achieves detailed 3D reconstructions with color textures. Our method provides an effective tool for practical applications across numerous microscale research fields. The source code and our real-world dataset are available at this https URL.

[CV-57] Leveraging Scale-aware Representations for improved Concept-Representation Alignment in ViTs

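Quick read: This paper addresses the limited interpretability of Vision Transformers (ViTs) in sensitive vision applications such as medical diagnosis and facial recognition. Most current research designs model-agnostic, plug-and-play concept-based explainability modules that do not take the inner workings of the foundation model (for example, inductive biases and scale invariance) into account during training. The authors propose the Concept Representation Alignment Module (CRAM), which learns scale-aware representations from multi-scale feature pyramids and position-aware representations from patch representations, and aligns them with concept annotations through an attention matrix. CRAM improves the predictive performance of ViT architectures and provides accurate, robust concept explanations on five datasets: three widely used benchmarks (CUB, Pascal APY, Concept-MNIST) and two real-world datasets (AWA2, KITS).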

Link: https://arxiv.org/abs/2501.09221
Authors: Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
Affiliations: University of Virginia
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision Transformers (ViTs) are increasingly being adopted in various sensitive vision applications - like medical diagnosis, facial recognition, etc. To improve the interpretability of such models, many approaches attempt to forward-align them with carefully annotated abstract, human-understandable semantic entities - concepts. Concepts provide global rationales to the model predictions and can be quickly understood/intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play generic concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance, etc.) during training. To alleviate this issue for ViTs, in this paper, we propose a novel Concept Representation Alignment Module (CRAM) which learns both scale and position-aware representations from multi-scale feature pyramids and patch representations respectively. CRAM further aligns these representations with concept annotations through an attention matrix. The proposed CRAM module improves the predictive performance of ViT architectures and also provides accurate and robust concept explanations as demonstrated on five datasets - including three widely used benchmarks (CUB, Pascal APY, Concept-MNIST) and 2 real-world datasets (AWA2, KITS).

[CV-58] Adaptive Law-Based Transformation (ALT): A Lightweight Feature Representation for Time Series Classification

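Quick read: This paper addresses the limitations of traditional time series classification (TSC) methods on complex, variable time series data, where they often fail to capture the key patterns and lose accuracy. The authors propose the adaptive law-based transformation (ALT), an extension of their earlier linear law-based transformation (LLT). ALT's key innovation is variable-length shifted time windows, which let it capture distinguishing patterns of different lengths and handle complex time series more effectively. By mapping features into a linearly separable space, ALT offers a fast, robust, and transparent solution that achieves state-of-the-art classification performance with only a few hyperparameters.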

Link: https://arxiv.org/abs/2501.09217
Authors: Marcell T. Kurbucz, Balázs Hajós, Balázs P. Halmos, Vince Á. Molnár, Antal Jakovác
Affiliations: Wigner Research Centre for Physics; Corvinus University of Budapest; Eötvös Loránd University; Tampere University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 8 pages, 1 figure, 5 tables

Abstract:Time series classification (TSC) is fundamental in numerous domains, including finance, healthcare, and environmental monitoring. However, traditional TSC methods often struggle with the inherent complexity and variability of time series data. Building on our previous work with the linear law-based transformation (LLT) - which improved classification accuracy by transforming the feature space based on key data patterns - we introduce adaptive law-based transformation (ALT). ALT enhances LLT by incorporating variable-length shifted time windows, enabling it to capture distinguishing patterns of various lengths and thereby handle complex time series more effectively. By mapping features into a linearly separable space, ALT provides a fast, robust, and transparent solution that achieves state-of-the-art performance with only a few hyperparameters.
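
The windowing step that ALT adds can be illustrated in isolation. The sketch below only shows how variable-length shifted windows might be cut from a series; it does not implement the law-based transformation itself, and the function name and the `lengths` and `shift` parameters are hypothetical.

```python
def shifted_windows(series, lengths, shift):
    """Cut a series into windows of several lengths.

    For each window length n, consecutive windows start `shift` steps
    apart; windows that would run past the end of the series are dropped.
    Returns a dict mapping each length to its list of windows.
    """
    if shift < 1:
        raise ValueError("shift must be a positive integer")
    windows = {}
    for n in lengths:
        windows[n] = [
            series[start:start + n]
            for start in range(0, len(series) - n + 1, shift)
        ]
    return windows
```

Short windows expose brief, local patterns and long windows expose slower ones, which is the intuition behind using several lengths at once.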

[CV-59] Surgical Visual Understanding (SurgVU) Dataset

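Quick read: This paper aims to support foundational research in surgical data science, particularly in the context of machine learning and robotic-assisted surgery. Its contribution is a large dataset of surgical videos, collected during robotic-assisted surgeries, together with the accompanying labels. The dataset was curated for a particular set of scientific challenges (such as tool detection) but is general enough for a broad range of machine learning questions. By releasing the data, the authors hope to expose the machine learning community to the challenging problems in surgical data science and to provide a touchstone for future research. The release includes the videos, the labels, and a validation set for the tool detection problem, available through the URLs given in the abstract.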

Link: https://arxiv.org/abs/2501.09209
Authors: Aneeq Zia, Max Berniker, Rogerio Nespolo, Conor Perreault, Ziheng Wang, Benjamin Mueller, Ryan Schmidt, Kiran Bhattacharyya, Xi Liu, Anthony Jarc
Affiliations: Intuitive Surgical, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Owing to recent advances in machine learning and the ability to harvest large amounts of data during robotic-assisted surgeries, surgical data science is ripe for foundational work. We present a large dataset of surgical videos and their accompanying labels for this purpose. We describe how the data was collected and some of its unique attributes. Multiple example problems are outlined. Although the dataset was curated for a particular set of scientific challenges (in an accompanying paper), it is general enough to be used for a broad range of machine learning questions. Our hope is that this dataset exposes the larger machine learning community to the challenging problems within surgical data science, and becomes a touchstone for future research. The videos are available at this https URL, the labels at this https URL, and a validation set for the tool detection problem at this https URL.

[CV-60] Unified Few-shot Crack Segmentation and its Precise 3D Automatic Measurement in Concrete Structures

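Quick read: This paper addresses shortcomings of existing concrete crack inspection methods: poor adaptability to diverse scenarios, limited robustness of image-based approaches, and difficulty with complex geometries. It proposes a framework that combines computer vision with multi-modal simultaneous localization and mapping (SLAM) to perform 2D crack detection, 3D reconstruction, and automatic 3D crack measurement. The solution has three key parts. First, a crack segmentation method built on a DeepLabv3+ model and refined with the Segment Anything Model (SAM) generalizes well and produces precise 2D crack masks. Second, LiDAR point clouds are combined with image data through image- and LiDAR-SLAM in a multi-frame, multi-modal fusion framework that produces dense, colorized point clouds, capturing crack semantics at real-world 3D scale. Third, crack geometric attributes are measured automatically and directly in the dense 3D point cloud, overcoming the limitations of conventional 2D image-based measurement and making the method suitable for structural components with complex 3D geometries. Experiments indicate that the method is effective, accurate, and robust in real-world applications.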

Link: https://arxiv.org/abs/2501.09203
Authors: Pengru Deng, Jiapeng Yao, Chun Li, Su Wang, Xinrun Li, Varun Ojha, Xuhui He, Takashi Matsumoto
Affiliations: Central South University; Hunan Provincial Key Laboratory for Disaster Prevention and Mitigation of Rail Transit Engineering Structures; Carizon; Nvidia; Newcastle University; Hokkaido University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Visual-Spatial Systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement by integrating computer vision technologies and multi-modal simultaneous localization and mapping (SLAM). Firstly, building on a base DeepLabv3+ segmentation model, and incorporating specific refinements utilizing the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were utilized together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the crack geometric attributes were measured automatically and directly within 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.

[CV-61] Grounding Text-To-Image Diffusion Models For Controlled High-Quality Image Generation

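Quick read: This paper addresses how to improve control over text-to-image (T2I) diffusion models by adding bounding box conditioning. Existing layout-to-image models can control generation with layouts such as segmentation maps, edges, and human keypoints, but remain limited on complex scenes. The authors propose ObjectDiffusion, which combines the network architecture of ControlNet with the condition processing and injection techniques of GLIGEN, is initialized from pretrained parameters, and is fine-tuned on COCO2017. On the COCO2017 validation set, ObjectDiffusion outperforms the current state-of-the-art model trained on open-source datasets on all three metrics (AP_50, AR, and FID), and generates diverse, high-quality, high-fidelity images that conform closely to the semantic and spatial control layout.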

Link: https://arxiv.org/abs/2501.09194
Authors: Ahmad Süleyman, Göksel Biricik
Affiliations: Turkish-German University; Yıldız Technical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large-scale text-to-image (T2I) diffusion models have demonstrated an outstanding performance in synthesizing diverse high-quality visuals from natural language text captions. Multiple layout-to-image models have been developed to control the generation process by utilizing a broad array of layouts such as segmentation maps, edges, and human keypoints. In this work, we present ObjectDiffusion, a model that takes inspiration from the top cutting-edge image generative frameworks to seamlessly condition T2I models with new bounding box capabilities. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pretraining parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP_50 of 46.6, an AR of 44.5, and a FID of 19.8, outperforming the current SOTA model trained on open-source datasets in all of the three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities on closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple objects of different sizes and locations.

[CV-62] Patch-aware Vector Quantized Codebook Learning for Unsupervised Visual Defect Detection ICTAI2024

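Quick read: This paper addresses a key problem in unsupervised visual defect detection for industrial applications: building a representation space that captures the features of normal data while still detecting anomalies. The main challenge is balancing expressiveness and compactness, because an overly expressive space risks inefficiency and mode collapse, which hurts detection accuracy. The authors propose a method based on an enhanced VQ-VAE (Vector Quantized Variational Autoencoder) framework with a patch-aware dynamic code assignment scheme, which allocates codes in a context-sensitive way to optimize the spatial representation. This improves the separation between normal and defective data and raises detection accuracy at inference. Experiments show state-of-the-art performance on the MVTecAD, BTAD, and MTSD datasets.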

Link: https://arxiv.org/abs/2501.09187
Authors: Qisen Cheng, Shuhui Qu, Janghwan Lee
Affiliations: Samsung Display America Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, Accepted to 36th IEEE ICTAI 2024

Abstract:Unsupervised visual defect detection is critical in industrial applications, requiring a representation space that captures normal data features while detecting deviations. Achieving a balance between expressiveness and compactness is challenging; an overly expressive space risks inefficiency and mode collapse, impairing detection accuracy. We propose a novel approach using an enhanced VQ-VAE framework optimized for unsupervised defect detection. Our model introduces a patch-aware dynamic code assignment scheme, enabling context-sensitive code allocation to optimize spatial representation. This strategy enhances normal-defect distinction and improves detection accuracy during inference. Experiments on MVTecAD, BTAD, and MTSD datasets show our method achieves state-of-the-art performance.
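
The abstract does not spell out the patch-aware assignment rule, so the following is only a minimal sketch of the standard VQ-VAE quantization step it builds on: each patch feature is replaced by its nearest codebook vector. The function names and the toy codebook are hypothetical, and treating the quantization distance as a defect score is an assumption made for illustration, not the authors' method.

```python
import math

def nearest_code(feature, codebook):
    """Return (index, distance) of the codebook vector closest to `feature`."""
    best_idx, best_dist = -1, math.inf
    for idx, code in enumerate(codebook):
        dist = math.dist(feature, code)  # Euclidean distance
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist

def quantize_patches(patch_features, codebook):
    """Quantize every patch feature; return code indices and per-patch errors."""
    indices, errors = [], []
    for feature in patch_features:
        idx, dist = nearest_code(feature, codebook)
        indices.append(idx)
        errors.append(dist)
    return indices, errors

# Toy example (hypothetical values): a codebook fitted to "normal" patches.
codebook = [(0.0, 0.0), (1.0, 1.0)]
patches = [(0.1, 0.0), (0.9, 1.0), (5.0, 5.0)]  # last patch is far from all codes
indices, errors = quantize_patches(patches, codebook)
```

In this toy setting the third patch has a much larger quantization error than the first two, which is the kind of signal a codebook-based detector could threshold.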

[CV-63] Embodied Scene Understanding for Vision Language Models via MetaVQA

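[Quick Read]: This paper addresses the lack of a standardized closed-loop benchmark for evaluating the spatial reasoning and sequential decision-making of Vision Language Models (VLMs) in mobility applications. The authors present MetaVQA, a comprehensive benchmark that assesses and enhances VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulation. MetaVQA uses Set-of-Mark prompting and top-down-view ground-truth annotations from the nuScenes and Waymo datasets to automatically generate question-answer pairs from diverse real-world traffic scenarios, keeping the instructions object-centric and context-rich. Experiments show that fine-tuning VLMs on the MetaVQA dataset significantly improves their spatial reasoning and scene understanding in safety-critical simulations, as seen both in higher VQA accuracy and in emerging safety-aware driving maneuvers. The learned behavior also transfers strongly from simulation to real-world observations.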

Link: https://arxiv.org/abs/2501.09167
Authors: Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, Bolei Zhou
Affiliations: University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: for the project webpage, see this https URL

Abstract:Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs’ understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene comprehension in safety-critical simulations, evident not only in improved VQA accuracies but also in emerging safety-aware driving maneuvers. In addition, the learning demonstrates strong transferability from simulation to real-world observation. Code and data will be publicly available at this https URL.

[CV-64] A Vessel Bifurcation Landmark Pair Dataset for Abdominal CT Deformable Image Registration (DIR) Validation

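[Quick Read]: This paper addresses the limited clinical use of deformable image registration (DIR) algorithms for abdominal CT, which stems largely from the lack of benchmark datasets for quality assurance during algorithm development. The authors introduce the first abdominal CT DIR benchmark dataset, which contains a large number of highly accurate landmark pairs on vessel bifurcations to support future algorithm development and validation. The key steps of the workflow are: 1) segment the abdominal organs with a deep learning model and overwrite the image intensity inside the organ masks; 2) manually identify matching image patches between the two CTs of each pair; 3) label vessel bifurcation landmarks in each patch pair; 4) deformably register the patches and project the landmarks onto the second image; 5) refine the landmark pair locations manually or automatically. The resulting dataset contains 1895 landmark pairs, 63 per case on average, with an estimated accuracy of 0.7 ± 1.2 mm. Its release provides a resource for validating abdominal DIR algorithms with greater precision than was previously possible.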

Link: https://arxiv.org/abs/2501.09162
Authors: Edward R Criscuolo, Yao Hao, Zhendong Zhang, Trevor McKeown, Deshan Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 19 pages, 3 figures

Abstract:Deformable image registration (DIR) is an enabling technology in many diagnostic and therapeutic tasks. Despite this, DIR algorithms have limited clinical use, largely due to a lack of benchmark datasets for quality assurance during development. To support future algorithm development, here we introduce our first-of-its-kind abdominal CT DIR benchmark dataset, comprising large numbers of highly accurate landmark pairs on matching blood vessel bifurcations. Abdominal CT image pairs of 30 patients were acquired from several public repositories as well as the authors’ institution with IRB approval. The two CTs of each pair were originally acquired for the same patient on different days. An image processing workflow was developed and applied to each image pair: 1) Abdominal organs were segmented with a deep learning model, and image intensity within organ masks was overwritten. 2) Matching image patches were manually identified between the two CTs of each image pair. 3) Vessel bifurcation landmarks were labeled on one image of each image patch pair. 4) Image patches were deformably registered, and landmarks were projected onto the second image. 5) Landmark pair locations were refined manually or with an automated process. This workflow resulted in 1895 total landmark pairs, or 63 per case on average. Estimates of the landmark pair accuracy using digital phantoms were 0.7 ± 1.2 mm. The data is published in Zenodo at this https URL. Instructions for use can be found at this https URL. This dataset is the first of its kind for abdominal DIR validation. The number, accuracy, and distribution of landmark pairs will allow for robust validation of DIR algorithms with precision beyond what is currently available.
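
Landmark pairs like these are commonly used to compute a target registration error (TRE): map each landmark from the first image with the registration under test and measure its distance to the matching landmark in the second image. The sketch below assumes coordinates in millimetres and a hypothetical `estimated_transform` function standing in for a DIR algorithm; it does not use the dataset's actual file format.

```python
import math
import statistics

def target_registration_error(landmark_pairs, transform):
    """Per-landmark distance (mm) between the mapped point and its true match."""
    return [math.dist(transform(p_first), p_second)
            for p_first, p_second in landmark_pairs]

def summarize(errors):
    """Mean and sample standard deviation of the errors."""
    return statistics.mean(errors), statistics.stdev(errors)

# Hypothetical example: the true motion is a 2 mm shift along x, and the
# registration under test recovers only 1.5 mm of it.
pairs = [((10.0, 20.0, 30.0), (12.0, 20.0, 30.0)),
         ((40.0, 50.0, 60.0), (42.0, 50.0, 60.0)),
         ((70.0, 80.0, 90.0), (72.0, 80.0, 90.0))]

def estimated_transform(point):
    x, y, z = point
    return (x + 1.5, y, z)

errors = target_registration_error(pairs, estimated_transform)
mean_tre, std_tre = summarize(errors)
```

A TRE is only meaningful when it is larger than the uncertainty of the landmarks themselves, which is why the reported landmark accuracy matters for validation.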

[CV-65] Few-Shot Adaptation of Training-Free Foundation Model for 3D Medical Image Segmentation

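[Quick Read]: This paper addresses the difficulty of applying existing foundation models such as the Segment Anything Model (SAM) to medical image segmentation: because they are trained mainly on natural images, they lack medical domain expertise. The specific challenges are the need for extensive fine-tuning on specialized medical datasets and the dependence on manual prompts, both of which are labor-intensive and require intervention from medical experts.

The proposed solution is Few-shot Adaptation of Training-frEe SAM (FATE-SAM). It reassembles the pre-trained modules of SAM2 to enable few-shot adaptation, using a small number of support examples to capture anatomical knowledge and perform prompt-free segmentation without fine-tuning the model. FATE-SAM also introduces a Volumetric Consistency mechanism that enhances spatial coherence across 3D slices and so handles the volumetric nature of medical images better. Experiments on multiple medical imaging datasets show that FATE-SAM delivers robust and accurate segmentation without relying on large annotated datasets or expert intervention.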

Link: https://arxiv.org/abs/2501.09138
Authors: Xingxin He, Yifan Hu, Zhaoye Zhou, Mohamed Jarraya, Fang Liu
Affiliations: Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital; Harvard Medical School; Department of Radiology, Massachusetts General Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision foundation models have achieved remarkable progress across various image analysis tasks. In the image segmentation task, foundation models like the Segment Anything Model (SAM) enable generalizable zero-shot segmentation through user-provided prompts. However, SAM, primarily trained on natural images, lacks the domain-specific expertise of medical imaging. This limitation poses challenges when applying SAM to medical image segmentation, including the need for extensive fine-tuning on specialized medical datasets and a dependency on manual prompts, which are both labor-intensive and require intervention from medical experts. This work introduces the Few-shot Adaptation of Training-frEe SAM (FATE-SAM), a novel method designed to adapt the advanced Segment Anything Model 2 (SAM2) for 3D medical image segmentation. FATE-SAM reassembles pre-trained modules of SAM2 to enable few-shot adaptation, leveraging a small number of support examples to capture anatomical knowledge and perform prompt-free segmentation, without requiring model fine-tuning. To handle the volumetric nature of medical images, we incorporate a Volumetric Consistency mechanism that enhances spatial coherence across 3D slices. We evaluate FATE-SAM on multiple medical imaging datasets and compare it with supervised learning methods, zero-shot SAM approaches, and fine-tuned medical SAM methods. Results show that FATE-SAM delivers robust and accurate segmentation while eliminating the need for large annotated datasets and expert intervention. FATE-SAM provides a practical, efficient solution for medical image segmentation, making it more accessible for clinical applications.

[CV-66] Deep Self-Supervised Disturbance Mapping with the OPERA Sentinel-1 Radiometric Terrain Corrected SAR Backscatter Product

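[Quick Read]: This paper addresses two main challenges in mapping land surface disturbances with Synthetic Aperture Radar (SAR) data: processing SAR data is complex and compute-intensive, especially for global-scale analysis, and labeling SAR data is time-consuming and costly. The authors propose a method based on a self-supervised vision transformer, which needs no labels to train. Using the near-global Radiometric Terrain Corrected SAR backscatter (RTC-S1) dataset released by NASA's OPERA project, the model estimates a per-pixel distribution from a set of baseline imagery and flags a disturbance when an observation deviates significantly from that distribution. The method was tested on three natural disasters in three different regions of the world and delineates disturbed areas well, with F1 scores above 0.6 and areas under the precision-recall curve above 0.65, outperforming existing SAR disturbance mapping methods. The key to the solution is the self-supervised vision transformer, which makes near-global disturbance monitoring possible without labeled data.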

Link: https://arxiv.org/abs/2501.09129
Authors: Harris Hardiman-Mostow, Charles Marshak, Alexander L. Handwerger
Affiliations: University of California - Los Angeles; Jet Propulsion Laboratory, California Institute of Technology; Joint Institute for Regional Earth System Science and Engineering, University of California - Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 19 pages, 18 figures, 5 tables. Preprint. Submitted to JSTARS

Abstract:Mapping land surface disturbances supports disaster response, resource and ecosystem management, and climate adaptation efforts. Synthetic aperture radar (SAR) is an invaluable tool for disturbance mapping, providing consistent time-series images of the ground regardless of weather or illumination conditions. Despite SAR’s potential for disturbance mapping, processing SAR data to an analysis-ready format requires expertise and significant compute resources, particularly for large-scale global analysis. In October 2023, NASA’s Observational Products for End-Users from Remote Sensing Analysis (OPERA) project released the near-global Radiometric Terrain Corrected SAR backscatter from Sentinel-1 (RTC-S1) dataset, providing publicly available, analysis-ready SAR imagery. In this work, we utilize this new dataset to systematically analyze land surface disturbances. As labeling SAR data is often prohibitively time-consuming, we train a self-supervised vision transformer - which requires no labels to train - on OPERA RTC-S1 data to estimate a per-pixel distribution from the set of baseline imagery and assess disturbances when there is significant deviation from the modeled distribution. To test our model’s capability and generality, we evaluate three different natural disasters - which represent high-intensity, abrupt disturbances - from three different regions of the world. Across events, our approach yields high quality delineations: F1 scores exceeding 0.6 and Areas Under the Precision-Recall Curve exceeding 0.65, consistently outperforming existing SAR disturbance methods. Our findings suggest that a self-supervised vision transformer is well-suited for global disturbance mapping and can be a valuable tool for operational, near-global disturbance monitoring, particularly when labeled data does not exist.
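
To make the "deviation from a modeled per-pixel distribution" step concrete, here is a deliberately simplified sketch. It swaps the paper's self-supervised vision transformer for a plain per-pixel Gaussian (mean and standard deviation over the baseline stack) and flags pixels by z-score. The threshold, function names, and backscatter values are all hypothetical.

```python
import statistics

def disturbance_mask(baseline_stack, new_image, z_threshold=3.0, min_std=1e-6):
    """Flag pixels whose new value deviates strongly from the baseline series.

    baseline_stack: list of co-registered images, each a list of rows.
    new_image: one image on the same grid, acquired after the baseline period.
    """
    rows, cols = len(new_image), len(new_image[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            series = [image[r][c] for image in baseline_stack]
            mean = statistics.mean(series)
            std = max(statistics.stdev(series), min_std)  # guard against zero spread
            mask[r][c] = abs(new_image[r][c] - mean) / std > z_threshold
    return mask

def f1_score(predicted, truth):
    """F1 score of a predicted boolean mask against a reference mask."""
    pairs = [(p, t) for p_row, t_row in zip(predicted, truth)
             for p, t in zip(p_row, t_row)]
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0

# Hypothetical backscatter values (dB) for a 1x2 image and three baseline dates.
baseline = [[[-10.0, -12.0]], [[-10.5, -12.5]], [[-9.5, -11.5]]]
after_event = [[-10.2, -18.0]]
mask = disturbance_mask(baseline, after_event)
score = f1_score(mask, [[False, True]])
```

A learned model can condition each pixel's distribution on spatial and temporal context, which this independent per-pixel baseline cannot do.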

[CV-67] Generative Medical Image Anonymization Based on Latent Code Projection and Optimization

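[Quick Read]: This paper addresses medical image anonymization: protecting patient privacy while preserving the utility of the data for downstream tasks. The key is a two-stage approach, latent code projection followed by optimization. In the projection stage, the authors design a streamlined encoder that projects input images into a latent space, and they propose a co-training scheme to enhance the projection. In the optimization stage, the latent code is refined with two deep loss functions designed to handle the trade-off between identity protection and the utility of medical image data. Through qualitative and quantitative experiments on the MIMIC-CXR chest X-ray dataset, the authors show that the method generates anonymized synthetic images that can be used to train lung pathology detection models.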

Link: https://arxiv.org/abs/2501.09114
Authors: Huiyu Li, Nicholas Ayache, Hervé Delingette
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Conference

Abstract:Medical image anonymization aims to protect patient privacy by removing identifying information, while preserving the data utility to solve downstream tasks. In this paper, we address the medical image anonymization problem with a two-stage solution: latent code projection and optimization. In the projection stage, we design a streamlined encoder to project input images into a latent space and propose a co-training scheme to enhance the projection process. In the optimization stage, we refine the latent code using two deep loss functions designed to address the trade-off between identity protection and data utility dedicated to medical images. Through a comprehensive set of qualitative and quantitative experiments, we showcase the effectiveness of our approach on the MIMIC-CXR chest X-ray dataset by generating anonymized synthetic images that can serve as a training set for detecting lung pathologies. Source code is available at this https URL.

[CV-68] Salient Information Preserving Adversarial Training Improves Clean and Robust Accuracy

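[Quick Read]: This paper addresses the robustness-accuracy trade-off of traditional adversarial training, which tends to improve robustness to adversarial examples at the cost of accuracy on clean samples. The proposed solution is Salient Information Preserving Adversarial Training (SIP-AT). Its key idea is to use salient image regions to guide adversarial training so that fragile features an annotator considers meaningful are left unperturbed during training. The model can then learn highly predictive non-robust features without sacrificing overall robustness. The method works with both human-annotated and automatically generated salience estimates, so it can be used in human-driven model development without depending on additional human data. Experiments show that SIP-AT improves clean accuracy across multiple datasets and architectures while remaining highly robust to attacks at multiple epsilon levels.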

Link: https://arxiv.org/abs/2501.09086
Authors: Timothy Redgrave, Adam Czajka
Affiliations: University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work we introduce Salient Information Preserving Adversarial Training (SIP-AT), an intuitive method for relieving the robustness-accuracy trade-off incurred by traditional adversarial training. SIP-AT uses salient image regions to guide the adversarial training process in such a way that fragile features deemed meaningful by an annotator remain unperturbed during training, allowing models to learn highly predictive non-robust features without sacrificing overall robustness. This technique is compatible with both human-based and automatically generated salience estimates, allowing SIP-AT to be used as a part of human-driven model development without forcing SIP-AT to be reliant upon additional human data. We perform experiments across multiple datasets and architectures and demonstrate that SIP-AT is able to boost the clean accuracy of models while maintaining a high degree of robustness against attacks at multiple epsilon levels. We complement our central experiments with an observational study measuring the rate at which human subjects successfully identify perturbed images. This study helps build a more intuitive understanding of adversarial attack strength and demonstrates the heightened importance of low-epsilon robustness. Our results demonstrate the efficacy of SIP-AT and provide valuable insight into the risks posed by adversarial samples of various strengths.
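
The core idea, leaving salient regions unperturbed while attacking the rest of the image, can be sketched as a masked perturbation step. The example below assumes a binary salience mask, an FGSM-style sign step, and a precomputed loss gradient. These are illustrative assumptions; the attack and masking rule actually used in SIP-AT may differ.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)

def masked_fgsm(image, loss_grad, salience_mask, epsilon):
    """Apply an FGSM-style step only to non-salient pixels, then clip to [0, 1].

    image, loss_grad, salience_mask: 2D lists with identical shapes.
    salience_mask: 1 marks a salient pixel that must stay unperturbed.
    """
    adversarial = []
    for img_row, grad_row, mask_row in zip(image, loss_grad, salience_mask):
        row = []
        for pixel, grad, salient in zip(img_row, grad_row, mask_row):
            step = 0.0 if salient else epsilon * sign(grad)
            row.append(min(1.0, max(0.0, pixel + step)))
        adversarial.append(row)
    return adversarial

# Hypothetical 2x2 grayscale image, loss gradient, and salience mask.
image = [[0.5, 0.5], [0.5, 0.5]]
loss_grad = [[0.3, -0.2], [0.1, -0.4]]
salience_mask = [[1, 0], [0, 1]]
adv = masked_fgsm(image, loss_grad, salience_mask, epsilon=0.25)
```

Training on such examples exposes the model to perturbations everywhere except the regions marked as salient, which is the behavior the abstract describes.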

[CV-69] SHYI: Action Support for Contrastive Learning in High-Fidelity Text-to-Image Generation

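[Quick Read]: This paper addresses inaccurate depictions of actions involving multiple objects in text-to-image generation. The existing CONFORM framework uses Contrastive Learning to improve the accuracy of generated images containing multiple objects, but the depiction of actions that involve several different objects still leaves considerable room for improvement. The authors propose semantically hypergraphic contrastive adjacency learning, which combines an enhanced contrastive structure with a “contrast but link” technique, and they further amend Stable Diffusion's understanding of actions with InteractDiffusion. The evaluation uses image-text similarity (CLIP) and TIFA as metrics, together with a user study. The method shows promising results even on verbs that Stable Diffusion understands poorly.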

Link: https://arxiv.org/abs/2501.09055
Authors: Tianxiang Xia, Lin Xiao, Yannick Montorfani, Francesco Pavia, Enis Simsar, Thomas Hofmann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Main content 4 pages

Abstract:In this project, we address the issue of infidelity in text-to-image generation, particularly for actions involving multiple objects. We build on the CONFORM framework, which uses Contrastive Learning to improve the accuracy of the generated image for multiple objects. However, the depiction of actions that involve multiple different objects still has large room for improvement. To improve it, we employ semantically hypergraphic contrastive adjacency learning, a comprehension of enhanced contrastive structure and the “contrast but link” technique. We further amend Stable Diffusion’s understanding of actions with InteractDiffusion. As evaluation metrics we use the image-text similarity measures CLIP and TIFA. In addition, we conducted a user study. Our method shows promising results even with verbs that Stable Diffusion understands only moderately well. We then provide future directions by analyzing the results. Our codebase can be found on polybox under the link: this https URL

[CV-70] Polyp detection in colonoscopy images using YOLOv11

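[Quick Read]: This paper examines how effective the recently released YOLOv11 model is at detecting colorectal polyps. Colorectal cancer (CRC) is one of the most common cancers worldwide, and early polyp detection is essential to preventing it. Conventional colonoscopy relies on experts manually analyzing endoscopic images; with the rise of machine learning, deep learning models have proven more effective at polyp detection, mainly because they generalize better and learn small features. The paper evaluates the five YOLOv11 variants (YOLO11n, YOLO11s, YOLO11m, YOLO11l, YOLO11x) on the Kvasir dataset. The key to the approach is using YOLOv11, a single-stage object detector whose low inference time suits fast detection. The study also compares the original dataset with an augmented version to analyze how the models perform under different data conditions.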

Link: https://arxiv.org/abs/2501.09051
Authors: Alok Ranjan Sahoo, Satya Sangram Sahoo, Pavan Chakraborty
Affiliations: Department of CSIT, ITER, SOA University; Department of CSIT, SOA University; Department of IT, IIIT Allahabad
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Colorectal cancer (CRC) is one of the most commonly diagnosed cancers worldwide. It starts as a polyp in the inner lining of the colon. To prevent CRC, early polyp detection is required. Colonoscopy is used for the inspection of the colon. Generally, the images taken by the camera placed at the tip of the endoscope are analyzed manually by experts. With the rise of machine learning, various traditional machine learning models have been used. Recently, deep learning models have shown more effectiveness in polyp detection due to their superiority in generalizing and learning small features. These deep learning models for object detection can be divided into two types: single-stage and two-stage. Generally, two-stage models have higher accuracy than single-stage ones, but single-stage models have lower inference time. Hence, single-stage models are well suited to quick object detection. YOLO is one of the single-stage models used successfully for polyp detection. It has drawn the attention of researchers because of its lower inference time. Researchers have used different versions of YOLO so far, and with each newer version the accuracy of the model has increased. This paper aims to assess the effectiveness of the recently released YOLOv11 in detecting polyps. We analyzed the performance of all five YOLOv11 models (YOLO11n, YOLO11s, YOLO11m, YOLO11l, YOLO11x) with the Kvasir dataset for training and testing. Two different versions of the dataset were used. The first consisted of the original dataset, and the other was created using augmentation techniques. The performance of all the models on these two versions of the dataset has been analyzed.

[CV-71] Generating Realistic Synthetic Head Rotation Data for Extended Reality using Deep Learning

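[Quick Read]: This paper addresses the high cost of collecting the large amounts of orientational input data needed to predict user head rotations in Extended Reality (XR) systems. Because motion caused by head rotation strongly affects which content is generated and transmitted, such systems need accurate predictions of upcoming rotations, yet gathering the data normally requires many human test subjects. The proposed solution is a head rotation time series generator based on a time-series generative adversarial network (TimeGAN), which extends a limited measured dataset with synthetic data. The generator produces new samples that closely match the distribution of the measured time series, which lowers data collection costs while supplying enough data to train and evaluate prediction models.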

Link: https://arxiv.org/abs/2501.09050
Authors: Jakob Struye, Filip Lemic, Jeroen Famaey
Affiliations: University of Antwerp - imec; Universitat Politècnica de Catalunya
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published and presented at International Conference on Multimedia 2022 (ACMMM), Workshop on Interactive eXtended Reality (IXR)

Abstract:Extended Reality is a revolutionary method of delivering multimedia content to users. A large contributor to its popularity is the sense of immersion and interactivity enabled by having real-world motion reflected in the virtual experience accurately and immediately. This user motion, mainly caused by head rotations, induces several technical challenges. For instance, which content is generated and transmitted depends heavily on where the user is looking. Seamless systems, taking user motion into account proactively, will therefore require accurate predictions of upcoming rotations. Training and evaluating such predictors requires vast amounts of orientational input data, which is expensive to gather, as it requires human test subjects. A more feasible approach is to gather a modest dataset through test subjects, and then extend it to a more sizeable set using synthetic data generation methods. In this work, we present a head rotation time series generator based on TimeGAN, an extension of the well-known Generative Adversarial Network, designed specifically for generating time series. This approach is able to extend a dataset of head rotations with new samples closely matching the distribution of the measured time series.

[CV-72] Anthropomorphic Features for On-Line Signatures

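[Quick Read]: This paper addresses the efficiency and robustness of feature extraction in on-line signature verification. Conventional features rely mainly on the position and dynamic properties of signature samples as recorded by a digital tablet, which may not fully capture the upper-limb motion involved in signing. The paper therefore proposes a new feature space based on a Virtual Skeletal Arm (VSA) model, which simulates the structure of a real arm and forearm and describes the motion of the shoulder, elbow, and wrist joints during signing. Specifically, the VSA motion is described by 3D joint positions and joint angles, which are computed from the pen position and orientation through the VSA forward and direct kinematic models. The novelty lies in introducing anthropomorphic features, whose robustness is confirmed by state-of-the-art performance on several third-party signature databases.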

Link: https://arxiv.org/abs/2501.09048
Authors: Moises Diaz, Miguel A. Ferrer, Jose J. Quintana
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Many features have been proposed in on-line signature verification. Generally, these features rely on the position of the on-line signature samples and their dynamic properties, as recorded by a tablet. This paper proposes a novel feature space to describe on-line signatures efficiently. Since producing a signature requires a skeletal arm system and its associated muscles, the new feature space is based on characterizing the movement of the shoulder, the elbow and the wrist joints when signing. As this motion is not directly obtained from a digital tablet, the new features are calculated by means of a virtual skeletal arm (VSA) model, which simulates the architecture of a real arm and forearm. Specifically, the VSA motion is described by its 3D joint positions and its joint angles. These anthropomorphic features are worked out from both pen position and orientation through the VSA forward and direct kinematic model. The robustness of the anthropomorphic features is demonstrated by achieving state-of-the-art performance with several verifiers and multiple benchmarks on third-party signature databases, which were collected with different devices and in different languages and scripts.
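
As a rough illustration of what a forward kinematic model computes, the sketch below uses a planar two-link arm: given a shoulder angle and an elbow angle, it returns the elbow and wrist positions. This is a strong simplification of the paper's 3D VSA model, and the link lengths and function name are hypothetical.

```python
import math

def forward_kinematics(shoulder_angle, elbow_angle, upper_arm=0.30, forearm=0.25):
    """Return (elbow_xy, wrist_xy) for a planar two-link arm rooted at the origin.

    Angles are in radians; elbow_angle is measured relative to the upper arm.
    Link lengths are in metres.
    """
    elbow = (upper_arm * math.cos(shoulder_angle),
             upper_arm * math.sin(shoulder_angle))
    wrist = (elbow[0] + forearm * math.cos(shoulder_angle + elbow_angle),
             elbow[1] + forearm * math.sin(shoulder_angle + elbow_angle))
    return elbow, wrist

# Upper arm along the x-axis, forearm bent 90 degrees counter-clockwise.
elbow, wrist = forward_kinematics(0.0, math.pi / 2)
```

Recovering joint angles from a recorded pen trajectory is the harder, inverse direction, and a 3D arm additionally needs the pen orientation to resolve ambiguities.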

[CV-73] Spatio-Temporal Foundation Models: Vision Challenges and Opportunities

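[Quick Read]: This paper addresses the fact that Spatio-Temporal Foundation Models (STFMs) have not yet matched the success of foundation models for vision and language tasks, despite the importance of spatio-temporal data in critical domains such as transportation, public health, and environmental monitoring. It sets out a vision for the future of STFMs, identifying their essential characteristics and the generalization capabilities needed for broad applicability. The key contribution is to identify the gaps in current research and, for those gaps, to highlight key challenges and potential research directions that would advance effective and broadly applicable STFMs.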

Link: https://arxiv.org/abs/2501.09045
Authors: Adam Goodge, Wee Siong Ng, Bryan Hooi, See Kiong Ng
Affiliations: Institute for Infocomm Research; Agency for Science, Technology and Research (A*STAR); School of Computing, National University of Singapore; Institute of Data Science, National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:Foundation models have revolutionized artificial intelligence, setting new benchmarks in performance and enabling transformative capabilities across a wide range of vision and language tasks. However, despite the prevalence of spatio-temporal data in critical domains such as transportation, public health, and environmental monitoring, spatio-temporal foundation models (STFMs) have not yet achieved comparable success. In this paper, we articulate a vision for the future of STFMs, outlining their essential characteristics and the generalization capabilities necessary for broad applicability. We critically assess the current state of research, identifying gaps relative to these ideal traits, and highlight key challenges that impede their progress. Finally, we explore potential opportunities and directions to advance research towards the aim of effective and broadly applicable STFMs.

[CV-74] TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive Learning for Unsupervised Person Re-identification

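[Quick Read]: This paper addresses two key problems in unsupervised person re-identification: 1) the patch noise introduced when ViT (Vision Transformer) processes images through patch embedding; and 2) the feature inconsistency that existing memory-bank-based contrastive learning methods suffer from because of batch size limits. In addition, existing pseudo-label methods usually discard outlier samples that are hard to cluster, which sacrifices their potential value and limits model diversity and robustness.

The proposed solution has two key parts: 1) the ViT Token Constraint, which mitigates the damage that patch noise does to the ViT architecture; and 2) the Multi-scale Memory bank, which enhances the exploration of outlier samples and maintains feature consistency. Experiments show that the method achieves state-of-the-art performance on common benchmarks.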

Link: https://arxiv.org/abs/2501.09044
Authors: Zheng-An Zhu, Hsin-Che Chien, Chen-Kuo Chiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. On the other hand, previous memory-bank-based contrastive methods may lead to data inconsistency due to the limitation of batch size. Furthermore, existing pseudo-label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, leading to limited model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at this https URL.

[CV-75] CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion

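[Quick Read]: This paper addresses the challenge of cooking procedural image generation: generating a sequence of images from the cooking steps of a recipe. The generated images must match the cooking steps and also stay consistent across the sequence, so that they can enhance the cooking experience with visual guidance. The key to the solution is CookingDiffusion, which combines Stable Diffusion with three new Memory Nets that model procedural prompts. These prompts comprise text prompts (the cooking steps), image prompts (the corresponding cooking images), and multi-modal prompts (a mix of steps and images), which keeps the generated images sequentially consistent. The authors preprocess the YouCookII dataset to establish a new benchmark, and experiments show that CookingDiffusion generates high-quality, sequentially consistent cooking procedural images, as measured by FID and the proposed Average Procedure Consistency metric.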

Link: https://arxiv.org/abs/2501.09042
Authors: Yuan Wang, Bin Xhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, Xiang Wang
Affiliations: University of Science and Technology of China; Singapore Management University; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in text-to-image generation models have excelled in creating diverse and realistic images. This success extends to food imagery, where various conditional inputs like cooking styles, ingredients, and recipes are utilized. However, a yet-unexplored challenge is generating a sequence of procedural images based on cooking steps from a recipe. This could enhance the cooking experience with visual guidance and possibly lead to an intelligent cooking simulation system. To fill this gap, we introduce a novel task called cooking procedural image generation. This task is inherently demanding, as it strives to create photo-realistic images that align with cooking steps while preserving sequential consistency. To collectively tackle these challenges, we present CookingDiffusion, a novel approach that leverages Stable Diffusion and three innovative Memory Nets to model procedural prompts. These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images. To validate the effectiveness of our approach, we preprocess the YouCookII dataset, establishing a new benchmark. Our experimental results demonstrate that our model excels at generating high-quality cooking procedural images with remarkable consistency across sequential cooking steps, as measured by both the FID and the proposed Average Procedure Consistency metrics. Furthermore, CookingDiffusion demonstrates the ability to manipulate ingredients and cooking methods in a recipe. We will make our code, models, and dataset publicly accessible.

[CV-76] Pseudolabel guided pixels contrast for domain adaptive semantic segmentation

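[Quick Read]: This paper addresses Unsupervised Domain Adaptation (UDA) for semantic segmentation, in which a model is trained on labeled virtual data and adapted to unlabeled real data. Existing methods that use contrastive learning do not account for the diversity of features within each class, which leads to class prediction errors. The paper proposes a new framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes these limitations by using more information from target images while avoiding the noise introduced by pseudo-labels. Experiments show that PGPC achieves relative improvements of 5.1% mIoU and 4.6% mIoU on the standard GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes UDA benchmarks, respectively, and that it can improve other UDA methods without increasing model complexity.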

Link: https://arxiv.org/abs/2501.09040
Authors: Jianzi Xiang, Cailu Wan, Zhu Cao
Affiliations: The Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai 200237, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 pages, 5 figures. Code: this https URL

Abstract:Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotations at the pixel level. Acquiring such annotations can be costly in the real world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses virtual data with labels to train a model and adapts it to real data without labels. Some recent works use contrastive learning, which is a powerful method for self-supervised learning, to help with this technique. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at this https URL
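
The abstract does not give PGPC's loss, so the following is only a generic sketch of pixel-wise contrastive learning guided by pseudo-labels: pixels that share the anchor's pseudo-label act as positives in an InfoNCE-style loss, and all other pixels act as negatives. The temperature, function names, and toy features are hypothetical, and this is not the authors' exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pixel_contrast_loss(features, pseudo_labels, temperature=0.1):
    """Mean InfoNCE loss over all (anchor, positive) pixel pairs."""
    losses = []
    n = len(features)
    for i in range(n):
        # Exponentiated similarity of the anchor to every other pixel.
        sims = {j: math.exp(cosine(features[i], features[j]) / temperature)
                for j in range(n) if j != i}
        positives = [j for j in sims if pseudo_labels[j] == pseudo_labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        denom = sum(sims.values())
        losses.extend(-math.log(sims[j] / denom) for j in positives)
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical pixel embeddings forming two tight clusters.
features = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0)]
loss_consistent = pixel_contrast_loss(features, [0, 0, 1, 1])
loss_noisy = pixel_contrast_loss(features, [0, 1, 0, 1])
```

The loss is low when pseudo-labels agree with the feature clusters and high when they do not, which shows why noisy pseudo-labels are harmful in this kind of objective.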

[CV-77] Do generative video models learn physical principles from watching videos?

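[Quick Read]: This paper asks whether current video generation models learn “world models” that capture the laws of physics, or whether they are merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality. To answer this, the authors developed Physics-IQ, a comprehensive benchmark dataset that can only be solved with a deep understanding of physical principles such as fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics. Testing a range of current video generation models (including Sora, Runway, and Pika), they find that physical understanding is severely limited and unrelated to visual realism. Some test cases can nevertheless be solved, which suggests that certain physical principles may be learnable from observation alone, although major challenges remain. The key contribution is showing, through Physics-IQ, that visual realism does not imply physical understanding, which gives future model development an important reference point.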

链接: https://arxiv.org/abs/2501.09038
作者: Saman Motamed,Laura Culp,Kevin Swersky,Priyank Jaini,Robert Geirhos
机构: INSAIT Sofia University; Google DeepMind
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn “world models” that discover laws of physics – or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at this https URL; code at this https URL.

[CV-78] Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces

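Quick read: This paper addresses a significant gap in existing social robot navigation techniques: translating perception into socially compliant actions. Existing methods rely mainly on hand-crafted rules or human demonstrations and struggle to reproduce the way human reasoning naturally occurs in dynamic environments. The paper proposes using Vision-Language Models (VLMs) and language to bridge the gap between perception and socially aware actions. The key step is a vision-language dataset, SNEI (Social robot Navigation via Explainable Interactions), containing 40K human-annotated Visual Question Answers (VQAs) based on 2K human-robot interactions in unstructured, crowded public spaces and spanning perception, prediction, chain-of-thought reasoning, action, and explanation. The authors fine-tune a VLM, Social-LLaVA, on SNEI to demonstrate the dataset's practical use. Across 50 VQA tasks it outperforms state-of-the-art models such as GPT-4V and Gemini, enabling socially compliant robot navigation in dynamic public spaces through language reasoning.

Link: https://arxiv.org/abs/2501.09024
Authors: Amirreza Payandeh, Daeun Song, Mohammad Nazeri, Jing Liang, Praneel Mukherjee, Amir Hossain Raj, Yangzhe Kong, Dinesh Manocha, Xuesu Xiao
Affiliations: George Mason University; University of Maryland, College Park
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: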

Abstract:Most existing social robot navigation techniques either leverage hand-crafted rules or human demonstrations to connect robot perception to socially compliant actions. However, there remains a significant gap in effectively translating perception into socially compliant actions, much like how human reasoning naturally occurs in dynamic environments. Considering the recent success of Vision-Language Models (VLMs), we propose using language to bridge the gap in human-like reasoning between perception and socially aware robot actions. We create a vision-language dataset, Social robot Navigation via Explainable Interactions (SNEI), featuring 40K human-annotated Visual Question Answers (VQAs) based on 2K human-robot social interactions in unstructured, crowded public spaces, spanning perception, prediction, chain-of-thought reasoning, action, and explanation. We fine-tune a VLM, Social-LLaVA, using SNEI to demonstrate the practical application of our dataset. Social-LLaVA outperforms state-of-the-art models like GPT-4V and Gemini, based on the average of fifteen different human-judge scores across 50 VQA. Deployed onboard a mobile robot, Social-LLaVA enables human-like reasoning, marking a promising step toward socially compliant robot navigation in dynamic public spaces through language reasoning.

[CV-79] PISCO: Self-Supervised k-Space Regularization for Improved Neural Implicit k-Space Representations of Dynamic MRI

Quick read: This paper addresses the severe performance drop of neural implicit k-space representations (NIK) in dynamic magnetic resonance imaging (MRI): reducing acquisition time reduces the available training data, which leads to overfitting. The authors propose a novel self-supervised k-space loss function, $\mathcal{L}_{\mathrm{PISCO}}$, based on the concept of parallel imaging-inspired self-consistency (PISCO), which enforces a consistent global k-space neighborhood relationship without requiring additional data. Quantitative and qualitative evaluations on static and dynamic MR reconstructions show that integrating PISCO significantly improves NIK reconstruction quality. At high acceleration factors ($R \geq 54$) in particular, NIK with PISCO achieves better spatio-temporal reconstruction quality than state-of-the-art methods. An extensive analysis of the loss assumptions and stability further indicates PISCO's potential as a versatile self-supervised k-space loss function for other applications and architectures.

Link: https://arxiv.org/abs/2501.09403
Authors: Veronika Spieker, Hannah Eichhorn, Wenqi Huang, Jonathan K. Stelter, Tabita Catalan, Rickmer F. Braren, Daniel Rueckert, Francisco Sahli Costabal, Kerstin Hammernik, Dimitrios C. Karampinos, Claudia Prieto, Julia A. Schnabel
Affiliations: Institute of Machine Learning for Biomedical Imaging, Helmholtz Munich, Neuherberg, Germany; School of Computation, Information and Technology, Technical University of Munich, Germany; Millenium Institute for Intelligent Healthcare Engineering, Santiago, Chile; School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany; Department of Computing, Imperial College London, London, United Kingdom; School of Engineering, Pontificia Universidad Católica de Chile, Santiago, Chile; School of Biomedical Imaging and Imaging Sciences, King’s College London, London, United Kingdom
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
Comments:

Abstract:Neural implicit k-space representations (NIK) have shown promising results for dynamic magnetic resonance imaging (MRI) at high temporal resolutions. Yet, reducing acquisition time, and thereby available training data, results in severe performance drops due to overfitting. To address this, we introduce a novel self-supervised k-space loss function $\mathcal{L}_{\mathrm{PISCO}}$, applicable for regularization of NIK-based reconstructions. The proposed loss function is based on the concept of parallel imaging-inspired self-consistency (PISCO), enforcing a consistent global k-space neighborhood relationship without requiring additional data. Quantitative and qualitative evaluations on static and dynamic MR reconstructions show that integrating PISCO significantly improves NIK representations. Particularly for high acceleration factors ($R \geq 54$), NIK with PISCO achieves superior spatio-temporal reconstruction quality compared to state-of-the-art methods. Furthermore, an extensive analysis of the loss assumptions and stability shows PISCO’s potential as versatile self-supervised k-space loss function for further applications and architectures. Code is available at: this https URL

[CV-80] Joint Transmission and Deblurring: A Semantic Communication Approach Using Events

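Quick read: This paper addresses how to transmit, under limited channel bandwidth, both blurry images caused by motion blur and the abundant data generated by event cameras, while achieving high-quality image reconstruction. Most existing deep learning-based joint source-channel coding (JSCC) methods focus on transmitting clear images and overlook the motion blur that camera shake or fast-moving objects cause in real scenes. Motion blur significantly degrades image quality and makes transmission and reconstruction harder.

The key to the solution is a novel JSCC framework for the joint transmission of blurry images and event data. The framework extracts the shared information and the domain-specific information captured by the RGB camera and the event camera and transmits them separately, avoiding repeated transmission of the shared information. At the receiver, a deblurring decoder processes the received signals to generate clear images. The paper also introduces a multi-stage training strategy to optimize the model. Experiments show that the method handles motion blur effectively and significantly outperforms existing JSCC-based image transmission schemes.

Link: https://arxiv.org/abs/2501.09396
Authors: Pujing Yang, Guangyi Zhang, Yunlong Cai, Lei Yu, Guanding Yu
Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; School of Electronic Information, Wuhan University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: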

Abstract:Deep learning-based joint source-channel coding (JSCC) is emerging as a promising technology for effective image transmission. However, most existing approaches focus on transmitting clear images, overlooking real-world challenges such as motion blur caused by camera shaking or fast-moving objects. Motion blur often degrades image quality, making transmission and reconstruction more challenging. Event cameras, which asynchronously record pixel intensity changes with extremely low latency, have shown great potential for motion deblurring tasks. However, the efficient transmission of the abundant data generated by event cameras remains a significant challenge. In this work, we propose a novel JSCC framework for the joint transmission of blurry images and events, aimed at achieving high-quality reconstructions under limited channel bandwidth. This approach is designed as a deblurring task-oriented JSCC system. Since RGB cameras and event cameras capture the same scene through different modalities, their outputs contain both shared and domain-specific information. To avoid repeatedly transmitting the shared information, we extract and transmit their shared information and domain-specific information, respectively. At the receiver, the received signals are processed by a deblurring decoder to generate clear images. Additionally, we introduce a multi-stage training strategy to train the proposed model. Simulation results demonstrate that our method significantly outperforms existing JSCC-based image transmission schemes, addressing motion blur effectively.

[CV-81] Domain-conditioned and Temporal-guided Diffusion Modeling for Accelerated Dynamic MRI Reconstruction

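Quick read: This paper targets accelerated dynamic magnetic resonance imaging (MRI) reconstruction, in particular how to better capture spatiotemporal information in time-resolved multi-coil Cartesian and non-Cartesian data. It proposes dynamic Diffusion Modeling (dDiMo), which introduces a temporal-guided diffusion process to characterize spatiotemporal information. The key is to integrate temporal information into the diffusion modeling framework so that intra-frame spatial features and inter-frame temporal dynamics are captured together. dDiMo also uses additional spatiotemporal (x-t) and self-consistent frequency-temporal (k-t) priors to guide the diffusion process, ensuring precise temporal alignment and improving the recovery of image detail. A nonlinear conjugate gradient algorithm applied during the reverse diffusion steps keeps the diffusion process smooth. Experiments show that dDiMo achieves high-quality reconstructions at various acceleration factors and outperforms competing reconstruction methods in temporal alignment and structural recovery.

Link: https://arxiv.org/abs/2501.09305
Authors: Liping Zhang, Iris Yuwen Zhou, Sydney B. Montesi, Li Feng, Fang Liu
Affiliations: Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School; Division of Pulmonary and Critical Care Medicine, Massachusetts General Hospital and Harvard Medical School; Center for Advanced Imaging Innovation and Research, Department of Radiology, New York University Grossman School of Medicine
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 21 pages, 15 figures, 2 tables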

Abstract:Purpose: To propose a domain-conditioned and temporal-guided diffusion modeling method, termed dynamic Diffusion Modeling (dDiMo), for accelerated dynamic MRI reconstruction, enabling diffusion process to characterize spatiotemporal information for time-resolved multi-coil Cartesian and non-Cartesian data. Methods: The dDiMo framework integrates temporal information from time-resolved dimensions, allowing for the concurrent capture of intra-frame spatial features and inter-frame temporal dynamics in diffusion modeling. It employs additional spatiotemporal (x-t) and self-consistent frequency-temporal (k-t) priors to guide the diffusion process. This approach ensures precise temporal alignment and enhances the recovery of fine image details. To facilitate a smooth diffusion process, the nonlinear conjugate gradient algorithm is utilized during the reverse diffusion steps. The proposed model was tested on two types of MRI data: Cartesian-acquired multi-coil cardiac MRI and Golden-Angle-Radial-acquired multi-coil free-breathing lung MRI, across various undersampling rates. Results: dDiMo achieved high-quality reconstructions at various acceleration factors, demonstrating improved temporal alignment and structural recovery compared to other competitive reconstruction methods, both qualitatively and quantitatively. This proposed diffusion framework exhibited robust performance in handling both Cartesian and non-Cartesian acquisitions, effectively reconstructing dynamic datasets in cardiac and lung MRI under different imaging conditions. Conclusion: This study introduces a novel diffusion modeling method for dynamic MRI reconstruction.

[CV-82] Cancer-Net PCa-Seg: Benchmarking Deep Learning Models for Prostate Cancer Segmentation Using Synthetic Correlated Diffusion Imaging

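Quick read: This paper addresses challenges in prostate cancer (PCa) lesion segmentation, in particular the limited specificity and generalizability of traditional screening methods such as prostate-specific antigen (PSA) testing and magnetic resonance imaging (MRI). It explores the potential of a novel MRI modality, synthetic correlated diffusion imaging (CDI$^s$), for PCa lesion segmentation. The approach applies several state-of-the-art deep learning models (U-Net, SegResNet, Swin UNETR, Attention U-Net, and LightM-UNet) to segment lesions in CDI$^s$ data from 200 patients. SegResNet achieves the best segmentation performance, with a Dice-Sorensen coefficient (DSC) of $76.68 \pm 0.8$, while Attention U-Net offers a favorable balance between accuracy and computational efficiency (DSC $74.82 \pm 2.0$). These results suggest that deep learning models combined with CDI$^s$ can improve the accuracy of PCa lesion segmentation and thereby support PCa management and clinical decision-making.

Link: https://arxiv.org/abs/2501.09185
Authors: Jarett Dewbury, Chi-en Amy Tai, Alexander Wong
Affiliations: University of Waterloo
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures, to be published in Studies in Computational Intelligence. This paper introduces Cancer-Net PCa-Seg, a comprehensive evaluation of deep learning models for prostate cancer segmentation using synthetic correlated diffusion imaging (CDI$^s$). We benchmark five state-of-the-art architectures: U-Net, SegResNet, Swin UNETR, Attention U-Net, and LightM-UNet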

Abstract:Prostate cancer (PCa) is the most prevalent cancer among men in the United States, accounting for nearly 300,000 cases, 29% of all diagnoses and 35,000 total deaths in 2024. Traditional screening methods such as prostate-specific antigen (PSA) testing and magnetic resonance imaging (MRI) have been pivotal in diagnosis, but have faced limitations in specificity and generalizability. In this paper, we explore the potential of enhancing PCa lesion segmentation using a novel MRI modality called synthetic correlated diffusion imaging (CDI$^s$). We employ several state-of-the-art deep learning models, including U-Net, SegResNet, Swin UNETR, Attention U-Net, and LightM-UNet, to segment PCa lesions from a 200 CDI$^s$ patient cohort. We find that SegResNet achieved superior segmentation performance with a Dice-Sorensen coefficient (DSC) of $76.68 \pm 0.8$. Notably, the Attention U-Net, while slightly less accurate (DSC $74.82 \pm 2.0$), offered a favorable balance between accuracy and computational efficiency. Our findings demonstrate the potential of deep learning models in improving PCa lesion segmentation using CDI$^s$ to enhance PCa management and clinical support.
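
The models above are ranked by the Dice-Sorensen coefficient. A minimal sketch of that metric on flat binary masks follows. The function name and the toy masks are illustrative assumptions, and the paper appears to report DSC scaled to a percentage.

```python
def dice_coefficient(pred, target):
    """Dice-Sorensen coefficient 2|A∩B| / (|A| + |B|) for flat 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical toy masks: one of the two predicted lesion pixels is correct.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(100.0 * dice_coefficient(pred, target))  # percentage scale
```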

[CV-83] Deep Distance Map Regression Network with Shape-aware Loss for Imbalanced Medical Image Segmentation

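Quick read: This paper addresses small-object segmentation, such as tumor segmentation, in medical image analysis. Although deep learning-based methods have made notable progress, they are typically restricted to binary segmentation masks, which limits the precision and information content of the segmentation. The key idea is to adopt the distance map as a new form of ground truth and to propose a segmentation framework that combines an existing binary segmentation network with a lightweight regression network (LR-Net). By casting the computation of the distance map as a regression task, LR-Net can exploit the rich information in distance maps. The paper also derives a shape-aware loss that uses distance maps as a penalty map to infer the complete shape of an object. Experiments show that the method outperforms classification-based methods and other existing state-of-the-art approaches on the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge dataset and a clinical dataset.

Link: https://arxiv.org/abs/2501.09116
Authors: Huiyu Li, Xiabi Liu, Said Boumaraf, Xiaopeng Gong, Donghai Liao, Xiaohong Ma
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference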

Abstract:Small object segmentation, like tumor segmentation, is a difficult and critical task in the field of medical image analysis. Although deep learning based methods have achieved promising performance, they are restricted to the use of binary segmentation masks. Inspired by the rigorous mapping between the binary segmentation mask and the distance map, we adopt the distance map as a novel ground truth and employ a network to carry out the computation of the distance map. Specifically, we propose a new segmentation framework that incorporates the existing binary segmentation network and a lightweight regression network (dubbed LR-Net). Thus, the LR-Net can convert the distance map computation into a regression task and leverage the rich information of distance maps. Additionally, we derive a shape-aware loss by employing distance maps as a penalty map to infer the complete shape of an object. We evaluated our approach on the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge dataset and a clinical dataset. Experimental results show that our approach outperforms the classification-based methods as well as other existing state-of-the-art methods.
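
The abstract relies on a mapping between a binary mask and its distance map. The sketch below shows one common variant under stated assumptions: each foreground pixel stores its Euclidean distance to the nearest background pixel, and thresholding at zero recovers the mask. The paper may use a different definition, and the brute-force search is for clarity only.

```python
import math


def distance_map(mask):
    """Distance from each foreground pixel (1) to the nearest background pixel (0)."""
    height, width = len(mask), len(mask[0])
    background = [(r, c) for r in range(height) for c in range(width) if mask[r][c] == 0]
    dist = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c] == 1 and background:
                dist[r][c] = min(math.hypot(r - br, c - bc) for br, bc in background)
    return dist


def mask_from_distance(dist, eps=1e-9):
    """Recover the binary mask: foreground is wherever the distance is positive."""
    return [[1 if d > eps else 0 for d in row] for row in dist]


# Hypothetical 5x5 mask with a 3x3 foreground square in the centre.
mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
for row in distance_map(mask):
    print(row)
```

A regression network trained on such maps receives a graded target that grows toward the object interior, which is the extra information a 0/1 mask does not carry.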

[CV-84] Relation U-Net

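Quick read: This paper addresses how to estimate the confidence of a segmentation result in medical image segmentation when no ground truth is available. Conventional segmentation networks such as the vanilla U-Net output only the segmentation and give no assessment of its reliability. The authors propose Relation U-Net, whose key innovation is to output segmentation maps for multiple input images together with their pairwise relation maps, and to estimate a confidence score for the test image from the differences among these relation maps. Experiments show that Relation U-Net is more accurate than the vanilla U-Net and also provides a confidence score that is linearly correlated with segmentation accuracy, offering a more reliable basis for clinical decisions.

Link: https://arxiv.org/abs/2501.09101
Authors: Sheng He, Rina Bao, P. Ellen Grant, Yangming Ou
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: ISIB 2025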

Abstract:Towards clinical interpretations, this paper presents a new “output-with-confidence” segmentation neural network with multiple input images and multiple output segmentation maps and their pairwise relations. A confidence score of the test image without ground-truth can be estimated from the difference among the estimated relation maps. We evaluate the method based on the widely used vanilla U-Net for segmentation and our new model is named Relation U-Net which can output segmentation maps of the input images as well as an estimated confidence score of the test image without ground-truth. Experimental results on four public datasets show that Relation U-Net can not only provide better accuracy than vanilla U-Net but also estimate a confidence score which is linearly correlated to the segmentation accuracy on test images.

[CV-85] Self Pre-training with Adaptive Mask Autoencoders for Variable-Contrast 3D Medical Imaging

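Quick read: This paper addresses the inability of existing methods in medical image analysis to handle a varying number of input images, which is common in magnetic resonance imaging (MRI) studies where the number of 3D input contrasts differs from subject to subject. The authors propose a 3D Adaptive Masked Autoencoders (AMAE) architecture that accommodates a variable number of 3D input contrasts per subject. The approach relies on self-supervised pre-training: a masked autoencoder (MAE) reconstructs complete images from partially masked inputs, so that the Vision Transformer (ViT) encoder learns to aggregate contextual information to predict the missing regions. Experiments show that this self pre-training improves the infarct segmentation performance of ViT-based segmentation models by 2.8%-3.7%.

Link: https://arxiv.org/abs/2501.09096
Authors: Badhan Kumar Das, Gengyan Zhao, Han Liu, Thomas J. Re, Dorin Comaniciu, Eli Gibson, Andreas Maier
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, ISBI 2025 accepted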

Abstract:The Masked Autoencoder (MAE) has recently demonstrated effectiveness in pre-training Vision Transformers (ViT) for analyzing natural images. By reconstructing complete images from partially masked inputs, the ViT encoder gathers contextual information to predict the missing regions. This capability to aggregate context is especially important in medical imaging, where anatomical structures are functionally and mechanically linked to surrounding regions. However, current methods do not consider variations in the number of input images, which is typically the case in real-world Magnetic Resonance (MR) studies. To address this limitation, we propose a 3D Adaptive Masked Autoencoders (AMAE) architecture that accommodates a variable number of 3D input contrasts per subject. A magnetic resonance imaging (MRI) dataset of 45,364 subjects was used for pretraining and a subset of 1648 training, 193 validation and 215 test subjects were used for finetuning. The performance demonstrates that self pre-training of this adaptive masked autoencoders can enhance the infarct segmentation performance by 2.8%-3.7% for ViT-based segmentation models.
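
To make the masking step concrete, here is a minimal sketch of MAE-style random patch masking when the number of input contrasts varies per subject. It assumes that patches from all available contrasts are pooled into one token sequence before masking. That pooling, the 0.75 mask ratio, and the function name are illustrative assumptions and not details confirmed by the abstract.

```python
import random


def random_mask_indices(num_contrasts, patches_per_contrast, mask_ratio=0.75, seed=0):
    """Split pooled patch indices into visible and masked sets.

    The token count scales with the number of contrasts a subject has, so
    the same routine works for any subject (hypothetical pooling scheme).
    """
    rng = random.Random(seed)
    total = num_contrasts * patches_per_contrast
    num_masked = int(total * mask_ratio)
    order = list(range(total))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# One subject with 3 contrasts and another with 2, each with 8 patches per contrast.
for contrasts in (3, 2):
    visible, masked = random_mask_indices(contrasts, patches_per_contrast=8)
    print(contrasts, len(visible), len(masked))
```

In this setup the encoder would see only the visible patches and the decoder would be trained to reconstruct the masked ones.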

[CV-86] Dynamic-Aware Spatio-temporal Representation Learning for Dynamic MRI Reconstruction

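Quick read: This paper addresses two main problems in dynamic MRI reconstruction: the practical difficulty of obtaining ground truth data for conventional methods, and the long optimization time and extensive hyperparameter tuning required by existing methods based on implicit neural representation (INR). It proposes Dynamic-Aware INR (DA-INR), a model that captures the spatial and temporal continuity of dynamic MRI data and explicitly incorporates the temporal redundancy of the data into the model structure. As a result, DA-INR achieves better reconstruction quality than other models even at extreme undersampling ratios, while significantly reducing optimization time and requiring minimal hyperparameter tuning.

Link: https://arxiv.org/abs/2501.09049
Authors: Dayoung Baik, Jaejun Yoo
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: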

Abstract:Dynamic MRI reconstruction, an inverse problem, has seen a surge in the use of deep learning techniques. In particular, the practical difficulty of obtaining ground truth data has led to the emergence of unsupervised learning approaches. A recent promising method among them is implicit neural representation (INR), which defines the data as a continuous function that maps coordinate values to the corresponding signal values. This allows missing information to be filled in from incomplete measurements alone, solving the inverse problem effectively. Nevertheless, previous works incorporating this method have faced drawbacks such as long optimization time and the need for extensive hyperparameter tuning. To address these issues, we propose Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction that captures the spatial and temporal continuity of dynamic MRI data in the image domain and explicitly incorporates the temporal redundancy of the data into the model structure. As a result, DA-INR outperforms other models in reconstruction quality even at extreme undersampling ratios while significantly reducing optimization time and requiring minimal hyperparameter tuning.

[CV-87] Learning Hemodynamic Scalar Fields on Coronary Artery Meshes: A Benchmark of Geometric Deep Learning Models

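Quick read: This paper addresses a key problem in diagnosing coronary artery disease (CAD): assessing coronary hemodynamics accurately and non-invasively. The diagnostic gold standard, fractional flow reserve (FFR), is accurate, but its invasiveness and cost limit wide use. Virtual FFR (vFFR) based on computational fluid dynamics (CFD) was developed in response, and this study uses geometric deep learning algorithms that learn hemodynamic features on meshes as surrogates for the CFD simulation. It compares six backends for predicting vFFR fields in coronary arteries and finds that transformer-based models perform especially well on complex, heterogeneous datasets, in particular when predicting pressure and vFFR fields in stenotic lesions. The results suggest that geometric deep learning backends can effectively replace CFD for simple geometries, that transformer networks are superior on complex datasets, and that pressure drop is the best network output for learning pressure-related fields.

Link: https://arxiv.org/abs/2501.09046
Authors: Guido Nannini, Julian Suk, Patryk Rygiel, Simone Saitta, Luca Mariani, Riccardo Maranga, Andrea Baggiano, Gianluca Pontone, Alberto Redaelli
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: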

Abstract:Coronary artery disease, caused by the narrowing of coronary vessels due to atherosclerosis, is the leading cause of death worldwide. The diagnostic gold standard, fractional flow reserve (FFR), measures the trans-stenotic pressure ratio during maximal vasodilation but is invasive and costly. This has driven the development of virtual FFR (vFFR) using computational fluid dynamics (CFD) to simulate coronary flow. Geometric deep learning algorithms have shown promise for learning features on meshes, including cardiovascular research applications. This study empirically analyzes various backends for predicting vFFR fields in coronary arteries as CFD surrogates, comparing six backends for learning hemodynamics on meshes using CFD solutions as ground truth. The study has two parts: i) Using 1,500 synthetic left coronary artery bifurcations, models were trained to predict pressure-related fields for vFFR reconstruction, comparing different learning variables. ii) Using 427 patient-specific CFD simulations, experiments were repeated focusing on the best-performing learning variable from the synthetic dataset. Most backends performed well on the synthetic dataset, especially when predicting pressure drop over the manifold. Transformer-based backends outperformed others when predicting pressure and vFFR fields and were the only models achieving strong performance on patient-specific data, excelling in both average per-point error and vFFR accuracy in stenotic lesions. These results suggest geometric deep learning backends can effectively replace CFD for simple geometries, while transformer-based networks are superior for complex, heterogeneous datasets. Pressure drop was identified as the optimal network output for learning pressure-related fields. 
Cite as: arXiv:2501.09046 [eess.IV] (https://doi.org/10.48550/arXiv.2501.09046). Submitted by Guido Nannini; v1: Wed, 15 Jan 2025 09:52:40 UTC (15,622 KB)
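
Since the abstract singles out pressure drop as the best network output, a small sketch may help show how a vFFR field could be rebuilt from it. It uses the usual definition of FFR as the ratio of distal to aortic pressure. The function names, the pressure values, and the use of a single inlet pressure are assumptions for illustration and do not come from the paper.

```python
def vffr_from_pressure(p_distal, p_aortic):
    """FFR-style ratio of distal pressure to aortic (inlet) pressure."""
    return p_distal / p_aortic


def vffr_from_pressure_drop(pressure_drop, p_aortic):
    """Same ratio written in terms of the drop p_aortic - p_distal."""
    return 1.0 - pressure_drop / p_aortic


def vffr_field(pressure_drops, p_aortic):
    """Apply the conversion to a per-vertex list of predicted pressure drops."""
    return [vffr_from_pressure_drop(dp, p_aortic) for dp in pressure_drops]


# Hypothetical per-vertex pressure drops in mmHg along a vessel, inlet at 90 mmHg.
print(vffr_field([0.0, 4.5, 18.0], p_aortic=90.0))
```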

Artificial Intelligence

[AI-0] KU AIGEN ICL EDI@BC8 Track 3: Advancing Phenotype Named Entity Recognition and Normalization for Dysmorphology Physical Examination Reports

Link: https://arxiv.org/abs/2501.09744
Authors: Hajung Kim, Chanhwi Kim, Jiwoong Sohn, Tim Beck, Marek Rei, Sunkyu Kim, T Ian Simpson, Joram M Posma, Antoine Lain, Mujeen Sung, Jaewoo Kang
Subjects: Artificial Intelligence (cs.AI)
Comments: This article is part of the Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models

Abstract:The objective of BioCreative8 Track 3 is to extract phenotypic key medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline resulted in an exact extraction and normalization F1 score 2.6% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.

[AI-1] Parallel multi-objective metaheuristics for smart communications in vehicular networks

Link: https://arxiv.org/abs/2501.09725
Authors: Jamal Toutouh, Enrique Alba
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:This article analyzes the use of two parallel multi-objective soft computing algorithms to automatically search for high-quality settings of the Ad hoc On Demand Vector routing protocol for vehicular networks. These methods are based on an evolutionary algorithm and on a swarm intelligence approach. The experimental analysis demonstrates that the configurations computed by our optimization algorithms outperform other state-of-the-art optimized ones. In turn, the computational efficiency achieved by all the parallel versions is greater than 87 %. Therefore, the line of work presented in this article represents an efficient framework to improve vehicular communications.
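
Multi-objective search of this kind typically keeps the set of non-dominated configurations. The sketch below shows Pareto dominance and front extraction for minimization. The two objectives in the example (for instance packet loss and delay) and all values are hypothetical, and the paper's evolutionary and swarm algorithms are not reproduced here.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (packet_loss, delay) scores for four routing configurations.
configs = [(1, 5), (2, 2), (3, 3), (5, 1)]
print(pareto_front(configs))
```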

[AI-2] CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education

Link: https://arxiv.org/abs/2501.09709
Authors: Tianyu Wang, Nianjun Zhou, Zhixiong Chen
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 figures

Abstract:Many non-traditional students in cybersecurity programs often lack access to advice from peers, family members and professors, which can hinder their educational experiences. Additionally, these students may not fully benefit from various LLM-powered AI assistants due to issues like content relevance, locality of advice, minimum expertise, and timing. This paper addresses these challenges by introducing an application designed to provide comprehensive support by answering questions related to knowledge, skills, and career preparation advice tailored to the needs of these students. We developed a learning tool platform, CyberMentor, to address the diverse needs and pain points of students majoring in cybersecurity. Powered by agentic workflow and Generative Large Language Models (LLMs), the platform leverages Retrieval-Augmented Generation (RAG) for accurate and contextually relevant information retrieval to achieve accessibility and personalization. We demonstrated its value in addressing knowledge requirements for cybersecurity education and for career marketability, in tackling skill requirements for analytical and programming assignments, and in delivering real time on demand learning support. Using three use scenarios, we showcased CyberMentor in facilitating knowledge acquisition and career preparation and providing seamless skill-based guidance and support. We also employed the LangChain prompt-based evaluation methodology to evaluate the platform’s impact, confirming its strong performance in helpfulness, correctness, and completeness. These results underscore the system’s ability to support students in developing practical cybersecurity skills while improving equity and sustainability within higher education. Furthermore, CyberMentor’s open-source design allows for adaptation across other disciplines, fostering educational innovation and broadening its potential impact.

[AI-3] The Goofus & Gallant Story Corpus for Practical Value Alignment

Link: https://arxiv.org/abs/2501.09707
Authors: Md Sultan Al Nahian, Tasmia Tasrin, Spencer Frazier, Mark Riedl, Brent Harrison
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by International Conference on Machine Learning and Applications (ICMLA) 2024. Main Conference, Long Paper

Abstract:Values or principles are key elements of human society that influence people to behave and function according to an accepted standard set of social rules to maintain social order. As AI systems are becoming ubiquitous in human society, it is a major concern that they could violate these norms or values and potentially cause harm. Thus, to prevent intentional or unintentional harm, AI systems are expected to take actions that align with these principles. Training systems to exhibit this type of behavior is difficult and often requires a specialized dataset. This work presents a multi-modal dataset illustrating normative and non-normative behavior in real-life situations described through natural language and artistic images. This training set contains curated sets of images that are designed to teach young children about social principles. We argue that this is an ideal dataset to use for training socially normative agents given this fact.

[AI-4] Cueless EEG imagined speech for subject identification: dataset and benchmarks

Link: https://arxiv.org/abs/2501.09700
Authors: Ali Derakhshesh, Zahra Dehghanian, Reza Ebrahimpour, Hamid R. Rabiee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Electroencephalogram (EEG) signals have emerged as a promising modality for biometric identification. While previous studies have explored the use of imagined speech with semantically meaningful words for subject identification, most have relied on additional visual or auditory cues. In this study, we introduce a cueless EEG-based imagined speech paradigm, where subjects imagine the pronunciation of semantically meaningful words without any external cues. This innovative approach addresses the limitations of prior methods by requiring subjects to select and imagine words from a predefined list naturally. The dataset comprises over 4,350 trials from 11 subjects across five sessions. We assess a variety of classification methods, including traditional machine learning techniques such as Support Vector Machines (SVM) and XGBoost, as well as time-series foundation models and deep learning architectures specifically designed for EEG classification, such as EEG Conformer and Shallow ConvNet. A session-based hold-out validation strategy was employed to ensure reliable evaluation and prevent data leakage. Our results demonstrate outstanding classification accuracy, reaching 97.93%. These findings highlight the potential of cueless EEG paradigms for secure and reliable subject identification in real-world applications, such as brain-computer interfaces (BCIs).
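
The session-based hold-out mentioned above can be sketched as follows: every trial from the held-out sessions goes to the test set, so no recording session contributes to both sides. The record layout and the choice of which session to hold out are assumptions for illustration. The abstract does not say which sessions were used for testing.

```python
def session_holdout_split(trials, test_sessions):
    """Split trial records by session so train and test never share a session."""
    train = [t for t in trials if t["session"] not in test_sessions]
    test = [t for t in trials if t["session"] in test_sessions]
    return train, test


# Hypothetical records: 2 subjects x 5 sessions x 3 trials each.
trials = [
    {"subject": s, "session": k, "trial": i}
    for s in range(2)
    for k in range(1, 6)
    for i in range(3)
]
train, test = session_holdout_split(trials, test_sessions={5})
print(len(train), len(test))
```

Splitting at random over trials instead would let trials from the same session land on both sides, which is the leakage the authors say they avoid.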

[AI-5] Reward-Guided Controlled Generation for Inference-Time Alignment in Diffusion Models: Tutorial and Review

Link: https://arxiv.org/abs/2501.09685
Authors: Masatoshi Uehara, Yulai Zhao, Chenyu Wang, Xiner Li, Aviv Regev, Sergey Levine, Tommaso Biancalani
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: We plan to add more content/codes. Please let us know if there are any comments

Abstract:This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques – such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance – aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict from intermediate states to terminal rewards. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at this https URL
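As a rough illustration of the "soft optimal denoising process" described in the abstract, one common way to write such a policy is given below. The notation is assumed for this sketch and may differ from the tutorial's own: $p^{\mathrm{pre}}$ is the pre-trained denoising process, $r$ the terminal reward, $v_t$ the value (look-ahead) function, and $\alpha$ a temperature.

```latex
p^{\star}_{t-1}(x_{t-1} \mid x_t) \;\propto\;
  p^{\mathrm{pre}}_{t-1}(x_{t-1} \mid x_t)\,
  \exp\!\left(\frac{v_{t-1}(x_{t-1})}{\alpha}\right),
\qquad
v_{t}(x_{t}) \;=\; \alpha \log
  \mathbb{E}_{p^{\mathrm{pre}}}\!\left[
    \exp\!\left(\frac{r(x_0)}{\alpha}\right) \,\middle|\, x_{t}
  \right]
```

Read this way, the pre-trained denoiser proposes the next state and the value function reweights proposals toward those expected to reach high terminal reward.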

[AI-6] Authenticated Delegation and Authorized AI Agents

Link: https://arxiv.org/abs/2501.09674
Authors: Tobin South, Samuele Marro, Thomas Hardjono, Robert Mahari, Cedric Deslandes Whitney, Dazza Greenwood, Alan Chan, Alex Pentland
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

Abstract:The rapid deployment of autonomous AI agents creates urgent challenges around authorization, accountability, and access control in digital spaces. New standards are needed to know whom AI agents act on behalf of and guide their use appropriately, protecting online spaces while unlocking the value of task delegation to autonomous agents. We introduce a novel framework for authenticated, authorized, and auditable delegation of authority to AI agents, where human users can securely delegate and restrict the permissions and scope of agents while maintaining clear chains of accountability. This framework builds on existing identification and access management protocols, extending OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata, maintaining compatibility with established authentication and web infrastructure. Further, we propose a framework for translating flexible, natural language permissions into auditable access control configurations, enabling robust scoping of AI agent capabilities across diverse interaction modalities. Taken together, this practical approach facilitates immediate deployment of AI agents while addressing key security and accountability concerns, working toward ensuring agentic AI systems perform only appropriate actions and providing a tool for digital service providers to enable AI agent interactions without risking harm from scalable interaction.

[AI-7] Monte Carlo Tree Search with Velocity Obstacles for safe and efficient motion planning in dynamic environments

Link: https://arxiv.org/abs/2501.09649
Authors: Lorenzo Bonanni, Daniele Meli, Alberto Castellini, Alessandro Farinelli
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Online motion planning is a challenging problem for intelligent robots moving in dense environments with dynamic obstacles, e.g., crowds. In this work, we propose a novel approach for optimal and safe online motion planning with minimal information about dynamic obstacles. Specifically, our approach requires only the current position of the obstacles and their maximum speed, but it does not need any information about their exact trajectories or dynamic model. The proposed methodology combines Monte Carlo Tree Search (MCTS), for online optimal planning via model simulations, with Velocity Obstacles (VO), for obstacle avoidance. We perform experiments in a cluttered simulated environment with walls, and up to 40 dynamic obstacles moving with random velocities and directions. With an ablation study, we show the key contribution of VO in scaling up the efficiency of MCTS, selecting the safest and most rewarding actions in the tree of simulations. Moreover, we show the superiority of our methodology with respect to state-of-the-art planners, including Non-linear Model Predictive Control (NMPC), in terms of improved collision rate, computational and task performance.
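To make the idea of pruning actions from only an obstacle's current position and maximum speed concrete, here is a simplified sketch. It is not the authors' Velocity Obstacle formulation: it assumes circular agents, a constant candidate velocity, and a conservative model in which the obstacle may move in any direction at up to its maximum speed, so its reachable region is a disc that grows over time. All function and parameter names are hypothetical.

```python
import math


def is_velocity_safe(robot_pos, velocity, obstacle_pos, obstacle_max_speed,
                     combined_radius, horizon, dt=0.05):
    """Conservative check of one candidate velocity against one obstacle.

    The obstacle's reachable set at time t is taken to be a disc of radius
    combined_radius + obstacle_max_speed * t around its current position.
    Time is sampled at steps of dt, so this is an approximation.
    """
    steps = int(round(horizon / dt))
    for i in range(steps + 1):
        t = i * dt
        rx = robot_pos[0] + velocity[0] * t
        ry = robot_pos[1] + velocity[1] * t
        dist = math.hypot(rx - obstacle_pos[0], ry - obstacle_pos[1])
        if dist <= combined_radius + obstacle_max_speed * t:
            return False
    return True


def safe_velocities(robot_pos, candidates, obstacles, obstacle_max_speed,
                    combined_radius, horizon):
    """Keep only the candidate velocities that are safe for every obstacle."""
    return [
        v for v in candidates
        if all(is_velocity_safe(robot_pos, v, o, obstacle_max_speed,
                                combined_radius, horizon) for o in obstacles)
    ]


# Robot at the origin, one obstacle 5 m ahead along the x-axis.
candidates = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
safe = safe_velocities((0.0, 0.0), candidates, [(5.0, 0.0)],
                       obstacle_max_speed=0.5, combined_radius=1.0,
                       horizon=3.0)
print(safe)
```

In a tree search, a filter of this kind could restrict which actions are expanded at each node, which is one plausible reading of how obstacle avoidance can reduce the search effort.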

[AI-8] NS-Gym: Open-Source Simulation Environments and Benchmarks for Non-Stationary Markov Decision Processes

Link: https://arxiv.org/abs/2501.09646
Authors: Nathaniel S. Keplinger, Baiting Luo, Iliyas Bektas, Yunuo Zhang, Kyle Hollins Wray, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 17 figures

Abstract:In many real-world applications, agents must make sequential decisions in environments where conditions are subject to change due to various exogenous factors. These non-stationary environments pose significant challenges to traditional decision-making models, which typically assume stationary dynamics. Non-stationary Markov decision processes (NS-MDPs) offer a framework to model and solve decision problems under such changing conditions. However, the lack of standardized benchmarks and simulation tools has hindered systematic evaluation and progress in this field. We present NS-Gym, the first simulation toolkit designed explicitly for NS-MDPs, integrated within the popular Gymnasium framework. In NS-Gym, we segregate the evolution of the environmental parameters that characterize non-stationarity from the agent’s decision-making module, allowing for modular and flexible adaptations to dynamic environments. We review prior work in this domain and present a toolkit encapsulating key problem characteristics and types in NS-MDPs. This toolkit is the first effort to develop a set of standardized interfaces and benchmark problems to enable consistent and reproducible evaluation of algorithms under non-stationary conditions. We also benchmark six algorithmic approaches from prior work on NS-MDPs using NS-Gym. Our vision is that NS-Gym will enable researchers to assess the adaptability and robustness of their decision-making algorithms to non-stationary conditions.

[AI-9] Electronic Health Records: Towards Digital Twins in Healthcare

Link: https://arxiv.org/abs/2501.09640
Authors: Muhammet Alkan, Hester Huijsdens, Yola Jones, Fani Deligianni
Categories: Artificial Intelligence (cs.AI)

Abstract:The pivotal shift from traditional paper-based records to sophisticated Electronic Health Records (EHR) enabled systematic collection and analysis of patient data through descriptive statistics, providing insight into patterns and trends across patient populations. This evolution continued toward predictive analytics, allowing healthcare providers to anticipate patient outcomes and potential complications before they occur. This progression from basic digital record-keeping to sophisticated predictive modelling and digital twins reflects healthcare’s broader evolution toward more integrated, patient-centred approaches that combine data-driven insights with personalized care delivery. This chapter explores the evolution and significance of healthcare information systems, beginning with an examination of the implementation of EHR in the UK and the USA. It provides a comprehensive overview of the International Classification of Diseases (ICD) system, tracing its development from ICD-9 to ICD-10. Central to this discussion is the MIMIC-III database, a landmark achievement in healthcare data sharing and arguably the most comprehensive critical care database freely available to researchers worldwide. MIMIC-III has democratized access to high-quality healthcare data, enabling unprecedented opportunities for research and analysis. The chapter examines its structure, clinical outcome analysis capabilities, and practical applications through case studies, with a particular focus on mortality and length of stay metrics, vital signs extraction, and ICD coding. Through detailed entity-relationship diagrams and practical examples, the text illustrates MIMIC’s complex data structure and demonstrates how different querying approaches can lead to subtly different results, emphasizing the critical importance of understanding the database’s architecture for accurate data extraction.

[AI-10] Platform-Aware Mission Planning

Link: https://arxiv.org/abs/2501.09632
Authors: Stefan Panjkovic, Alessandro Cimatti, Andrea Micheli, Stefano Tonetta
Categories: Artificial Intelligence (cs.AI)

Abstract:Planning for autonomous systems typically requires reasoning with models at different levels of abstraction, and the harmonization of two competing sets of objectives: high-level mission goals that refer to an interaction of the system with the external environment, and low-level platform constraints that aim to preserve the integrity and the correct interaction of the subsystems. The complicated interplay between these two models makes it very hard to reason on the system as a whole, especially when the objective is to find plans with robustness guarantees, considering the non-deterministic behavior of the lower layers of the system. In this paper, we introduce the problem of Platform-Aware Mission Planning (PAMP), addressing it in the setting of temporal durative actions. The PAMP problem differs from standard temporal planning for its exists-forall nature: the high-level plan dealing with mission goals is required to satisfy safety and executability constraints, for all the possible non-deterministic executions of the low-level model of the platform and the environment. We propose two approaches for solving PAMP. The first baseline approach amalgamates the mission and platform levels, while the second is based on an abstraction-refinement loop that leverages the combination of a planner and a verification engine. We prove the soundness and completeness of the proposed approaches and validate them experimentally, demonstrating the importance of heterogeneous modeling and the superiority of the technique based on abstraction-refinement.

[AI-11] Artificial Intelligence-Driven Clinical Decision Support Systems

Link: https://arxiv.org/abs/2501.09628
Authors: Muhammet Alkan, Idris Zakariyya, Samuel Leighton, Kaushik Bhargav Sivangi, Christos Anagnostopoulos, Fani Deligianni
Categories: Artificial Intelligence (cs.AI)

Abstract:As artificial intelligence (AI) becomes increasingly embedded in healthcare delivery, this chapter explores the critical aspects of developing reliable and ethical Clinical Decision Support Systems (CDSS). Beginning with the fundamental transition from traditional statistical models to sophisticated machine learning approaches, this work examines rigorous validation strategies and performance assessment methods, including the crucial role of model calibration and decision curve analysis. The chapter emphasizes that creating trustworthy AI systems in healthcare requires more than just technical accuracy; it demands careful consideration of fairness, explainability, and privacy. The challenge of ensuring equitable healthcare delivery through AI is stressed, discussing methods to identify and mitigate bias in clinical predictive models. The chapter then delves into explainability as a cornerstone of human-centered CDSS. This focus reflects the understanding that healthcare professionals must not only trust AI recommendations but also comprehend their underlying reasoning. The discussion advances in an analysis of privacy vulnerabilities in medical AI systems, from data leakage in deep learning models to sophisticated attacks against model explanations. The text explores privacy-preservation strategies such as differential privacy and federated learning, while acknowledging the inherent trade-offs between privacy protection and model performance. This progression, from technical validation to ethical considerations, reflects the multifaceted challenges of developing AI systems that can be seamlessly and reliably integrated into daily clinical practice while maintaining the highest standards of patient care and data protection.

[AI-12] Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Link: https://arxiv.org/abs/2501.09620
Authors: Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, Sinong Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model’s ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.

[AI-13] Managed-Retention Memory: A New Class of Memory for the AI Era

Link: https://arxiv.org/abs/2501.09605
Authors: Sergey Legtchenko, Ioan Stefanovici, Richard Black, Antony Rowstron, Junyi Liu, Paolo Costa, Burcu Canakci, Dushyanth Narayanan, Xingbo Wu
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments: 8 pages (5 content + 3 refs); 1 figure

Abstract:AI clusters today are one of the major uses of High Bandwidth Memory (HBM). However, HBM is suboptimal for AI workloads for several reasons. Analysis shows HBM is overprovisioned on write performance, but underprovisioned on density and read bandwidth, and also has significant energy per bit overheads. It is also expensive, with lower yield than DRAM due to manufacturing complexity. We propose a new memory class: Managed-Retention Memory (MRM), which is more optimized to store key data structures for AI inference workloads. We believe that MRM may finally provide a path to viability for technologies that were originally proposed to support Storage Class Memory (SCM). These technologies traditionally offered long-term persistence (10+ years) but provided poor IO performance and/or endurance. MRM makes different trade-offs, and by understanding the workload IO patterns, MRM foregoes long-term data retention and write performance for better potential performance on the metrics important for these workloads.

[AI-14] Reducing the Sensitivity of Neural Physics Simulators to Mesh Topology via Pretraining

Link: https://arxiv.org/abs/2501.09597
Authors: Nathan Vaska, Justin Goodwin, Robin Walters, Rajmonda S. Caceres
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures

Abstract:Meshes are used to represent complex objects in high fidelity physics simulators across a variety of domains, such as radar sensing and aerodynamics. There is growing interest in using neural networks to accelerate physics simulations, and also a growing body of work on applying neural networks directly to irregular mesh data. Since multiple mesh topologies can represent the same object, mesh augmentation is typically required to handle topological variation when training neural networks. Due to the sensitivity of physics simulators to small changes in mesh shape, it is challenging to use these augmentations when training neural network-based physics simulators. In this work, we show that variations in mesh topology can significantly reduce the performance of neural network simulators. We evaluate whether pretraining can be used to address this issue, and find that employing an established autoencoder pretraining technique with graph embedding models reduces the sensitivity of neural network simulators to variations in mesh topology. Finally, we highlight future research directions that may further reduce neural simulator sensitivity to mesh topology.

[AI-15] IFRA: a machine learning-based Instrumented Fall Risk Assessment Scale derived from Instrumented Timed Up and Go test in stroke patients

Link: https://arxiv.org/abs/2501.09595
Authors: Simone Macciò, Alessandro Carfì, Alessio Capitanelli, Peppino Tropea, Massimo Corbo, Fulvio Mastrogiovanni, Michela Picardi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 2 figures, submitted for review Dec 2024

Abstract:Effective fall risk assessment is critical for post-stroke patients. The present study proposes a novel, data-informed fall risk assessment method based on the instrumented Timed Up and Go (ITUG) test data, bringing in many mobility measures that traditional clinical scales fail to capture. IFRA, which stands for Instrumented Fall Risk Assessment, has been developed using a two-step process: first, features with the highest predictive power among those collected in an ITUG test have been identified using machine learning techniques; then, a strategy is proposed to stratify patients into low, medium, or high-risk strata. The dataset used in our analysis consists of 142 participants, out of which 93 were used for training (15 synthetically generated), 17 for validation and 32 to test the resulting IFRA scale (22 non-fallers and 10 fallers). Features considered in the IFRA scale include gait speed, vertical acceleration during sit-to-walk transition, and turning angular velocity, which align well with established literature on the risk of fall in neurological patients. In a comparison with traditional clinical scales such as the traditional Timed Up and Go and the Mini-BESTest, IFRA demonstrates competitive performance, being the only scale to correctly assign more than half of the fallers to the high-risk stratum (Fisher’s exact test p = 0.004). Despite the dataset’s limited size, this is the first proof-of-concept study to pave the way for future evidence regarding the use of the IFRA tool for continuous patient monitoring and fall prevention both in clinical stroke rehabilitation and at home post-discharge.
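For readers unfamiliar with the statistic quoted in the abstract, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 contingency table, using only the standard library. The counts are made up for illustration and are not the study's data; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so ties are counted despite rounding.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical counts: rows are two groups, columns are two outcomes.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p_value, 4))
```

The test is exact rather than asymptotic, which makes it a natural choice for small samples such as a test set of a few dozen participants.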

[AI-16] MatrixNet: Learning over symmetry groups using learned group representations NEURIPS2024

Link: https://arxiv.org/abs/2501.09571
Authors: Lucas Laird, Circe Hsu, Asilata Bapat, Robin Walters
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Representation Theory (math.RT)
Comments: NeurIPS 2024

Abstract:Group theory has been used in machine learning to provide a theoretically grounded approach for incorporating known symmetry transformations in tasks from robotics to protein modeling. In these applications, equivariant neural networks use known symmetry groups with predefined representations to learn over geometric input data. We propose MatrixNet, a neural network architecture that learns matrix representations of group element inputs instead of using predefined representations. MatrixNet achieves higher sample efficiency and generalization over several standard baselines in prediction tasks over several finite groups and the Artin braid group. We also show that MatrixNet respects group relations, allowing generalization to group elements of greater word length than in the training set.

[AI-17] AI in Support of Diversity and Inclusion

Link: https://arxiv.org/abs/2501.09534
Authors: Çiçek Güven, Afra Alishahi, Henry Brighton, Gonzalo Nápoles, Juan Sebastian Olier, Marie Šafář, Eric Postma, Dimitar Shterionov, Mirella De Sisto, Eva Vanmassenhove
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 2 figures

Abstract:In this paper, we elaborate on how AI can support diversity and inclusion and exemplify research projects conducted in that direction. We start by looking at the challenges and progress in making large language models (LLMs) more transparent, inclusive, and aware of social biases. Even though LLMs like ChatGPT have impressive abilities, they struggle to understand different cultural contexts and engage in meaningful, human-like conversations. A key issue is that biases in language processing, especially in machine translation, can reinforce inequality. Tackling these biases requires a multidisciplinary approach to ensure AI promotes diversity, fairness, and inclusion. We also highlight AI’s role in identifying biased content in media, which is important for improving representation. By detecting unequal portrayals of social groups, AI can help challenge stereotypes and create more inclusive technologies. Transparent AI algorithms, which clearly explain their decisions, are essential for building trust and reducing bias in AI systems. We also stress that AI systems need diverse and inclusive training data. Projects like the Child Growth Monitor show how using a wide range of data can help address real-world problems like malnutrition and poverty. We present a project that demonstrates how AI can be applied to monitor the role of search engines in spreading disinformation about the LGBTQ+ community. Moreover, we discuss the SignON project as an example of how technology can bridge communication gaps between hearing and deaf people, emphasizing the importance of collaboration and mutual trust in developing inclusive AI. Overall, with this paper, we advocate for AI systems that are not only effective but also socially responsible, promoting fair and inclusive interactions between humans and machines.

[AI-18] Class Incremental Fault Diagnosis under Limited Fault Data via Supervised Contrastive Knowledge Distillation

Link: https://arxiv.org/abs/2501.09525
Authors: Hanrong Zhang, Yifei Yao, Zixuan Wang, Jiayuan Su, Mengxuan Li, Peng Peng, Hongwei Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Class-incremental fault diagnosis requires a model to adapt to new fault classes while retaining previous knowledge. However, limited research exists for imbalanced and long-tailed data. Extracting discriminative features from few-shot fault data is challenging, and adding new fault classes often demands costly model retraining. Moreover, incremental training of existing methods risks catastrophic forgetting, and severe class imbalance can bias the model’s decisions toward normal classes. To tackle these issues, we introduce a Supervised Contrastive knowledge distiLlation for class Incremental Fault Diagnosis (SCLIFD) framework proposing supervised contrastive knowledge distillation for improved representation learning capability and less forgetting, a novel prioritized exemplar selection method for sample replay to alleviate catastrophic forgetting, and the Random Forest Classifier to address the class imbalance. Extensive experimentation on simulated and real-world industrial datasets across various imbalance ratios demonstrates the superiority of SCLIFD over existing approaches. Our code can be found at this https URL.

[AI-19] Predicting Air Temperature from Volumetric Urban Morphology with Machine Learning

Link: https://arxiv.org/abs/2501.09469
Authors: Berk Kıvılcım, Patrick Erik Bradley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages, 8 figures, 2 tables

Abstract:In this study, we first introduce a method that converts CityGML data into voxels. It runs efficiently at high resolution on large-scale datasets such as entire cities, at the cost of some building detail, and thereby overcomes the limitations of previous voxelization methods, which are computationally intensive and inefficient at transforming large-scale urban areas into high-resolution voxel representations. The voxelized 3D city data from multiple cities and the corresponding air temperature data are used to develop a machine learning model. Before model training, Gaussian blurring is applied to the input data to account for spatial relationships; as a result, the correlation between air temperature and volumetric building morphology also increases. After model training, the prediction results are evaluated not only with Mean Square Error (MSE) but also with image similarity metrics such as Structural Similarity Index Measure (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS), which can detect and account for spatial relations during evaluation. The trained model can predict the spatial distribution of air temperature using the building volume information of the corresponding pixel as input. By doing so, this research aims to assist urban planners in incorporating environmental parameters into their planning strategies, thereby facilitating more sustainable and inhabitable urban environments.

[AI-20] ADAGE: A generic two-layer framework for adaptive agent based modelling AAMAS

Link: https://arxiv.org/abs/2501.09429
Authors: Benjamin Patrick Evans, Sihan Zeng, Sumitra Ganesh, Leo Ardon
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN); Computational Finance (q-fin.CP)
Comments: Accepted at the 2025 International Conference on Autonomous Agents and Multiagent Systems (AAMAS)

Abstract:Agent-based models (ABMs) are valuable for modelling complex, potentially out-of-equilibria scenarios. However, ABMs have long suffered from the Lucas critique, stating that agent behaviour should adapt to environmental changes. Furthermore, the environment itself often adapts to these behavioural changes, creating a complex bi-level adaptation problem. Recent progress integrating multi-agent reinforcement learning into ABMs introduces adaptive agent behaviour, beginning to address the first part of this critique. However, the approaches are still relatively ad hoc, lacking a general formulation, and furthermore do not tackle the second aspect of simultaneously adapting environment-level characteristics in addition to the agent behaviours. In this work, we develop a generic two-layer framework for ADaptive AGEnt based modelling (ADAGE) for addressing these problems. This framework formalises the bi-level problem as a Stackelberg game with conditional behavioural policies, providing a consolidated framework for adaptive agent-based modelling based on solving a coupled set of non-linear equations. We demonstrate how this generic approach encapsulates several common (previously viewed as distinct) ABM tasks, such as policy design, calibration, scenario generation, and robust behavioural learning under one unified framework. We provide example simulations on multiple complex economic and financial environments, showing the strength of the novel framework under these canonical settings, addressing long-standing critiques of traditional ABMs.

[AI-21] MoE2: Optimizing Collaborative Inference for Edge Large Language Models

Link: https://arxiv.org/abs/2501.09410
Authors: Lyudong Jin, Yanning Zhang, Yanhan Li, Shurong Wang, Howard H. Yang, Jian Wu, Meng Zhang
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE/ACM Transactions on Networking

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. Exploiting the heterogeneous capabilities of edge LLMs is crucial for diverse emerging applications, as it enables greater cost-effectiveness and reduced latency. In this work, we introduce Mixture-of-Edge-Experts (MoE$^2$), a novel collaborative inference framework for edge LLMs. We formulate the joint gating and expert selection problem to optimize inference performance under energy and latency constraints. Unlike conventional MoE problems, LLM expert selection is significantly more challenging due to the combinatorial nature and the heterogeneity of edge LLMs across various attributes. To this end, we propose a two-level expert selection mechanism through which we uncover an optimality-preserving property of gating parameters across expert selections. This property enables the decomposition of the training and selection processes, significantly reducing complexity. Furthermore, we leverage the objective’s monotonicity and design a discrete monotonic optimization algorithm for optimal expert selection. We implement edge servers with NVIDIA Jetson AGX Orins and NVIDIA RTX 4090 GPUs, and perform extensive experiments. Our results validate the performance improvements for various LLMs and show that our MoE$^2$ method can achieve optimal trade-offs among different delay and energy budgets, and outperforms baselines under various system resource constraints.

[AI-22] ELM-DeepONets: Backpropagation-Free Training of Deep Operator Networks via Extreme Learning Machines

Link: https://arxiv.org/abs/2501.09395
Authors: Hwijae Son
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)

Abstract:Deep Operator Networks (DeepONets) are among the most prominent frameworks for operator learning, grounded in the universal approximation theorem for operators. However, training DeepONets typically requires significant computational resources. To address this limitation, we propose ELM-DeepONets, an Extreme Learning Machine (ELM) framework for DeepONets that leverages the backpropagation-free nature of ELM. By reformulating DeepONet training as a least-squares problem for newly introduced parameters, the ELM-DeepONet approach significantly reduces training complexity. Validation on benchmark problems, including nonlinear ODEs and PDEs, demonstrates that the proposed method not only achieves superior accuracy but also drastically reduces computational costs. This work offers a scalable and efficient alternative for operator learning in scientific computing.
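The backpropagation-free idea behind Extreme Learning Machines can be sketched on a much simpler problem than operator learning. The example below is a plain ELM regressor for a scalar function, not the ELM-DeepONet architecture: hidden weights are drawn at random and frozen, and only the output weights are obtained by solving a ridge-regularized least-squares problem. The function names, hidden width, ridge value, and target function are all assumptions chosen for illustration.

```python
import math
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def elm_fit(xs, ys, hidden=10, ridge=1e-3, seed=0):
    """Fit y ~ sum_j beta_j * tanh(w_j * x + c_j) with random, fixed w and c."""
    rng = random.Random(seed)
    w = [rng.uniform(-2.0, 2.0) for _ in range(hidden)]
    c = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    H = [[math.tanh(w[j] * x + c[j]) for j in range(hidden)] for x in xs]
    n = len(xs)
    # Normal equations: (H^T H + ridge * I) beta = H^T y. No gradients needed.
    A = [[sum(H[i][p] * H[i][q] for i in range(n)) + (ridge if p == q else 0.0)
          for q in range(hidden)] for p in range(hidden)]
    rhs = [sum(H[i][p] * ys[i] for i in range(n)) for p in range(hidden)]
    beta = solve_linear(A, rhs)

    def predict(x):
        return sum(beta[j] * math.tanh(w[j] * x + c[j]) for j in range(hidden))

    return predict


# Toy target: sin(x) sampled on [0, pi].
xs = [i * math.pi / 39 for i in range(40)]
ys = [math.sin(x) for x in xs]
predict = elm_fit(xs, ys)
fit_mse = sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
zero_mse = sum(y ** 2 for y in ys) / len(xs)
print(fit_mse < zero_mse)
```

Because training reduces to a single linear solve, the cost is governed by the number of trainable output weights rather than by the number of gradient steps.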

[AI-23] Aligning Instruction Tuning with Pre-training

Link: https://arxiv.org/abs/2501.09368
Authors: Yiming Liang, Tianyu Zheng, Xinrun Du, Ge Zhang, Xingwei Qu, Xiang Yue, Chujie Zheng, Jiaheng Liu, Lei Ma, Wenhu Chen, Guoyin Wang, Zhaoxiang Zhang, Wenhao Huang, Jiajun Zhang
Categories: Artificial Intelligence (cs.AI)

Abstract:Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose Aligning Instruction Tuning with Pre-training (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.

[AI-24] Style4Rec: Enhancing Transformer-based E-commerce Recommendation Systems with Style and Shopping Cart Information

Link: https://arxiv.org/abs/2501.09354
Authors: Berke Ugurlu, Ming-Yi Hong, Che Lin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 images, 4 tables

Abstract:Understanding users’ product preferences is essential to the efficacy of a recommendation system. Precision marketing leverages users’ historical data to discern these preferences and recommends products that align with them. However, recent browsing and purchase records might better reflect current purchasing inclinations. Transformer-based recommendation systems have made strides in sequential recommendation tasks, but they often fall short in utilizing product image style information and shopping cart data effectively. In light of this, we propose Style4Rec, a transformer-based e-commerce recommendation system that harnesses style and shopping cart information to enhance existing transformer-based sequential product recommendation systems. Style4Rec represents a significant step forward in personalized e-commerce recommendations, outperforming benchmarks across various evaluation metrics. Style4Rec resulted in notable improvements: HR@5 increased from 0.681 to 0.735, NDCG@5 increased from 0.594 to 0.674, and MRR@5 increased from 0.559 to 0.654. We tested our model using an e-commerce dataset from our partnering company and found that it exceeded established transformer-based sequential recommendation benchmarks across various evaluation metrics. Thus, Style4Rec presents a significant step forward in personalized e-commerce recommendation systems.
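
For readers unfamiliar with the reported metrics, the sketch below computes HR@k, NDCG@k, and MRR@k under a common sequential-recommendation protocol in which each user has exactly one held-out target item. The abstract does not state the paper's evaluation protocol, so this single-target setup, the function names, and the toy rankings are assumptions.

```python
import math

def rank_metrics_at_k(ranked_items, target, k=5):
    """Return (hit, ndcg, reciprocal_rank) at cutoff k for one user with one target item."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0, 0.0, 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    # With a single relevant item the ideal DCG is 1, so NDCG equals the DCG term.
    return 1.0, 1.0 / math.log2(rank + 1), 1.0 / rank

def mean_metrics(all_rankings, all_targets, k=5):
    """Average the three per-user metrics over all users."""
    rows = [rank_metrics_at_k(r, t, k) for r, t in zip(all_rankings, all_targets)]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

# Toy data (assumption): three users, one held-out item each.
rankings = [
    ["a", "b", "c", "d", "e", "f"],
    ["x", "y", "z", "a", "b", "c"],
    ["q", "r", "s", "t", "u", "v"],
]
targets = ["b", "x", "v"]
hr5, ndcg5, mrr5 = mean_metrics(rankings, targets, k=5)
```

HR@k only checks whether the target appears in the top k, while NDCG@k and MRR@k also reward placing it nearer the top, with different discounts.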

[AI-25] Rational Tuning of LLM Cascades via Probabilistic Modeling

Link: https://arxiv.org/abs/2501.09345
Authors: Michael J. Zellinger, Matt Thomson
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Understanding the reliability of large language models (LLMs) has recently garnered significant attention. Given LLMs’ propensity to hallucinate, as well as their high sensitivity to prompt design, it is already challenging to predict the performance of an individual LLM. However, the problem becomes more complex for compound LLM systems such as cascades, where in addition to each model’s standalone performance, we must understand how the error rates of different models interact. In this paper, we present a probabilistic model for the joint performance distribution of a sequence of LLMs, which enables a framework for rationally tuning the confidence thresholds of a LLM cascade using continuous optimization. Compared to selecting confidence thresholds using grid search, our parametric Markov-copula model significantly improves runtime scaling with respect to the length of the cascade and the desired resolution of the cost-error curve, turning them from intractable into low-order polynomial. In addition, the optimal thresholds computed using our continuous optimization-based algorithm increasingly outperform those found via grid search as cascade length grows, improving the area under the cost-error curve by 1.9% on average for cascades consisting of at least three models. Overall, our Markov-copula model provides a rational basis for tuning LLM cascade performance and points to the potential of probabilistic methods in analyzing LLM systems.
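
To make concrete what the confidence thresholds control, here is a minimal sketch of a cascade: each model answers in turn, and the query is passed to the next, more expensive model only when the reported confidence falls below that model's threshold. It does not implement the paper's Markov-copula model or its optimizer. The stub models, their confidences, and their costs are hypothetical.

```python
def run_cascade(query, models, thresholds):
    """Return (answer, index_of_answering_model, total_cost).

    models: callables mapping query -> (answer, confidence, cost), cheapest first.
    thresholds: one confidence threshold per model except the last.
    """
    total_cost = 0.0
    for i, model in enumerate(models):
        answer, confidence, cost = model(query)
        total_cost += cost
        is_last = i == len(models) - 1
        # Accept the answer if confident enough, or if no model is left to defer to.
        if is_last or confidence >= thresholds[i]:
            return answer, i, total_cost

# Hypothetical stand-ins for real LLM calls.
def small_model(query):
    return "small:" + query, 0.55, 1.0

def large_model(query):
    return "large:" + query, 0.90, 10.0

accepted = run_cascade("q", [small_model, large_model], thresholds=[0.5])
deferred = run_cascade("q", [small_model, large_model], thresholds=[0.8])
```

Raising a threshold sends more queries to the expensive model, trading cost for error. Grid search over such thresholds grows quickly with cascade length, which is the scaling problem the abstract says continuous optimization avoids.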

[AI-26] Neural Honeytrace: A Robust Plug-and-Play Watermarking Framework against Model Extraction Attacks

Link: https://arxiv.org/abs/2501.09328
Authors: Yixiao Xu, Binxing Fang, Rui Wang, Yinghai Zhou, Shouling Ji, Yuan Liu, Mohan Li, Zhihong Tian
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Developing high-performance deep learning models is resource-intensive, leading model owners to utilize Machine Learning as a Service (MLaaS) platforms instead of publicly releasing their models. However, malicious users may exploit query interfaces to execute model extraction attacks, reconstructing the target model’s functionality locally. While prior research has investigated triggerable watermarking techniques for asserting ownership, existing methods face significant challenges: (1) most approaches require additional training, resulting in high overhead and limited flexibility, and (2) they often fail to account for advanced attackers, leaving them vulnerable to adaptive attacks. In this paper, we propose Neural Honeytrace, a robust plug-and-play watermarking framework against model extraction attacks. We first formulate a watermark transmission model from an information-theoretic perspective, providing an interpretable account of the principles and limitations of existing triggerable watermarking. Guided by the model, we further introduce: (1) a similarity-based training-free watermarking method for plug-and-play and flexible watermarking, and (2) a distribution-based multi-step watermark information transmission strategy for robust watermarking. Comprehensive experiments on four datasets demonstrate that Neural Honeytrace outperforms previous methods in efficiency and resisting adaptive attacks. Neural Honeytrace reduces the average number of samples required for a worst-case t-Test-based copyright claim from 12,000 to 200 with zero training cost.

[AI-27] On Learning Informative Trajectory Embeddings for Imitation Classification and Regression AAMAS2025

Link: https://arxiv.org/abs/2501.09327
Authors: Zichang Ge, Changyu Chen, Arunesh Sinha, Pradeep Varakantham
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AAMAS 2025

Abstract:In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at this https URL.

[AI-28] SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs

Link: https://arxiv.org/abs/2501.09316
Authors: Anbang Ye, Qianran Ma, Jia Chen, Muqi Li, Tong Li, Fujiao Liu, Siqi Mai, Meichen Lu, Haitao Bao, Yang You
Categories: Artificial Intelligence (cs.AI)
Comments: 35 pages, 5 figures

Abstract:Despite significant advancements in general-purpose AI agents, several challenges still hinder their practical application in real-world scenarios. First, the limited planning capabilities of Large Language Models (LLM) restrict AI agents from effectively solving complex tasks that require long-horizon planning. Second, general-purpose AI agents struggle to efficiently utilize domain-specific knowledge and human expertise. In this paper, we introduce the Standard Operational Procedure-guided Agent (SOP-agent), a novel framework for constructing domain-specific agents through pseudocode-style Standard Operational Procedures (SOPs) written in natural language. Formally, we represent a SOP as a decision graph, which is traversed to guide the agent in completing tasks specified by the SOP. We conduct extensive experiments across tasks in multiple domains, including decision-making, search and reasoning, code generation, data cleaning, and grounded customer service. The SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks and comparable to domain-specific agent systems. Additionally, we introduce the Grounded Customer Service Benchmark, the first benchmark designed to evaluate the grounded decision-making capabilities of AI agents in customer service scenarios based on SOPs.

[AI-29] LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport ICASSP2025

Link: https://arxiv.org/abs/2501.09291
Authors: Kyeongha Rho, Hyeongkeun Lee, Valentio Iverson, Joon Son Chung
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures; Accepted to ICASSP 2025

Abstract:Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with the optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing. Code is available at this https URL.
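
As a rough illustration of what an optimal transport-based alignment between two feature sets can look like, the sketch below computes an entropy-regularized transport plan with Sinkhorn iterations over a cosine-distance cost and uses the transport cost as a scalar loss. This is generic entropic OT, not LAVCap's actual loss or attention module. The feature shapes, the uniform marginals, the regularization strength, and all names are assumptions.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass over the first feature set
    b = np.full(m, 1.0 / m)   # uniform mass over the second feature set
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        # Alternately rescale columns and rows toward the target marginals.
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

# Random stand-ins for audio and visual token features (assumption).
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(4, 16))
visual_feats = rng.normal(size=(5, 16))
cost = cosine_cost(audio_feats, visual_feats)
plan = sinkhorn_plan(cost)
ot_loss = float(np.sum(plan * cost))
```

The plan acts as a soft assignment between audio and visual tokens. Minimizing a loss of this kind pulls matched features together, and the plan itself could serve as the assignment map that the abstract mentions for the attention module.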

[AI-30] SEAL: Entangled White-box Watermarks on Low-Rank Adaptation

Link: https://arxiv.org/abs/2501.09284
Authors: Giyeong Oh, Seajin Kim, Woohyun Cho, Sangkyu Lee, Jiwan Chung, Dokyung Song, Youngjae Yu
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 26 pages, 16 tables, 9 figures, initial version

Abstract:Recently, LoRA and its variants have become the de facto strategy for training and sharing task-specific versions of large pretrained models, thanks to their efficiency and simplicity. However, the issue of copyright protection for LoRA weights, especially through watermark-based techniques, remains underexplored. To address this gap, we propose SEAL (SEcure wAtermarking on LoRA weights), the universal whitebox watermarking for LoRA. SEAL embeds a secret, non-trainable matrix between trainable LoRA weights, serving as a passport to claim ownership. SEAL then entangles the passport with the LoRA weights through training, without extra loss for entanglement, and distributes the finetuned weights after hiding the passport. When applying SEAL, we observed no performance degradation across commonsense reasoning, textual/visual instruction tuning, and text-to-image synthesis tasks. We demonstrate that SEAL is robust against a variety of known attacks: removal, obfuscation, and ambiguity attacks.
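
The structural idea of placing a secret, non-trainable matrix between the two LoRA factors can be sketched as follows. The low-rank update becomes `B @ C @ A` instead of `B @ A`, where `C` is the passport. The abstract does not say how the passport is hidden before distribution, so folding it into one factor, as done here, is only one possible approach and is an assumption, as are the shapes and variable names. No training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2

w_frozen = rng.normal(size=(d_out, d_in))  # pretrained weight, kept frozen
lora_b = rng.normal(size=(d_out, r))       # trainable up-projection
lora_a = rng.normal(size=(r, d_in))        # trainable down-projection
passport = rng.normal(size=(r, r))         # secret matrix, never updated by training

def forward(x):
    """Adapted layer: the low-rank update passes through the passport."""
    return x @ (w_frozen + lora_b @ passport @ lora_a).T

# Assumed hiding step: fold the passport into one factor so that the released
# weights look like an ordinary LoRA pair of the same rank.
released_b = lora_b @ passport
x = rng.normal(size=(3, d_in))
same = bool(np.allclose(forward(x), x @ (w_frozen + released_b @ lora_a).T))
```

Since the released pair reproduces the same outputs, the adapter behaves normally for users, while only the owner holds the passport needed to claim ownership.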

[AI-31] Text Semantics to Flexible Design: A Residential Layout Generation Method Based on Stable Diffusion Model

Link: https://arxiv.org/abs/2501.09279
Authors: Zijin Qiu, Jiepeng Liu, Yi Xia, Hongtuo Qi, Pengkun Liu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Flexibility in the AI-based residential layout design remains a significant challenge, as traditional methods like rule-based heuristics and graph-based generation often lack flexibility and require substantial design knowledge from users. To address these limitations, we propose a cross-modal design approach based on the Stable Diffusion model for generating flexible residential layouts. The method offers multiple input types for learning objectives, allowing users to specify both boundaries and layouts. It incorporates natural language as design constraints and introduces ControlNet to enable stable layout generation through two distinct pathways. We also present a scheme that encapsulates design expertise within a knowledge graph and translates it into natural language, providing an interpretable representation of design knowledge. This comprehensibility and diversity of input options enable professionals and non-professionals to directly express design requirements, enhancing flexibility and controllability. Finally, experiments verify the flexibility of the proposed methods under multimodal constraints better than state-of-the-art models, even when specific semantic information about room areas or connections is incomplete.

[AI-32] Large Language Model is Secretly a Protein Sequence Optimizer

Link: https://arxiv.org/abs/2501.09274
Authors: Yinkai Wang, Jiaxing He, Yuanqi Du, Xiaohui Chen, Jianan Canal Li, Li-Ping Liu, Xiaolin Xu, Soha Hassoun
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: Preprint

Abstract:We consider the protein sequence engineering problem, which aims to find protein sequences with high fitness levels, starting from a given wild-type sequence. Directed evolution has been a dominating paradigm in this field which has an iterative process to generate variants and select via experimental feedback. We demonstrate large language models (LLMs), despite being trained on massive texts, are secretly protein sequence optimizers. With a directed evolutionary method, LLM can perform protein engineering through Pareto and experiment-budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes.

[AI-33] Clone-Robust AI Alignment

Link: https://arxiv.org/abs/2501.09254
Authors: Ariel D. Procaccia, Benjamin Schiffer, Shirley Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF are not necessarily balanced in the types of questions and answers that are included. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.

[AI-34] AI-based Identity Fraud Detection: A Systematic Review

Link: https://arxiv.org/abs/2501.09239
Authors: Chuo Jun Zhang, Asif Q. Gill, Bo Liu, Memoona J. Anwar
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid development of digital services, a large volume of personally identifiable information (PII) is stored online and is subject to cyberattacks such as Identity fraud. Most recently, the use of Artificial Intelligence (AI) enabled deep fake technologies has significantly increased the complexity of identity fraud. Fraudsters may use these technologies to create highly sophisticated counterfeit personal identification documents, photos and videos. These advancements in the identity fraud landscape pose challenges for identity fraud detection and society at large. There is a pressing need to review and understand identity fraud detection methods, their limitations and potential solutions. This research aims to address this important need by using the well-known systematic literature review method. This paper reviewed a selected set of 43 papers across 4 major academic literature databases. In particular, the review results highlight the two types of identity fraud prevention and detection methods, in-depth and open challenges. The results were also consolidated into a taxonomy of AI-based identity fraud detection and prevention methods including key insights and trends. Overall, this paper provides a foundational knowledge base to researchers and practitioners for further research and development in this important area of digital identity fraud.

[AI-35] Guiding Retrieval using LLM-based Listwise Rankers

Link: https://arxiv.org/abs/2501.09186
Authors: Mandeep Rathee, Sean MacAvaney, Avishek Anand
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 16 pages, 2 figures, 3 tables

Abstract:Large Language Models (LLMs) have shown strong promise as rerankers, especially in "listwise" settings where an LLM is prompted to rerank several search results at once. However, this "cascading" retrieve-and-rerank approach is limited by the bounded recall problem: relevant documents not retrieved initially are permanently excluded from the final ranking. Adaptive retrieval techniques address this problem, but do not work with listwise rerankers because they assume a document’s score is computed independently from other documents. In this paper, we propose an adaptation of an existing adaptive retrieval method that supports the listwise setting and helps guide the retrieval process itself (thereby overcoming the bounded recall problem for LLM rerankers). Specifically, our proposed algorithm merges results both from the initial ranking and feedback documents provided by the most relevant documents seen up to that point. Through extensive experiments across diverse LLM rerankers, first stage retrievers, and feedback sources, we demonstrate that our method can improve nDCG@10 by up to 13.23% and recall by 28.02%, all while keeping the total number of LLM inferences constant and overheads due to the adaptive process minimal. The work opens the door to leveraging LLM-based search in settings where the initial pool of results is limited, e.g., by legacy systems, or by the cost of deploying a semantic first-stage.

[AI-36] A Blockchain-Enabled Approach to Cross-Border Compliance and Trust

Link: https://arxiv.org/abs/2501.09182
Authors: Vikram Kulothungan
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments: This is a preprint of a paper that has been accepted for publication at the 2024 IEEE International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications

Abstract:As artificial intelligence (AI) systems become increasingly integral to critical infrastructure and global operations, the need for a unified, trustworthy governance framework is more urgent than ever. This paper proposes a novel approach to AI governance, utilizing blockchain and distributed ledger technologies (DLT) to establish a decentralized, globally recognized framework that ensures security, privacy, and trustworthiness of AI systems across borders. The paper presents specific implementation scenarios within the financial sector, outlines a phased deployment timeline over the next decade, and addresses potential challenges with solutions grounded in current research. By synthesizing advancements in blockchain, AI ethics, and cybersecurity, this paper offers a comprehensive roadmap for a decentralized AI governance framework capable of adapting to the complex and evolving landscape of global AI regulation.

[AI-37] Attention is All You Need Until You Need Retention

Link: https://arxiv.org/abs/2501.09166
Authors: M. Murat Yaslioglu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work introduces a novel Retention Layer mechanism for Transformer based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real time data population, dynamic recall, and guided output generation. This enhancement allows models to store, update, and reuse observed patterns across sessions, enabling incremental learning and bridging the gap between static pretraining and dynamic, context sensitive adaptation. The Retention Layer design parallels social learning processes, encompassing attention, retention, reproduction, and motivation stages. Technically, it integrates a memory attention mechanism and episodic buffers to manage memory scalability, mitigate overfitting, and ensure efficient recall. Applications span adaptive personal assistants, real time fraud detection, autonomous robotics, content moderation, and healthcare diagnostics. In each domain, the retention mechanism enables systems to learn incrementally, personalize outputs, and respond to evolving real world challenges effectively. By emulating key aspects of human learning, this retention enhanced architecture fosters a more fluid and responsive AI paradigm, paving the way for dynamic, session aware models that extend the capabilities of traditional Transformers into domains requiring continual adaptation.

[AI-38] Towards Understanding Extrapolation: a Causal Lens NEURIPS2024

Link: https://arxiv.org/abs/2501.09163
Authors: Lingjing Kong, Guangyi Chen, Petar Stojanov, Haoxuan Li, Eric P. Xing, Kun Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: NeurIPS 2024

Abstract:Canonical work handling distribution shifts typically necessitates an entire target distribution that lands inside the training distribution. However, practical scenarios often involve only a handful of target samples, potentially lying outside the training support, which requires the capability of extrapolation. In this work, we aim to provide a theoretical understanding of when extrapolation is possible and offer principled methods to achieve it without requiring an on-support target distribution. To this end, we formulate the extrapolation problem with a latent-variable model that embodies the minimal change principle in causal mechanisms. Under this formulation, we cast the extrapolation problem into a latent-variable identification problem. We provide realistic conditions on shift properties and the estimation objectives that lead to identification even when only one off-support target sample is available, tackling the most challenging scenarios. Our theory reveals the intricate interplay between the underlying manifold’s smoothness and the shift properties. We showcase how our theoretical results inform the design of practical adaptation algorithms. Through experiments on both synthetic and real-world data, we validate our theoretical findings and their practical implications.

[AI-39] AutoLoop: Fast Visual SLAM Fine-tuning through Agentic Curriculum Learning

Link: https://arxiv.org/abs/2501.09160
Authors: Assaf Lahiany, Oren Gal
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Current visual SLAM systems face significant challenges in balancing computational efficiency with robust loop closure handling. Traditional approaches require careful manual tuning and incur substantial computational overhead, while learning-based methods either lack explicit loop closure capabilities or implement them through computationally expensive methods. We present AutoLoop, a novel approach that combines automated curriculum learning with efficient fine-tuning for visual SLAM systems. Our method employs a DDPG (Deep Deterministic Policy Gradient) agent to dynamically adjust loop closure weights during training, eliminating the need for manual hyperparameter search while significantly reducing the required training steps. The approach pre-computes potential loop closure pairs offline and leverages them through an agent-guided curriculum, allowing the model to adapt efficiently to new scenarios. Experiments conducted on TartanAir for training and validated across multiple benchmarks including KITTI, EuRoC, ICL-NUIM and TUM RGB-D demonstrate that AutoLoop achieves comparable or superior performance while reducing training time by an order of magnitude compared to traditional approaches. AutoLoop provides a practical solution for rapid adaptation of visual SLAM systems, automating the weight tuning process that traditionally requires multiple manual iterations. Our results show that this automated curriculum strategy not only accelerates training but also maintains or improves the model’s performance across diverse environmental conditions.

[AI-40] A Non-autoregressive Model for Joint STT and TTS

Link: https://arxiv.org/abs/2501.09104
Authors: Vishal Sunder, Brian Kingsbury, George Saon, Samuel Thomas, Slava Shechtman, Hagai Aronowitz, Eric Fosler-Lussier, Luis Lastras
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 3 figures, 3 tables

Abstract:In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.

[AI-41] Tracking the Takes and Trajectories of English-Language News Narratives across Trustworthy and Worrisome Websites USENIX-SECURITY

Link: https://arxiv.org/abs/2501.09102
Authors: Hans W. A. Hanley, Emily Okabe, Zakir Durumeric
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: To appear at USENIX Security Symposium 2025. Keywords: Misinformation, News, Narratives, LLMs, Stance-Detection

Abstract:Understanding how misleading and outright false information enters news ecosystems remains a difficult challenge that requires tracking how narratives spread across thousands of fringe and mainstream news websites. To do this, we introduce a system that utilizes encoder-based large language models and zero-shot stance detection to scalably identify and track news narratives and their attitudes across over 4,000 factually unreliable, mixed-reliability, and factually reliable English-language news websites. Running our system over an 18 month period, we track the spread of 146K news stories. Using network-based inference via the NETINF algorithm, we show that the paths of news narratives and the stances of websites toward particular entities can be used to uncover slanted propaganda networks (e.g., anti-vaccine and anti-Ukraine) and to identify the most influential websites in spreading these attitudes in the broader news ecosystem. We hope that increased visibility into our distributed news ecosystem can help with the reporting and fact-checking of propaganda and disinformation.

[AI-42] Inferring Transition Dynamics from Value Functions AAAI-25

Link: https://arxiv.org/abs/2501.09081
Authors: Jacob Adamczyk
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the AAAI-25 8th Workshop on Generalization in Planning

Abstract:In reinforcement learning, the value function is typically trained to solve the Bellman equation, which connects the current value to future values. This temporal dependency hints that the value function may contain implicit information about the environment’s transition dynamics. By rearranging the Bellman equation, we show that a converged value function encodes a model of the underlying dynamics of the environment. We build on this insight to propose a simple method for inferring dynamics models directly from the value function, potentially mitigating the need for explicit model learning. Furthermore, we explore the challenges of next-state identifiability, discussing conditions under which the inferred dynamics model is well-defined. Our work provides a theoretical foundation for leveraging value functions in dynamics modeling and opens a new avenue for bridging model-free and model-based reinforcement learning.
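
As a worked illustration of the rearrangement, assume deterministic transitions, a known reward $r(s,a)$, and a discount factor $\gamma \in (0,1)$. These assumptions and the notation are chosen here for illustration and are not taken from the paper. The Bellman equation for a converged action-value function $Q$ with state-value function $V$ then fixes the value of the next state $s'$:

```latex
Q(s,a) = r(s,a) + \gamma\, V(s')
\quad\Longrightarrow\quad
V(s') = \frac{Q(s,a) - r(s,a)}{\gamma}
```

If $V$ takes distinct values on distinct states, $s'$ can be recovered as the state whose value matches the right-hand side. When several states share the same value, the next state is not uniquely determined, which is the identifiability question the abstract raises.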

[AI-43] Average-Reward Reinforcement Learning with Entropy Regularization AAAI-25

Link: https://arxiv.org/abs/2501.09080
Authors: Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, Rahul V. Kulkarni
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the AAAI-25 Eighth Workshop on Bridging the Gap Between AI Planning and Reinforcement Learning (PRL)

Abstract:The average-reward formulation of reinforcement learning (RL) has drawn increased interest in recent years due to its ability to solve temporally-extended problems without discounting. Independently, RL algorithms have benefited from entropy-regularization: an approach used to make the optimal policy stochastic, thereby more robust to noise. Despite the distinct benefits of the two approaches, the combination of entropy regularization with an average-reward objective is not well-studied in the literature and there has been limited development of algorithms for this setting. To address this gap in the field, we develop algorithms for solving entropy-regularized average-reward RL problems with function approximation. We experimentally validate our method, comparing it with existing algorithms on standard benchmarks for RL.

[AI-44] Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models

Link: https://arxiv.org/abs/2501.09039
Authors: Abdulkadir Erol, Trilok Padhi, Agnik Saha, Ugur Kursuncu, Mehmet Emin Aktas
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:The rapid advancement of Large Vision-Language Models (LVLMs) has enhanced capabilities offering potential applications from content creation to productivity enhancement. Despite their innovative potential, LVLMs exhibit vulnerabilities, especially in generating potentially toxic or unsafe responses. Malicious actors can exploit these vulnerabilities to propagate toxic content in an automated (or semi-) manner, leveraging the susceptibility of LVLMs to deception via strategically crafted prompts without fine-tuning or compute-intensive procedures. Despite the red-teaming efforts and inherent potential risks associated with the LVLMs, exploring vulnerabilities of LVLMs remains nascent and yet to be fully addressed in a systematic manner. This study systematically examines the vulnerabilities of open-source LVLMs, including LLaVA, InstructBLIP, Fuyu, and Qwen, using adversarial prompt strategies that simulate real-world social manipulation tactics informed by social theories. Our findings show that (i) toxicity and insulting are the most prevalent behaviors, with the mean rates of 16.13% and 9.75%, respectively; (ii) Qwen-VL-Chat, LLaVA-v1.6-Vicuna-7b, and InstructBLIP-Vicuna-7b are the most vulnerable models, exhibiting toxic response rates of 21.50%, 18.30% and 17.90%, and insulting responses of 13.40%, 11.70% and 10.10%, respectively; (iii) prompting strategies incorporating dark humor and multimodal toxic prompt completion significantly elevated these vulnerabilities. Despite being fine-tuned for safety, these models still generate content with varying degrees of toxicity when prompted with adversarial inputs, highlighting the urgent need for enhanced safety mechanisms and robust guardrails in LVLM development.

[AI-45] Synthetic Data and Health Privacy

Link: https://arxiv.org/abs/2501.09031
Authors: Gwénolé Abgrall, Xavier Monnet, Anmol Arora
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: JAMA Cardiology, 2024

Abstract:This Viewpoint discusses generative artificial intelligence and safeguarding privacy by using synthetic data as a substitute for private health data.

[AI-46] Enhancing Data Integrity through Provenance Tracking in Semantic Web Frameworks

Link: https://arxiv.org/abs/2501.09029
Authors: Nilesh Jain
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This 10-page manuscript with 5 figures focuses on leveraging Semantic Web frameworks to enhance data integrity through provenance tracking. Intended for conference submission, it aligns with the cs.AI category, addressing knowledge representation, data modeling, and uncertainty in AI using advanced tools like PROV-DM and PROV-O

Abstract: This paper explores the integration of provenance tracking systems within the context of Semantic Web technologies to enhance data integrity in diverse operational environments. SURROUND Australia Pty Ltd demonstrates innovative applications of the PROV Data Model (PROV-DM) and its Semantic Web variant, PROV-O, to systematically record and manage provenance information across multiple data processing domains. By employing RDF and Knowledge Graphs, SURROUND addresses the critical challenges of shared entity identification and provenance granularity. The paper highlights the company’s architecture for capturing comprehensive provenance data, enabling robust validation, traceability, and knowledge inference. Through the examination of two projects, we illustrate how provenance mechanisms not only improve data reliability but also facilitate seamless integration across heterogeneous systems. Our findings underscore the importance of sophisticated provenance solutions in maintaining data integrity, serving as a reference for industry peers and academics engaged in provenance research and implementation.

[AI-47] Intelligent Anti-Money Laundering Solution Based upon Novel Community Detection in Massive Transaction Networks on Spark

Link: https://arxiv.org/abs/2501.09026
Authors: Xurui Li, Xiang Cao, Xuetao Qiu, Jintao Zhao, Jianbin Zheng
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Criminals are using every means available to launder the profits from their illegal activities into ostensibly legitimate assets. Meanwhile, most commercial anti-money laundering systems are still rule-based, which cannot adapt to the ever-changing tricks. Although some machine learning methods have been proposed, they are mainly focused on the perspective of abnormal behavior for single accounts. Considering money laundering activities are often involved in gang criminals, these methods are still not intelligent enough to crack down on criminal gangs all-sidedly. In this paper, a systematic solution is presented to find suspicious money laundering gangs. A temporal-directed Louvain algorithm has been proposed to detect communities according to relevant anti-money laundering patterns. All processes are implemented and optimized on Spark platform. This solution can greatly improve the efficiency of anti-money laundering work for financial regulation agencies.

[AI-48] Cyber Shadows: Neutralizing Security Threats with AI and Targeted Policy Measures

Link: https://arxiv.org/abs/2501.09025
Authors: Marc Schmitt, Pantelis Koutroumpis
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
Comments: Forthcoming in IEEE Transactions on Artificial Intelligence

Abstract:The digital age, driven by the AI revolution, brings significant opportunities but also conceals security threats, which we refer to as cyber shadows. These threats pose risks at individual, organizational, and societal levels. This paper examines the systemic impact of these cyber threats and proposes a comprehensive cybersecurity strategy that integrates AI-driven solutions, such as Intrusion Detection Systems (IDS), with targeted policy interventions. By combining technological and regulatory measures, we create a multilevel defense capable of addressing both direct threats and indirect negative externalities. We emphasize that the synergy between AI-driven solutions and policy interventions is essential for neutralizing cyber threats and mitigating their negative impact on the digital economy. Finally, we underscore the need for continuous adaptation of these strategies, especially in response to the rapid advancement of autonomous AI-driven attacks, to ensure the creation of secure and resilient digital ecosystems.

[AI-49] Navigating Ethical Challenges in Generative AI-Enhanced Research: The ETHICAL Framework for Responsible Generative AI Use

Link: https://arxiv.org/abs/2501.09021
Authors: Douglas Eacersall, Lynette Pretorius, Ivan Smirnov, Erika Spray, Sam Illingworth, Ritesh Chugh, Sonja Strydom, Dianne Stratton-Maher, Jonathan Simmons, Isaac Jennings, Rian Roux, Ruth Kamrowski, Abigail Downie, Chee Ling Thong, Katharine A. Howell
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 28 pages, 1 figure

Abstract:The rapid adoption of generative artificial intelligence (GenAI) in research presents both opportunities and ethical challenges that should be carefully navigated. Although GenAI tools can enhance research efficiency through automation of tasks such as literature review and data analysis, their use raises concerns about aspects such as data accuracy, privacy, bias, and research integrity. This paper develops the ETHICAL framework, which is a practical guide for responsible GenAI use in research. Employing a constructivist case study examining multiple GenAI tools in real research contexts, the framework consists of seven key principles: Examine policies and guidelines, Think about social impacts, Harness understanding of the technology, Indicate use, Critically engage with outputs, Access secure versions, and Look at user agreements. Applying these principles will enable researchers to uphold research integrity while leveraging GenAI benefits. The framework addresses a critical gap between awareness of ethical issues and practical action steps, providing researchers with concrete guidance for ethical GenAI integration. This work has implications for research practice, institutional policy development, and the broader academic community while adapting to an AI-enhanced research landscape. The ETHICAL framework can serve as a foundation for developing AI literacy in academic settings and promoting responsible innovation in research methodologies.

[AI-50] Incorporating Quantum Advantage in Quantum Circuit Generation through Genetic Programming

Link: https://arxiv.org/abs/2501.09682
Authors: Christoph Stein, Michael Färber
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)

Abstract: Designing efficient quantum circuits that leverage quantum advantage compared to classical computing has become increasingly critical. Genetic algorithms have shown potential in generating such circuits through artificial evolution. However, integrating quantum advantage into the fitness function of these algorithms remains unexplored. In this paper, we aim to enhance the efficiency of quantum circuit design by proposing two novel approaches for incorporating quantum advantage metrics into the fitness function of genetic algorithms. We evaluate our approaches based on the Bernstein-Vazirani Problem and the Unstructured Database Search Problem as test cases. The results demonstrate that our approaches not only improve the convergence speed of the genetic algorithm but also produce circuits comparable to expert-designed solutions. Our findings suggest that automated quantum circuit design using genetic algorithms that incorporate a measure of quantum advantage is a promising approach to accelerating the development of quantum algorithms.

[AI-51] Quantum-Enhanced Transformers for Robust Acoustic Scene Classification in IoT Environments

Link: https://arxiv.org/abs/2501.09394
Authors: Minh K. Quan, Mayuri Wijayasundara, Sujeeva Setunge, Pubudu N. Pathirana
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF); Sound (cs.SD)
Comments: 5 pages, 4 figures

Abstract:The proliferation of Internet of Things (IoT) devices equipped with acoustic sensors necessitates robust acoustic scene classification (ASC) capabilities, even in noisy and data-limited environments. Traditional machine learning methods often struggle to generalize effectively under such conditions. To address this, we introduce Q-ASC, a novel Quantum-Inspired Acoustic Scene Classifier that leverages the power of quantum-inspired transformers. By integrating quantum concepts like superposition and entanglement, Q-ASC achieves superior feature learning and enhanced noise resilience compared to classical models. Furthermore, we introduce a Quantum Variational Autoencoder (QVAE) based data augmentation technique to mitigate the challenge of limited labeled data in IoT deployments. Extensive evaluations on the Tampere University of Technology (TUT) Acoustic Scenes 2016 benchmark dataset demonstrate that Q-ASC achieves remarkable accuracy between 68.3% and 88.5% under challenging conditions, outperforming state-of-the-art methods by over 5% in the best case. This research paves the way for deploying intelligent acoustic sensing in IoT networks, with potential applications in smart homes, industrial monitoring, and environmental surveillance, even in adverse acoustic environments.

[AI-52] Interpretable Droplet Digital PCR Assay for Trustworthy Molecular Diagnostics

Link: https://arxiv.org/abs/2501.09218
Authors: Yuanyuan Wei, Yucheng Wu, Fuyang Qu, Yao Mu, Yi-Ping Ho, Ho-Pui Ho, Wu Yuan, Mingkun Xu
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)

Abstract: Accurate molecular quantification is essential for advancing research and diagnostics in fields such as infectious diseases, cancer biology, and genetic disorders. Droplet digital PCR (ddPCR) has emerged as a gold standard for achieving absolute quantification. While computational ddPCR technologies have advanced significantly, achieving automatic interpretation and consistent adaptability across diverse operational environments remains a challenge. To address these limitations, we introduce the intelligent interpretable droplet digital PCR (I2ddPCR) assay, a comprehensive framework integrating front-end predictive models (for droplet segmentation and classification) with GPT-4o multimodal large language model (MLLM, for context-aware explanations and recommendations) to automate and enhance ddPCR image analysis. This approach surpasses the state-of-the-art models, affording 99.05% accuracy in processing complex ddPCR images containing over 300 droplets per image with varying signal-to-noise ratios (SNRs). By combining specialized neural networks and large language models, the I2ddPCR assay offers a robust and adaptable solution for absolute molecular quantification, achieving a sensitivity capable of detecting low-abundance targets as low as 90.32 copies/μL. Furthermore, it improves model’s transparency through detailed explanation and troubleshooting guidance, empowering users to make informed decisions. This innovative framework has the potential to benefit molecular diagnostics, disease research, and clinical applications, especially in resource-constrained settings.

[AI-53] Mantis Shrimp: Exploring Photometric Band Utilization in Computer Vision Networks for Photometric Redshift Estimation

Link: https://arxiv.org/abs/2501.09112
Authors: Andrew Engel, Nell Byler, Adam Tsou, Gautham Narayan, Emmanuel Bonilla, Ian Smith
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)

Abstract: We present Mantis Shrimp, a multi-survey deep learning model for photometric redshift estimation that fuses ultra-violet (GALEX), optical (PanSTARRS), and infrared (UnWISE) imagery. Machine learning is now an established approach for photometric redshift estimation, with generally acknowledged higher performance in areas with a high density of spectroscopically identified galaxies over template-based methods. Multiple works have shown that image-based convolutional neural networks can outperform tabular-based color/magnitude models. In comparison to tabular models, image models have additional design complexities: it is largely unknown how to fuse inputs from different instruments which have different resolutions or noise properties. The Mantis Shrimp model estimates the conditional density estimate of redshift using cutout images. The density estimates are well calibrated and the point estimates perform well in the distribution of available spectroscopically confirmed galaxies with bias (1e-2), scatter (NMAD = 2.44e-2), and catastrophic outlier rate ($\eta = 17.53\%$). We find that early fusion approaches (e.g., resampling and stacking images from different instruments) match the performance of late fusion approaches (e.g., concatenating latent space representations), so that the design choice ultimately is left to the user. Finally, we study how the models learn to use information across bands, finding evidence that our models successfully incorporate information from all surveys. The applicability of our model to the analysis of large populations of galaxies is limited by the speed of downloading cutouts from external servers; however, our model could be useful in smaller studies such as generating priors over redshift for stellar population synthesis.
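
The three point-estimate metrics quoted in the abstract (bias, NMAD scatter, catastrophic outlier rate) can be illustrated with a short sketch. The abstract does not give the exact definitions, so the ones below are assumptions based on common usage in the photometric redshift literature: residuals are normalized by (1 + z_spec), bias is taken as the mean residual, and the outlier threshold of 0.15 is a hypothetical default that may differ from the paper's.

```python
import statistics


def photoz_metrics(z_phot, z_spec, outlier_threshold=0.15):
    """Return (bias, nmad, outlier_rate) for paired redshift estimates.

    All three definitions are assumed conventions, not taken from the paper.
    """
    # Normalized residuals: dz = (z_phot - z_spec) / (1 + z_spec)
    dz = [(p - s) / (1.0 + s) for p, s in zip(z_phot, z_spec)]
    # Bias: mean of the normalized residuals (some works use the median)
    bias = statistics.fmean(dz)
    # NMAD: 1.4826 times the median absolute deviation of the residuals
    med = statistics.median(dz)
    nmad = 1.4826 * statistics.median([abs(d - med) for d in dz])
    # Outlier rate: fraction of residuals larger than the threshold
    outlier_rate = sum(abs(d) > outlier_threshold for d in dz) / len(dz)
    return bias, nmad, outlier_rate


if __name__ == "__main__":
    print(photoz_metrics([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 1.0, 3.0]))
```

NMAD is usually preferred over a plain standard deviation here because the median-based estimate is not dominated by the catastrophic outliers that the third metric counts separately.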

Machine Learning

[LG-0] FAST: Efficient Action Tokenization for Vision-Language-Action Models

Link: https://arxiv.org/abs/2501.09747
Authors: Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Website: this https URL

Abstract:Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
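
To make the idea of compression-based tokenization with the discrete cosine transform more concrete, here is a minimal sketch for a single action dimension. It is not the FAST tokenizer itself: the abstract does not spell out the full pipeline, so this shows only a DCT followed by scale-and-round quantization, and the function names and the `scale` parameter are hypothetical.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(signal))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct(coeffs):
    """Inverse of dct above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k, c in enumerate(coeffs) if k > 0)
        out.append(s)
    return out


def encode(actions, scale=10.0):
    """Map a chunk of continuous actions to integer frequency-space tokens."""
    return [round(c * scale) for c in dct(actions)]


def decode(tokens, scale=10.0):
    """Approximately recover the continuous actions from the integer tokens."""
    return idct([t / scale for t in tokens])


if __name__ == "__main__":
    chunk = [1.0] * 8          # a smooth (here constant) action trajectory
    tokens = encode(chunk)
    print(tokens)              # energy concentrates in the first coefficient
    print(decode(tokens))
```

For smooth, high-frequency-sampled trajectories most coefficients round to zero, which is what makes the frequency-space representation more compact than per-timestep binning.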

[LG-1] Generating particle physics Lagrangians with transformers

Link: https://arxiv.org/abs/2501.09729
Authors: Yong Sheng Koay, Rikard Enberg, Stefano Moretti, Eliel Camargo-Molina
Categories: Machine Learning (cs.LG); Symbolic Computation (cs.SC); High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Theory (hep-th)
Comments: 32 pages, 11 figures, 18 tables

Abstract: In physics, Lagrangians provide a systematic way to describe laws governing physical systems. In the context of particle physics, they encode the interactions and behavior of the fundamental building blocks of our universe. By treating Lagrangians as complex, rule-based constructs similar to linguistic expressions, we trained a transformer model – proven to be effective in natural language tasks – to predict the Lagrangian corresponding to a given list of particles. We report on the transformer’s performance in constructing Lagrangians respecting the Standard Model $\mathrm{SU}(3)\times \mathrm{SU}(2)\times \mathrm{U}(1)$ gauge symmetries. The resulting model is shown to achieve high accuracies (over 90%) with Lagrangians up to six matter fields, with the capacity to generalize beyond the training distribution, albeit within architectural constraints. We show through an analysis of input embeddings that the model has internalized concepts such as group representations and conjugation operations as it learned to generate Lagrangians. We make the model and training datasets available to the community. An interactive demonstration can be found at: this https URL.

[LG-2] A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise

Link: https://arxiv.org/abs/2501.09691
Authors: Ilias Diakonikolas, Nikos Zarifis
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract: We study the problem of PAC learning $\gamma$-margin halfspaces in the presence of Massart noise. Without computational considerations, the sample complexity of this learning problem is known to be $\widetilde{\Theta}(1/(\gamma^2 \epsilon))$. Prior computationally efficient algorithms for the problem incur sample complexity $\tilde{O}(1/(\gamma^4 \epsilon^3))$ and achieve 0-1 error of $\eta+\epsilon$, where $\eta<1/2$ is the upper bound on the noise rate. Recent work gave evidence of an information-computation tradeoff, suggesting that a quadratic dependence on $1/\epsilon$ is required for computationally efficient algorithms. Our main result is a computationally efficient learner with sample complexity $\widetilde{\Theta}(1/(\gamma^2 \epsilon^2))$, nearly matching this lower bound. In addition, our algorithm is simple and practical, relying on online SGD on a carefully selected sequence of convex losses.
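
The sample-complexity bounds quoted in the abstract can be compared directly. The short calculation below ignores logarithmic factors and uses only the three rates stated above; the numerical values of $\gamma$ and $\epsilon$ are illustrative choices, not taken from the paper.

```latex
% Improvement of the new efficient learner over prior efficient algorithms:
\[
  \frac{1/(\gamma^{4}\epsilon^{3})}{1/(\gamma^{2}\epsilon^{2})}
  = \frac{1}{\gamma^{2}\epsilon}
\]
% Remaining gap to the information-theoretic rate (no computational limits):
\[
  \frac{1/(\gamma^{2}\epsilon^{2})}{1/(\gamma^{2}\epsilon)}
  = \frac{1}{\epsilon}
\]
% Illustrative values: \gamma = 0.1, \epsilon = 0.01
\[
  \frac{1}{\gamma^{2}\epsilon} = \frac{1}{0.01 \cdot 0.01} = 10^{4},
  \qquad
  \frac{1}{\epsilon} = 100
\]
```

Under these illustrative values the new bound needs roughly $10^{4}$ times fewer samples than the earlier efficient algorithms, while the remaining factor of $1/\epsilon$ is the gap that the abstract attributes to the conjectured information-computation tradeoff.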

[LG-3] U-Fair: Uncertainty-based Multimodal Multitask Learning for Fairer Depression Detection

Link: https://arxiv.org/abs/2501.09687
Authors: Jiaee Cheong, Aditya Bangar, Sinan Kalkan, Hatice Gunes
Categories: Machine Learning (cs.LG)
Comments: To appear at the Proceedings of Machine Learning Research 259, 1-14, 2024 as part of the Machine Learning for Health (ML4H) Symposium 2024

Abstract: Machine learning bias in mental health is becoming an increasingly pertinent challenge. Despite promising efforts indicating that multitask approaches often work better than unitask approaches, there is minimal work investigating the impact of multitask learning on performance and fairness in depression detection nor leveraged it to achieve fairer prediction outcomes. In this work, we undertake a systematic investigation of using a multitask approach to improve performance and fairness for depression detection. We propose a novel gender-based task-reweighting method using uncertainty grounded in how the PHQ-8 questionnaire is structured. Our results indicate that, although a multitask approach improves performance and fairness compared to a unitask approach, the results are not always consistent and we see evidence of negative transfer and a reduction in the Pareto frontier, which is concerning given the high-stake healthcare setting. Our proposed approach of gender-based reweighting with uncertainty improves performance and fairness and alleviates both challenges to a certain extent. Our findings on each PHQ-8 subitem task difficulty are also in agreement with the largest study conducted on the PHQ-8 subitem discrimination capacity, thus providing the very first tangible evidence linking ML findings with large-scale empirical population studies conducted on the PHQ-8.

[LG-4] Fokker-Planck to Callan-Symanzik: evolution of weight matrices under training

Link: https://arxiv.org/abs/2501.09659
Authors: Wei Bu, Uri Kol, Ziming Liu
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 9 figures

Abstract:The dynamical evolution of a neural network during training has been an incredibly fascinating subject of study. First principal derivation of generic evolution of variables in statistical physics systems has proved useful when used to describe training dynamics conceptually, which in practice means numerically solving equations such as Fokker-Planck equation. Simulating entire networks inevitably runs into the curse of dimensionality. In this paper, we utilize Fokker-Planck to simulate the probability density evolution of individual weight matrices in the bottleneck layers of a simple 2-bottleneck-layered auto-encoder and compare the theoretical evolutions against the empirical ones by examining the output data distributions. We also derive physically relevant partial differential equations such as Callan-Symanzik and Kardar-Parisi-Zhang equations from the dynamical equation we have.

[LG-5] A Survey of Research in Large Language Models for Electronic Design Automation

Link: https://arxiv.org/abs/2501.09655
Authors: Jingyu Pan, Guanglei Zhou, Chen-Chia Chang, Isaac Jacobson, Jiang Hu, Yiran Chen
Categories: Machine Learning (cs.LG)
Comments: 21 pages, 2 figures, 3 tables, accepted by TODAES

Abstract:Within the rapidly evolving domain of Electronic Design Automation (EDA), Large Language Models (LLMs) have emerged as transformative technologies, offering unprecedented capabilities for optimizing and automating various aspects of electronic design. This survey provides a comprehensive exploration of LLM applications in EDA, focusing on advancements in model architectures, the implications of varying model sizes, and innovative customization techniques that enable tailored analytical insights. By examining the intersection of LLM capabilities and EDA requirements, the paper highlights the significant impact these models have on extracting nuanced understandings from complex datasets. Furthermore, it addresses the challenges and opportunities in integrating LLMs into EDA workflows, paving the way for future research and application in this dynamic field. Through this detailed analysis, the survey aims to offer valuable insights to professionals in the EDA industry, AI researchers, and anyone interested in the convergence of advanced AI technologies and electronic design.

[LG-6] LLM-Based Routing in Mixture of Experts: A Novel Framework for Trading

Link: https://arxiv.org/abs/2501.09636
Authors: Kuan-Ming Liu (1), Ming-Chih Lo (2) ((1) National Chengchi University, College of Commerce, (2) National Yang Ming Chiao Tung University, College of Computer Science)
Categories: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Comments: Accepted by AAAI 2025 Workshop on AI for Social Impact - Bridging Innovations in Finance, Social Media, and Crime Prevention

Abstract:Recent advances in deep learning and large language models (LLMs) have facilitated the deployment of the mixture-of-experts (MoE) mechanism in the stock investment domain. While these models have demonstrated promising trading performance, they are often unimodal, neglecting the wealth of information available in other modalities, such as textual data. Moreover, the traditional neural network-based router selection mechanism fails to consider contextual and real-world nuances, resulting in suboptimal expert selection. To address these limitations, we propose LLMoE, a novel framework that employs LLMs as the router within the MoE architecture. Specifically, we replace the conventional neural network-based router with LLMs, leveraging their extensive world knowledge and reasoning capabilities to select experts based on historical price data and stock news. This approach provides a more effective and interpretable selection mechanism. Our experiments on multimodal real-world stock datasets demonstrate that LLMoE outperforms state-of-the-art MoE models and other deep neural network approaches. Additionally, the flexible architecture of LLMoE allows for easy adaptation to various downstream tasks.

[LG-7] Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework

Link: https://arxiv.org/abs/2501.09631
Authors: Yushen Lin, Ruichen Zhang, Wenqi Huang, Kaidi Wang, Zhiguo Ding, Daniel K. C. So, Dusit Niyato
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 13 figures, journal

Abstract:In this work, we develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) specifically for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. By utilizing advanced language models for entity extraction and question generation, rigorous data curation processes are employed to maintain high quality and relevance. Additionally, we introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data with 2.24% and 1.31% performance boost for different models compared to baselines, respectively. To demonstrate the effectiveness of the fine-tuned models with the proposed methodologies on practical tasks, we also consider different tasks, including summarizing optimization problems from technical papers and solving the mathematical problems related to non-orthogonal multiple access (NOMA), which are generated by using the proposed multi-agent framework. Simulation results show significant performance gain in summarization tasks with 20.9% in the ROUGE-L metrics. We also study the scaling laws of fine-tuning LLMs and the challenges LLMs face in the field of wireless communications, offering insights into their adaptation to wireless communication tasks. This dataset and fine-tuning methodology aim to enhance the training and evaluation of LLMs, contributing to advancements in LLMs for wireless communication research and applications.
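
The abstract mentions Pointwise V-Information (PVI) without restating how it is computed. As a rough illustration, the sketch below uses the definition commonly given in the literature: the log-probability a model assigns to the correct label when it sees the input, minus the log-probability assigned by a model trained on a null (empty) input. Whether the paper uses exactly this form is an assumption, and the function and parameter names are hypothetical.

```python
import math


def pvi(p_label_given_input, p_label_given_null):
    """Pointwise V-information in bits for one (input, label) example.

    p_label_given_input: probability of the gold label from the model
        fine-tuned on real inputs.
    p_label_given_null: probability of the gold label from the model
        fine-tuned on null inputs.
    """
    return math.log2(p_label_given_input) - math.log2(p_label_given_null)


if __name__ == "__main__":
    # Seeing the input doubles the probability of the gold label: 1 bit.
    print(pvi(0.5, 0.25))
```

A positive value means the input makes the label easier to predict, while a value near zero or below suggests the example carries little usable information for that model family, which is why the quantity is used to rank training data.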

[LG-8] Weight for Robustness: A Comprehensive Approach towards Optimal Fault-Tolerant Asynchronous ML

Link: https://arxiv.org/abs/2501.09621
Authors: Tehila Dahan, Kfir Y. Levy
Categories: Machine Learning (cs.LG)

Abstract:We address the challenges of Byzantine-robust training in asynchronous distributed machine learning systems, aiming to enhance efficiency amid massive parallelization and heterogeneous computing resources. Asynchronous systems, marked by independently operating workers and intermittent updates, uniquely struggle with maintaining integrity against Byzantine failures, which encompass malicious or erroneous actions that disrupt learning. The inherent delays in such settings not only introduce additional bias to the system but also obscure the disruptions caused by Byzantine faults. To tackle these issues, we adapt the Byzantine framework to asynchronous dynamics by introducing a novel weighted robust aggregation framework. This allows for the extension of robust aggregators and a recent meta-aggregator to their weighted versions, mitigating the effects of delayed updates. By further incorporating a recent variance-reduction technique, we achieve an optimal convergence rate for the first time in an asynchronous Byzantine environment. Our methodology is rigorously validated through empirical and theoretical analysis, demonstrating its effectiveness in enhancing fault tolerance and optimizing performance in asynchronous ML systems.
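
As a rough illustration of what a weighted robust aggregator can look like, the sketch below implements a weighted coordinate-wise median. The abstract does not say which aggregators the paper extends or how the weights are chosen, so both the choice of aggregator and the per-worker weights here are assumptions made for the example.

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    total = sum(weights)
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= total / 2.0:
            return value
    raise ValueError("values must be non-empty and weights positive")


def weighted_cwmed(updates, weights):
    """Apply the weighted median independently to every coordinate."""
    dim = len(updates[0])
    return [weighted_median([u[j] for u in updates], weights)
            for j in range(dim)]


if __name__ == "__main__":
    # Two honest workers and one Byzantine worker with a wild update.
    updates = [[1.0, 1.0], [1.2, 0.8], [100.0, -100.0]]
    weights = [1.0, 1.0, 0.5]  # hypothetical per-worker weights
    print(weighted_cwmed(updates, weights))
```

The outlying update cannot drag the result far as long as the faulty workers hold less than half of the total weight, which is the property that makes median-style rules attractive compared with a weighted mean.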

[LG-9] ARMAX identification of low rank graphical models

Link: https://arxiv.org/abs/2501.09616
Authors: Wenqi Cao, Aming Li
Categories: Machine Learning (cs.LG)

Abstract:In large-scale systems, complex internal relationships are often present. Such interconnected systems can be effectively described by low rank stochastic processes. When identifying a predictive model of low rank processes from sampling data, the rank-deficient property of spectral densities is often obscured by the inevitable measurement noise in practice. However, existing low rank identification approaches often did not take noise into explicit consideration, leading to non-negligible inaccuracies even under weak noise. In this paper, we address the identification issue of low rank processes under measurement noise. We find that the noisy measurement model admits a sparse plus low rank structure in latent-variable graphical models. Specifically, we first decompose the problem into a maximum entropy covariance extension problem, and a low rank graphical estimation problem based on an autoregressive moving-average with exogenous input (ARMAX) model. To identify the ARMAX low rank graphical models, we propose an estimation approach based on maximum likelihood. The identifiability and consistency of this approach are proven under certain conditions. Simulation results confirm the reliable performance of the entire algorithm in both the parameter estimation and noisy data filtering.

[LG-10] EVaDE: Event-Based Variational Thompson Sampling for Model-Based Reinforcement Learning

Link: https://arxiv.org/abs/2501.09611
Authors: Siddharth Aravindan, Dixant Mittal, Wee Sun Lee
Categories: Machine Learning (cs.LG)

Abstract:Posterior Sampling for Reinforcement Learning (PSRL) is a well-known algorithm that augments model-based reinforcement learning (MBRL) algorithms with Thompson sampling. PSRL maintains posterior distributions of the environment transition dynamics and the reward function, which are intractable for tasks with high-dimensional state and action spaces. Recent works show that dropout, used in conjunction with neural networks, induces variational distributions that can approximate these posteriors. In this paper, we propose Event-based Variational Distributions for Exploration (EVaDE), which are variational distributions that are useful for MBRL, especially when the underlying domain is object-based. We leverage the general domain knowledge of object-based domains to design three types of event-based convolutional layers to direct exploration. These layers rely on Gaussian dropouts and are inserted between the layers of the deep neural network model to help facilitate variational Thompson sampling. We empirically show the effectiveness of EVaDE-equipped Simulated Policy Learning (EVaDE-SimPLe) on the 100K Atari game suite.
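
The abstract says the event-based layers rely on Gaussian dropout. A minimal sketch of plain multiplicative Gaussian dropout is given below to show the mechanism. It does not reproduce the three event-based convolutional layers from the paper, and the `alpha` variance parameter and function name are hypothetical.

```python
import random


def gaussian_dropout(activations, alpha=0.25, rng=None):
    """Multiply each activation by noise drawn from N(1, alpha).

    Each fresh draw of the noise can be read as sampling one model from a
    variational distribution, which is the role dropout plays in
    variational Thompson sampling.
    """
    rng = rng or random.Random()
    std = alpha ** 0.5
    return [a * rng.gauss(1.0, std) for a in activations]


if __name__ == "__main__":
    sample_rng = random.Random(0)
    # Two draws give two different perturbed versions of the same features.
    print(gaussian_dropout([1.0, 2.0, 3.0], rng=sample_rng))
    print(gaussian_dropout([1.0, 2.0, 3.0], rng=sample_rng))
```

In a Thompson-sampling style loop, one noise realization would be drawn and held fixed while planning or acting, then redrawn, so that exploration follows a single sampled model at a time.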

[LG-11] Adversarial-Ensemble Kolmogorov Arnold Networks for Enhancing Indoor Wi-Fi Positioning: A Defensive Approach Against Spoofing and Signal Manipulation Attacks

Link: https://arxiv.org/abs/2501.09609
Authors: Mitul Goswami, Romit Chatterjee, Somnath Mahato, Prasant Kumar Pattnaik
Categories: Machine Learning (cs.LG)

Abstract:The research presents a study on enhancing the robustness of Wi-Fi-based indoor positioning systems against adversarial attacks. The goal is to improve the positioning accuracy and resilience of these systems under two attack scenarios: Wi-Fi Spoofing and Signal Strength Manipulation. Three models are developed and evaluated: a baseline model (M_Base), an adversarially trained robust model (M_Rob), and an ensemble model (M_Ens). All models utilize a Kolmogorov-Arnold Network (KAN) architecture. The robust model is trained with adversarially perturbed data, while the ensemble model combines predictions from both the base and robust models. Experimental results show that the robust model reduces positioning error by approximately 10% compared to the baseline, achieving 2.03 meters error under Wi-Fi spoofing and 2.00 meters under signal strength manipulation. The ensemble model further outperforms with errors of 2.01 meters and 1.975 meters for the respective attack types. This analysis highlights the effectiveness of adversarial training techniques in mitigating attack impacts. The findings underscore the importance of considering adversarial scenarios in developing indoor positioning systems, as improved resilience can significantly enhance the accuracy and reliability of such systems in mission-critical environments.

[LG-12] Metrics for Inter-Dataset Similarity with Example Applications in Synthetic Data and Feature Selection Evaluation - Extended Version SDM

Link: https://arxiv.org/abs/2501.09591
Authors: Muhammad Rajabinasab,Anton D. Lautrup,Arthur Zimek
Categories: Machine Learning (cs.LG)
Comments: This is the extended version of a paper accepted at 2025 SIAM International Conference on Data Mining (SDM)

Abstract:Measuring inter-dataset similarity is an important task in machine learning and data mining with various use cases and applications. Existing methods for measuring inter-dataset similarity are computationally expensive, limited, or sensitive to different entities and non-trivial choices for parameters. They also lack a holistic perspective on the entire dataset. In this paper, we propose two novel metrics for measuring inter-dataset similarity. We discuss the mathematical foundation and the theoretical basis of our proposed metrics. We demonstrate the effectiveness of the proposed metrics by investigating two applications in the evaluation of synthetic data and in the evaluation of feature selection methods. The theoretical and empirical studies conducted in this paper illustrate the effectiveness of the proposed metrics.

[LG-13] Atleus: Accelerating Transformers on the Edge Enabled by 3D Heterogeneous Manycore Architectures

Link: https://arxiv.org/abs/2501.09588
Authors: Pratyush Dhingra,Janardhan Rao Doppa,Partha Pratim Pande
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: Accepted for Publication in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)

Abstract:Transformer architectures have become the standard neural network model for various machine learning applications including natural language processing and computer vision. However, the compute and memory requirements introduced by transformer models make them challenging to adopt for edge applications. Furthermore, fine-tuning pre-trained transformers (e.g., foundation models) is a common task to enhance the model’s predictive performance on specific tasks/applications. Existing transformer accelerators are oblivious to complexities introduced by fine-tuning. In this paper, we propose the design of a three-dimensional (3D) heterogeneous architecture referred to as Atleus that incorporates heterogeneous computing resources specifically optimized to accelerate transformer models for the dual purposes of fine-tuning and inference. Specifically, Atleus utilizes non-volatile memory and a systolic array for accelerating transformer computational kernels using an integrated 3D platform. Moreover, we design a suitable NoC to achieve high performance and energy efficiency. Finally, Atleus adopts an effective quantization scheme to support model compression. Experimental results demonstrate that Atleus outperforms the existing state of the art by up to 56x and 64.5x in terms of performance and energy efficiency, respectively.

[LG-14] Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization

Link: https://arxiv.org/abs/2501.09556
Authors: Jakub Kopal,Michal Gregor,Santiago de Leon-Martinez,Jakub Simko
Categories: Machine Learning (cs.LG)

Abstract:Overshoot is a novel, momentum-based stochastic gradient descent optimization method designed to enhance performance beyond standard and Nesterov’s momentum. In conventional momentum methods, gradients from previous steps are aggregated with the gradient at current model weights before taking a step and updating the model. Rather than calculating gradient at the current model weights, Overshoot calculates the gradient at model weights shifted in the direction of the current momentum. This sacrifices the immediate benefit of using the gradient w.r.t. the exact model weights now, in favor of evaluating at a point, which will likely be more relevant for future updates. We show that incorporating this principle into momentum-based optimizers (SGD with momentum and Adam) results in faster convergence (saving on average at least 15% of steps). Overshoot consistently outperforms both standard and Nesterov’s momentum across a wide range of tasks and integrates into popular momentum-based optimizers with zero memory and small computational overhead.
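The idea described in the abstract above, taking the gradient at weights shifted along the current momentum rather than at the current weights, can be sketched as follows. This is a minimal illustration only: the function name `sgd_momentum_lookahead`, the `overshoot` factor, and the exact shift rule are assumptions made for this sketch and are not taken from the paper. Setting `overshoot=0.0` recovers classical heavy-ball momentum.

```python
import numpy as np

def sgd_momentum_lookahead(grad_fn, w0, lr=0.1, beta=0.9, overshoot=0.0, steps=100):
    """Momentum SGD whose gradient is evaluated at momentum-shifted weights.

    The shift rule below is an illustrative assumption, not the paper's
    exact update. overshoot=0.0 gives classical (heavy-ball) momentum.
    """
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    for _ in range(steps):
        # Shift the evaluation point in the direction the momentum will move the weights.
        w_shifted = w - overshoot * lr * m
        g = grad_fn(w_shifted)
        # Standard momentum accumulation and parameter step.
        m = beta * m + g
        w = w - lr * m
    return w
```

On a simple quadratic such as f(w) = 0.5 * sum(a * w**2), both settings drive the weights toward zero, which makes it easy to compare how the shifted evaluation point changes the trajectory.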

[LG-15] Intra-day Solar and Power Forecast for Optimization of Intraday Market Participation

Link: https://arxiv.org/abs/2501.09551
Authors: Nelson Salazar-Peña,Adolfo Palma-Vergara,Mateo Montes,María Alejandra Vargas-Torres,Adriana Salinas,Andrés Velasco,Alejandra Tabares,Andrés González-Mancera
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: 20 pages, 37 figures, 9 tables

Abstract:The prediction of solar irradiance enhances reliability in photovoltaic (PV) solar plant generation and grid integration. In Colombia, PV plants face penalties if energy production deviates beyond governmental thresholds from intraday market offers. This research employs Long Short-Term Memory (LSTM) and Bidirectional-LSTM (Bi-LSTM) models, utilizing meteorological data from a PV plant in El Paso, Cesar, Colombia, to predict solar irradiance with a 6-hour horizon and 10-minute resolution. While Bi-LSTM showed superior performance, the LSTM model achieved comparable results with significantly reduced training time (6 hours versus 18 hours), making it computationally advantageous. The LSTM predictions were averaged to create an hourly resolution model, evaluated using Mean Absolute Error, Root-Mean-Square Error, Normalized Root-Mean-Square Error, and Mean Absolute Percentage Error metrics. Comparison with the Global Forecast System (GFS) revealed similar performance, with both models effectively capturing daily solar irradiance patterns. The forecast model integrates with an Object-Oriented power production model, enabling accurate energy offers in the intraday market while minimizing penalty costs.
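The evaluation pipeline mentioned in the abstract above (averaging 10-minute predictions to hourly resolution, then scoring with MAE, RMSE, NRMSE, and MAPE) can be sketched as below. The function names are hypothetical, and two choices are assumptions of this sketch rather than details from the paper: NRMSE is normalised by the observed range, and zero observations (for example night-time irradiance) are masked out of MAPE.

```python
import numpy as np

def hourly_mean(values_10min):
    """Average consecutive groups of six 10-minute values into hourly values."""
    v = np.asarray(values_10min, dtype=float)
    if v.size % 6 != 0:
        raise ValueError("expected a multiple of six 10-minute samples")
    return v.reshape(-1, 6).mean(axis=1)

def forecast_errors(y_true, y_pred):
    """Return MAE, RMSE, NRMSE (range-normalised) and MAPE (in percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # Assumption: normalise by the observed range; mean normalisation is also common.
    nrmse = rmse / (y_true.max() - y_true.min())
    # MAPE is undefined where the observation is zero, so those samples are skipped.
    mask = y_true != 0
    mape = 100.0 * np.mean(np.abs(err[mask] / y_true[mask]))
    return {"mae": mae, "rmse": rmse, "nrmse": nrmse, "mape": mape}
```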

[LG-16] MOGNET: A Mux-residual quantized Network leveraging Online-Generated weights

Link: https://arxiv.org/abs/2501.09531
Authors: Van Thien Nguyen,William Guicquero,Gilles Sicard
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: Published at IEEE AICAS 2022

Abstract:This paper presents a compact model architecture called MOGNET, compatible with resource-limited hardware. MOGNET uses a streamlined convolutional factorization block based on a combination of two point-wise (1x1) convolutions with a group-wise convolution in between. To further limit the overall model size and reduce the required on-chip memory, the parameters of the second point-wise convolution are generated online by a Cellular Automaton structure. In addition, MOGNET enables the use of low-precision weights and activations by taking advantage of a Multiplexer mechanism with a proper Bitshift rescaling for integrating residual paths without increasing the hardware-related complexity. To efficiently train this model, we also introduce a novel weight ternarization method favoring the balance between quantized levels. Experimental results show that, given a tiny memory budget (sub-2Mb), MOGNET can achieve higher accuracy, with a clear gap of up to 1%, at a similar or even lower model size compared to recent state-of-the-art methods.

[LG-17] Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

Link: https://arxiv.org/abs/2501.09522
Authors: Anke Tang,Enneng Yang,Li Shen,Yong Luo,Han Hu,Bo Du,Dacheng Tao
Categories: Machine Learning (cs.LG)

Abstract:Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their specialized capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight interpolation-based methods being the predominant approaches. However, these conventional approaches are not well-suited for scenarios where models become available sequentially, and they often suffer from high memory requirements and potential interference between tasks. In this study, we propose a training-free projection-based continual merging method that processes models sequentially through orthogonal projections of weight matrices and adaptive scaling mechanisms. Our method operates by projecting new parameter updates onto subspaces orthogonal to existing merged parameter updates while using an adaptive scaling mechanism to maintain stable parameter distances, enabling efficient sequential integration of task-specific knowledge. Our approach maintains constant memory complexity with respect to the number of models, minimizes interference between tasks through orthogonal projections, and retains the performance of previously merged models through adaptive task vector scaling. Extensive experiments on CLIP-ViT models demonstrate that our method achieves a 5-8% average accuracy improvement while maintaining robust performance in different task orderings.
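The sequential projection step described in the abstract above can be illustrated with a deliberately simplified sketch. Assumptions of this sketch: each parameter update is treated as one flat vector (the paper works with weight matrices), the projection removes only the component along the current merged update, and the adaptive scaling mechanism is omitted. The names `project_out` and `merge_sequentially` are hypothetical.

```python
import numpy as np

def project_out(new_update, merged_update, eps=1e-12):
    """Remove from new_update its component along merged_update."""
    denom = float(np.dot(merged_update, merged_update))
    if denom < eps:
        # Nothing merged yet: there is no direction to project away.
        return new_update.copy()
    coeff = np.dot(new_update, merged_update) / denom
    return new_update - coeff * merged_update

def merge_sequentially(task_updates):
    """Fold updates in one at a time; only the running merged vector is stored."""
    merged = np.zeros_like(np.asarray(task_updates[0], dtype=float))
    for update in task_updates:
        update = np.asarray(update, dtype=float)
        merged = merged + project_out(update, merged)
    return merged
```

Because only `merged` is kept between steps, memory use does not grow with the number of models, which mirrors the constant-memory property the abstract claims.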

[LG-18] Multimodal Marvels of Deep Learning in Medical Diagnosis: A Comprehensive Review of COVID-19 Detection

Link: https://arxiv.org/abs/2501.09506
Authors: Md Shofiqul Islam,Khondokar Fida Hasan,Hasibul Hossain Shajeeb,Humayan Kabir Rana,Md Saifur Rahman,Md Munirul Hasan,AKM Azad,Ibrahim Abdullah,Mohammad Ali Moni
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Comments: 43 pages

[LG-19] Utilizing AI Language Models to Identify Prognostic Factors for Coronary Artery Disease: A Study in Mashhad Residents

Link: https://arxiv.org/abs/2501.09480
Authors: Bami Zahra,Behnampour Nasser,Doosti Hassan,Ghayour Mobarhan Majid
Categories: Machine Learning (cs.LG)

Abstract:Background: Understanding cardiovascular artery disease risk factors, the leading global cause of mortality, is crucial for influencing its etiology, prevalence, and treatment. This study aims to evaluate prognostic markers for coronary artery disease in Mashhad using Naive Bayes, REP Tree, J48, CART, and CHAID algorithms. Methods: Using data from the 2009 MASHAD STUDY, prognostic factors for coronary artery disease were determined with Naive Bayes, REP Tree, J48, CART, CHAID, and Random Forest algorithms using R 3.5.3 and WEKA 3.9.4. Model efficiency was compared by sensitivity, specificity, and accuracy. Cases were patients with coronary artery disease; each had three controls (940 in total). Results: Prognostic factors for coronary artery disease in Mashhad residents varied by algorithm. CHAID identified age, myocardial infarction history, and hypertension. CART included depression score and physical activity. REP added education level and anxiety score. NB included diabetes and family history. J48 highlighted father’s heart disease and weight loss. CHAID had the highest accuracy (0.80). Conclusion: Key prognostic factors for coronary artery disease in CART and CHAID models include age, myocardial infarction history, hypertension, depression score, physical activity, and BMI. NB, REP Tree, and J48 identified numerous factors. CHAID had the highest accuracy, sensitivity, and specificity. CART offers simpler interpretation, aiding physician and paramedic model selection based on specific. Keywords: RF, Naïve Bayes, REP, J48 algorithms, Coronary Artery Disease (CAD).
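The three comparison measures named in the abstract above (sensitivity, specificity, accuracy) follow directly from confusion-matrix counts. The short sketch below shows the standard definitions; the function name is hypothetical, and any counts passed to it are purely illustrative since the abstract does not report a confusion matrix.

```python
def classification_rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true cases that were detected
    specificity = tn / (tn + fp)   # share of controls correctly ruled out
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy
```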

[LG-20] Pruning for Sparse Diffusion Models based on Gradient Flow ICASSP2025

Link: https://arxiv.org/abs/2501.09464
Authors: Ben Wan,Tianyi Zheng,Zhaoyu Chen,Yuxiao Wang,Jia Wang
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 1 figure, accepted by ICASSP2025

Abstract:Diffusion Models (DMs) have impressive capabilities among generation models, but suffer from slow inference and high computational costs. Previous works utilize one-shot structure pruning to derive lightweight DMs from pre-trained ones, but this approach often leads to a significant drop in generation quality and may result in the removal of crucial weights. Thus we propose an iterative pruning method based on gradient flow, including the gradient flow pruning process and the gradient flow pruning criterion. We employ a progressive soft pruning strategy to maintain the continuity of the mask matrix and guide it along the gradient flow of the energy function based on the pruning criterion in sparse space, thereby avoiding the sudden information loss typically caused by one-shot pruning. The gradient-flow-based criterion prunes parameters whose removal increases the gradient norm of the loss function, which enables fast convergence of the pruned model during the iterative pruning stage. Our extensive experiments on widely used datasets demonstrate that our method achieves superior performance in efficiency and consistency with pre-trained models.

[LG-21] Teaching Wav2Vec2 the Language of the Brain ICASSP2025

Link: https://arxiv.org/abs/2501.09459
Authors: Tobias Fiedler,Leon Hermann,Florian Müller,Sarel Cohen,Peter Chin,Tobias Friedrich,Eilon Vaadia
Categories: Machine Learning (cs.LG)
Comments: Paper was submitted to ICASSP 2025 but marginally rejected

Abstract:The decoding of continuously spoken speech from neuronal activity has the potential to become an important clinical solution for paralyzed patients. Deep Learning Brain Computer Interfaces (BCIs) have recently successfully mapped neuronal activity to text contents in subjects who attempted to formulate speech. However, only small BCI datasets are available. In contrast, labeled data and pre-trained models for the closely related task of speech recognition from audio are widely available. One such model is Wav2Vec2, which has been trained in a self-supervised fashion to create meaningful representations of speech audio data. In this study, we show that patterns learned by Wav2Vec2 are transferable to brain data. Specifically, we replace its audio feature extractor with an untrained Brain Feature Extractor (BFE) model. We then compare three settings, each for 45 different BFE architectures: full fine-tuning with pre-trained weights for Wav2Vec2, training “from scratch” without pre-trained weights, and freezing a pre-trained Wav2Vec2 while training only the BFE. Across these experiments, the best run is from full fine-tuning with pre-trained weights, achieving a Character Error Rate (CER) of 18.54%, outperforming the best training-from-scratch run by 20.46 percentage points and the best frozen Wav2Vec2 run by 15.92 percentage points. These results indicate that knowledge transfer from audio speech recognition to brain decoding is possible and significantly improves brain decoding performance for the same architectures. Related source code is available at this https URL.
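The Character Error Rate reported in the abstract above is conventionally computed as the character-level edit (Levenshtein) distance between the decoded text and the reference, divided by the reference length. A minimal sketch of that standard definition follows; the function names are hypothetical and the paper's own evaluation code may differ in details such as text normalisation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """Edit distance normalised by the number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```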

[LG-22] FASP: Fast and Accurate Structured Pruning of Large Language Models

Link: https://arxiv.org/abs/2501.09412
Authors: Hanyu Hu,Pengxiang Zhao,Ping Li,Yi Zheng,Zhefeng Wang,Xiaoming Yuan
Categories: Machine Learning (cs.LG)

Abstract:The rapid increase in the size of large language models (LLMs) has significantly escalated their computational and memory demands, posing challenges for efficient deployment, especially on resource-constrained devices. Structured pruning has emerged as an effective model compression method that can reduce these demands while preserving performance. In this paper, we introduce FASP (Fast and Accurate Structured Pruning), a novel structured pruning framework for LLMs that emphasizes both speed and accuracy. FASP employs a distinctive pruning structure that interlinks sequential layers, allowing for the removal of columns in one layer while simultaneously eliminating corresponding rows in the preceding layer without incurring additional performance loss. The pruning metric, inspired by Wanda, is computationally efficient and effectively selects components to prune. Additionally, we propose a restoration mechanism that enhances model fidelity by adjusting the remaining weights post-pruning. We evaluate FASP on the OPT and LLaMA model families, demonstrating superior performance in terms of perplexity and accuracy on downstream tasks compared to state-of-the-art methods. Our approach achieves significant speed-ups, pruning models such as OPT-125M in 17 seconds and LLaMA-30B in 15 minutes on a single NVIDIA RTX 4090 GPU, making it a highly practical solution for optimizing LLMs.
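The coupled pruning structure described in the abstract above (dropping columns in one layer together with the matching rows of the preceding layer) can be illustrated on a toy two-layer ReLU network. This is a sketch under stated assumptions: the scoring rule is a Wanda-style |weight| times activation-norm product summed per column, which is one plausible aggregation rather than the paper's exact metric, and the restoration step is omitted. `prune_hidden_units` is a hypothetical name.

```python
import numpy as np

def prune_hidden_units(W1, W2, H, keep):
    """Prune hidden units of y = W2 @ relu(W1 @ x).

    W1: (hidden, d_in), W2: (d_out, hidden),
    H:  (n_samples, hidden) calibration activations of the hidden layer.
    """
    act_norm = np.linalg.norm(H, axis=0)            # activation norm per hidden unit
    scores = (np.abs(W2) * act_norm).sum(axis=0)    # Wanda-style score, summed per column
    kept = np.sort(np.argsort(scores)[-keep:])      # indices of the highest-scoring units
    # Removing column j of W2 makes hidden unit j unused, so row j of W1 can go too.
    return W1[kept, :], W2[:, kept], kept
```

Once a column of the second layer is gone, the corresponding hidden unit no longer influences the output, so removing its row from the first layer changes nothing further. That is why the row removal comes at no additional loss.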

[LG-23] Fast Searching of Extreme Operating Conditions for Relay Protection Setting Calculation Based on Graph Neural Network and Reinforcement Learning

Link: https://arxiv.org/abs/2501.09399
Authors: Yan Li,Jingyu Wang,Jiankang Zhang,Huaiqiang Li,Longfei Ren,Yinhong Li,Dongyuan Shi,Xianzhong Duan
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 9 figures

Abstract:Searching for the Extreme Operating Conditions (EOCs) is one of the core problems of power system relay protection setting calculation. The current methods based on brute-force search, heuristic algorithms, and mathematical programming can hardly meet the requirements of today’s power systems in terms of computation speed due to the drastic changes in operating conditions induced by renewables and power electronics. This paper proposes an EOC fast search method, named Graph Dueling Double Deep Q Network (Graph D3QN), which combines graph neural network and deep reinforcement learning to address this challenge. First, the EOC search problem is modeled as a Markov decision process, where the information of the underlying power system is extracted using graph neural networks, so that the EOC of the system can be found via deep reinforcement learning. Then, a two-stage Guided Learning and Free Exploration (GLFE) training framework is constructed to accelerate the convergence speed of reinforcement learning. Finally, the proposed Graph D3QN method is validated through case studies of searching maximum fault current for relay protection setting calculation on the IEEE 39-bus and 118-bus systems. The experimental results demonstrate that Graph D3QN can reduce the computation time by 10 to 1000 times while guaranteeing the accuracy of the selected EOCs.
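The "Dueling Double Deep Q Network" named in the abstract above combines two well-known ingredients, sketched below in their generic textbook form. The graph neural network encoder and the two-stage GLFE training schedule are not modelled here, and the function names are hypothetical.

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return float(reward)
    best_action = int(np.argmax(q_online_next))
    return float(reward + gamma * q_target_next[best_action])
```

Decoupling action selection from action evaluation is what reduces the over-estimation bias of plain Q-learning, while the dueling form lets the state value be learned even for actions that are rarely taken.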

[LG-24] PAL: Prompting Analytic Learning with Missing Modality for Multi-Modal Class-Incremental Learning

Link: https://arxiv.org/abs/2501.09352
Authors: Xianghu Yue,Yiming Chen,Xueyi Zhang,Xiaoxue Gao,Mengling Feng,Mingrui Lao,Huiping Zhuang,Haizhou Li
Categories: Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Abstract:Multi-modal class-incremental learning (MMCIL) seeks to leverage multi-modal data, such as audio-visual and image-text pairs, thereby enabling models to learn continuously across a sequence of tasks while mitigating forgetting. While existing studies primarily focus on the integration and utilization of multi-modal information for MMCIL, a critical challenge remains: the issue of missing modalities during incremental learning phases. This oversight can exacerbate severe forgetting and significantly impair model performance. To bridge this gap, we propose PAL, a novel exemplar-free framework tailored to MMCIL under missing-modality scenarios. Concretely, we devise modality-specific prompts to compensate for missing information, facilitating the model to maintain a holistic representation of the data. On this foundation, we reformulate the MMCIL problem into a Recursive Least-Squares task, delivering an analytical linear solution. Building upon these, PAL not only alleviates the inherent under-fitting limitation in analytic learning but also preserves the holistic representation of missing-modality data, achieving superior performance with less forgetting across various multi-modal incremental scenarios. Extensive experiments demonstrate that PAL significantly outperforms competitive methods across various datasets, including UPMC-Food101 and N24News, showcasing its robustness towards modality absence and its anti-forgetting ability to maintain high incremental accuracy.
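The Recursive Least-Squares reformulation mentioned in the abstract above rests on a standard identity: a ridge-regression solution can be updated batch by batch, without revisiting old data, and still match the solution obtained from all data at once. The sketch below shows that generic recursion only. The class name `RecursiveRidge` is hypothetical, the output dimension is kept fixed (a class-incremental system would grow it), and the prompts and feature extractors of PAL are not modelled.

```python
import numpy as np

class RecursiveRidge:
    """Ridge regression updated batch by batch without storing past data."""

    def __init__(self, dim_in, dim_out, gamma=1.0):
        # R tracks (X^T X + gamma * I)^{-1}; with no data seen it is I / gamma.
        self.R = np.eye(dim_in) / gamma
        self.W = np.zeros((dim_in, dim_out))

    def update(self, X, Y):
        # Woodbury identity: refresh R using only the new batch.
        K = np.linalg.inv(np.eye(X.shape[0]) + X @ self.R @ X.T)
        self.R = self.R - self.R @ X.T @ K @ X @ self.R
        # Correct the weights by the residual of the new batch.
        self.W = self.W + self.R @ X.T @ (Y - X @ self.W)
        return self.W
```

Because the recursive result equals the joint solution, earlier phases are not overwritten by later ones, which is the property analytic-learning approaches rely on to limit forgetting without stored exemplars.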

[LG-25] Identifying Information from Observations with Uncertainty and Novelty

Link: https://arxiv.org/abs/2501.09331
Authors: Derek S. Prijatelj(1),Timothy J. Ireland(2),Walter J. Scheirer(1) ((1) University of Notre Dame, (2) Independent Researcher)
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 43 pages, 1 figure, 1 table, and 2 inline algorithms. Submitted to JMLR Jan. 6, 2025

Abstract:A machine learning task from observations must encounter and process uncertainty and novelty, especially when it is expected to maintain performance when observing new information and to choose the best fitting hypothesis to the currently observed information. In this context, some key questions arise: what is information, how much information did the observations provide, how much information is required to identify the data-generating process, how many observations remain to get that information, and how does a predictor determine that it has observed novel information? This paper strengthens existing answers to these questions by formalizing the notion of “identifiable information” that arises from the language used to express the relationship between distinct states. Model identifiability and sample complexity are defined via computation of an indicator function over a set of hypotheses. Their properties and asymptotic statistics are described for data-generating processes ranging from deterministic processes to ergodic stationary stochastic processes. This connects the notion of identifying information in finite steps with asymptotic statistics and PAC-learning. The indicator function’s computation naturally formalizes novel information and its identification from observations with respect to a hypothesis set. We also prove that the sample complexity distribution of computable PAC-Bayes learners is determined by its moments in terms of the prior probability distribution over a fixed finite hypothesis set.

[LG-26] Cooperative Decentralized Backdoor Attacks on Vertical Federated Learning

Link: https://arxiv.org/abs/2501.09320
Authors: Seohyun Lee,Wenzhi Fang,Anindya Bijoy Das,Seyyedali Hosseinalipour,David J. Love,Christopher G. Brinton
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: This paper is currently under review in the IEEE/ACM Transactions on Networking Special Issue on AI and Networking

Abstract:Federated learning (FL) is vulnerable to backdoor attacks, where adversaries alter model behavior on target classification labels by embedding triggers into data samples. While these attacks have received considerable attention in horizontal FL, they are less understood for vertical FL (VFL), where devices hold different features of the samples, and only the server holds the labels. In this work, we propose a novel backdoor attack on VFL which (i) does not rely on gradient information from the server and (ii) considers potential collusion among multiple adversaries for sample selection and trigger embedding. Our label inference model augments variational autoencoders with metric learning, which adversaries can train locally. A consensus process over the adversary graph topology determines which datapoints to poison. We further propose methods for trigger splitting across the adversaries, with an intensity-based implantation scheme skewing the server towards the trigger. Our convergence analysis reveals the impact of backdoor perturbations on VFL indicated by a stationarity gap for the trained model, which we verify empirically as well. We conduct experiments comparing our attack with recent backdoor VFL approaches, finding that ours obtains significantly higher success rates for the same main task performance despite not using server information. Additionally, our results verify the impact of collusion on attack performance.

[LG-27] Physics-informed deep learning for infectious disease forecasting

Link: https://arxiv.org/abs/2501.09298
Authors: Ying Qian,Éric Marty,Avranil Basu,Eamon B. O’Dea,Xianqiao Wang,Spencer Fox,Pejman Rohani,John M. Drake,He Li
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Accurate forecasting of contagious illnesses has become increasingly important to public health policymaking, and better prediction could prevent the loss of millions of lives. To better prepare for future pandemics, it is essential to improve forecasting methods and capabilities. In this work, we propose a new infectious disease forecasting model based on physics-informed neural networks (PINNs), an emerging area of scientific machine learning. The proposed PINN model incorporates dynamical systems representations of disease transmission into the loss function, thereby assimilating epidemiological theory and data using neural networks (NNs). Our approach is designed to prevent model overfitting, which often occurs when training deep learning models with observation data alone. In addition, we employ an additional sub-network to account for mobility, vaccination, and other covariates that influence the transmission rate, a key parameter in the compartment model. To demonstrate the capability of the proposed model, we examine the performance of the model using state-level COVID-19 data in California. Our simulation results show that the PINN model’s predictions of the number of cases, deaths, and hospitalizations are consistent with existing benchmarks. In particular, the PINN model outperforms the basic NN model and a naive baseline forecast. We also show that the performance of the PINN model is comparable to a sophisticated Gaussian infection state space with time dependence (GISST) forecasting model that integrates the compartment model with a data observation model and a regression model for inferring parameters in the compartment model. Nonetheless, the PINN model offers a simpler structure and is easier to implement. Our results show that the proposed forecaster could potentially serve as a new computational tool to enhance the current capacity of infectious disease forecasting.
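The core idea in the abstract above, adding the residual of a compartment model to the training loss, can be sketched as follows. Several things here are assumptions of this sketch and not details from the paper: the compartment model is taken to be a plain SIR system, time derivatives are approximated with finite differences via `np.gradient` (a real PINN would differentiate the network output with automatic differentiation), and the function names and the `weight` parameter are hypothetical.

```python
import numpy as np

def sir_rhs(S, I, R, beta, gamma):
    """Right-hand side of the SIR equations with transmission rate beta."""
    N = S + I + R
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    dR = gamma * I
    return dS, dI, dR

def physics_informed_loss(t, S, I, R, beta, gamma, I_obs, weight=1.0):
    """Data misfit on infections plus the mean squared ODE residual."""
    data_loss = np.mean((I - I_obs) ** 2)
    # Finite-difference time derivatives of the candidate trajectories.
    dS_dt, dI_dt, dR_dt = (np.gradient(x, t) for x in (S, I, R))
    fS, fI, fR = sir_rhs(S, I, R, beta, gamma)
    residual = np.mean((dS_dt - fS) ** 2 + (dI_dt - fI) ** 2 + (dR_dt - fR) ** 2)
    return data_loss + weight * residual
```

A trajectory that fits the data but violates the epidemic dynamics is penalised by the residual term, which is how the physics constraint discourages overfitting to observations alone.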

[LG-28] Free-Knots Kolmogorov-Arnold Network: On the Analysis of Spline Knots and Advancing Stability

Link: https://arxiv.org/abs/2501.09283
Authors: Liangwewi Nathan Zheng,Wei Emma Zhang,Lin Yue,Miao Xu,Olaf Maennel,Weitong Chen
Categories: Machine Learning (cs.LG)

Abstract:Kolmogorov-Arnold Neural Networks (KANs) have gained significant attention in the machine learning community. However, their implementation often suffers from poor training stability and a large number of trainable parameters. Furthermore, there is limited understanding of the behavior of the learned activation functions derived from B-splines. In this work, we analyze the behavior of KANs through the lens of spline knots and derive the lower and upper bounds for the number of knots in B-spline-based KANs. To address existing limitations, we propose a novel Free Knots KAN that enhances the performance of the original KAN while reducing the number of trainable parameters to match the trainable parameter scale of standard Multi-Layer Perceptrons (MLPs). Additionally, we introduce a new training strategy to ensure C^2 continuity of the learnable spline, resulting in smoother activations compared to the original KAN, and improve the training stability by range expansion. The proposed method is comprehensively evaluated on 8 datasets spanning various domains, including image, text, time series, multimodal, and function approximation tasks. The promising results demonstrate the feasibility of KAN-based networks and the effectiveness of the proposed method.

[LG-29] Task Vectors in In-Context Learning: Emergence, Formation, and Benefit

Link: https://arxiv.org/abs/2501.09240
Authors: Liu Yang,Ziqian Lin,Kangwook Lee,Dimitris Papailiopoulos,Robert Nowak
Categories: Machine Learning (cs.LG)

Abstract:In-context learning is a remarkable capability of transformers, referring to their ability to adapt to specific tasks based on a short history or context. Previous research has found that task-specific information is locally encoded within models, though their emergence and functionality remain unclear due to opaque pre-training processes. In this work, we investigate the formation of task vectors in a controlled setting, using models trained from scratch on synthetic datasets. Our findings confirm that task vectors naturally emerge under certain conditions, but the tasks may be relatively weakly and/or non-locally encoded within the model. To promote strong task vectors encoded at a prescribed location within the model, we propose an auxiliary training mechanism based on a task vector prompting loss (TVP-loss). This method eliminates the need to search for task-correlated encodings within the trained model and demonstrably improves robustness and generalization.

[LG-30] Mono-Forward: Backpropagation-Free Algorithm for Efficient Neural Network Training Harnessing Local Errors

Link: https://arxiv.org/abs/2501.09238
Authors: James Gong,Bruce Li,Waleed Abdulla
Categories: Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Backpropagation is the standard method for achieving state-of-the-art accuracy in neural network training, but it often imposes high memory costs and lacks biological plausibility. In this paper, we introduce the Mono-Forward algorithm, a purely local layerwise learning method inspired by Hinton’s Forward-Forward framework. Unlike backpropagation, Mono-Forward optimizes each layer solely with locally available information, eliminating the reliance on global error signals. We evaluated Mono-Forward on multi-layer perceptrons and convolutional neural networks across multiple benchmarks, including MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. The test results show that Mono-Forward consistently matches or surpasses the accuracy of backpropagation across all tasks, with significantly reduced and more even memory usage, better parallelizability, and a comparable convergence rate.

[LG-31] Tessellated Linear Model for Age Prediction from Voice

Link: https://arxiv.org/abs/2501.09229
Authors: Dareen Alharthi,Mahsa Zamani,Bhiksha Raj,Rita Singh
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Voice biometric tasks, such as age estimation, require modeling the often complex relationship between voice features and the biometric variable. While deep learning models can handle such complexity, they typically require large amounts of accurately labeled data to perform well. Such data are often scarce for biometric tasks such as voice-based age prediction. On the other hand, simpler models like linear regression can work with smaller datasets but often fail to generalize to the underlying non-linear patterns present in the data. In this paper, we propose the Tessellated Linear Model (TLM), a piecewise linear approach that combines the simplicity of linear models with the capacity of non-linear functions. TLM tessellates the feature space into convex regions and fits a linear model within each region. We optimize the tessellation and the linear models using a hierarchical greedy partitioning. We evaluated TLM on the TIMIT dataset on the task of age prediction from voice, where it outperformed state-of-the-art deep learning models.
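The piecewise-linear idea in the abstract above (split the feature space into convex regions and fit a linear model in each) can be illustrated with a single greedy split. This is a simplified sketch and not the paper's algorithm: it performs only one level of partitioning, uses axis-aligned thresholds (each side is a half-space, hence convex), and picks the split that minimises the summed squared error of the two linear fits. All function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear fit with intercept; returns (coefficients, SSE)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((A @ coef - y) ** 2))
    return coef, sse

def best_split(X, y, min_leaf=5):
    """Greedy search for the axis-aligned split with the lowest total SSE."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left = X[:, j] <= thr
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            coef_l, sse_l = fit_linear(X[left], y[left])
            coef_r, sse_r = fit_linear(X[~left], y[~left])
            if best is None or sse_l + sse_r < best[0]:
                best = (sse_l + sse_r, j, thr, coef_l, coef_r)
    return best

def predict(X, split):
    """Apply the linear model of whichever region each row falls into."""
    _, j, thr, coef_l, coef_r = split
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.where(X[:, j] <= thr, A @ coef_l, A @ coef_r)
```

Applying `best_split` recursively inside each region would give a hierarchical partition in the spirit of the abstract's "hierarchical greedy partitioning".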

[LG-32] Testing Noise Assumptions of Learning Algorithms

Link: https://arxiv.org/abs/2501.09189
Authors: Surbhi Goel,Adam R. Klivans,Konstantinos Stavropoulos,Arsen Vasilyan
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)

Abstract:We pose a fundamental question in computational learning theory: can we efficiently test whether a training set satisfies the assumptions of a given noise model? This question has remained unaddressed despite decades of research on learning in the presence of noise. In this work, we show that this task is tractable and present the first efficient algorithm to test various noise assumptions on the training data. To model this question, we extend the recently proposed testable learning framework of Rubinfeld and Vasilyan (2023) and require a learner to run an associated test that satisfies the following two conditions: (1) whenever the test accepts, the learner outputs a classifier along with a certificate of optimality, and (2) the test must pass for any dataset drawn according to a specified modeling assumption on both the marginal distribution and the noise model. We then consider the problem of learning halfspaces over Gaussian marginals with Massart noise (where each label can be flipped with probability less than 1/2 depending on the input features), and give a fully-polynomial time testable learning algorithm. We also show a separation between the classical setting of learning in the presence of structured noise and testable learning. In fact, for the simple case of random classification noise (where each label is flipped with fixed probability \eta = 1/2), we show that testable learning requires super-polynomial time while classical learning is trivial.

[LG-33] Enhancing Graph Representation Learning with Localized Topological Features

Link: https://arxiv.org/abs/2501.09178
Authors: Zuoyu Yan, Qi Zhao, Ze Ye, Tengfei Ma, Liangcai Gao, Zhi Tang, Yusu Wang, Chao Chen
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Accepted in JMLR 2025

Abstract: Representation learning on graphs is a fundamental problem that can be crucial in various tasks. Graph neural networks, the dominant approach for graph representation learning, are limited in their representation power. Therefore, it can be beneficial to explicitly extract and incorporate high-order topological and geometric information into these models. In this paper, we propose a principled approach to extract the rich connectivity information of graphs based on the theory of persistent homology. Our method utilizes the topological features to enhance the representation learning of graph neural networks and achieves state-of-the-art performance on various node classification and link prediction benchmarks. We also explore the option of end-to-end learning of the topological features, i.e., treating topological computation as a differentiable operator during learning. Our theoretical analysis and empirical study provide insights and potential guidelines for employing topological features in graph learning tasks.

[LG-34] Towards Federated Multi-Armed Bandit Learning for Content Dissemination using Swarm of UAVs

Link: https://arxiv.org/abs/2501.09146
Authors: Amit Kumar Bhuyan, Hrishikesh Dutta, Subir Biswas
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 25 pages, 11 figures, 1 table, 4 algorithms, journal

Abstract: This paper introduces an Unmanned Aerial Vehicle (UAV)-enabled content management architecture that is suitable for critical content access in communities of users that are communication-isolated during diverse types of disaster scenarios. The proposed architecture leverages a hybrid network of stationary anchor UAVs and mobile Micro-UAVs for ubiquitous content dissemination. The anchor UAVs are equipped with both vertical and lateral communication links, and they serve local users, while the mobile micro-ferrying UAVs extend coverage across communities with increased mobility. The focus is on developing a content dissemination system that dynamically learns optimal caching policies to maximize content availability. The core innovation is an adaptive content dissemination framework based on distributed Federated Multi-Armed Bandit learning. The goal is to optimize UAV content caching decisions based on geo-temporal content popularity and user demand variations. A Selective Caching Algorithm is also introduced to reduce redundant content replication by incorporating inter-UAV information sharing. This method strategically preserves the uniqueness in user preferences while amalgamating the intelligence across a distributed learning system. This approach improves the learning algorithm’s ability to adapt to diverse user preferences. Functional verification and performance evaluation confirm the proposed architecture’s utility across different network sizes, UAV swarms, and content popularity patterns.

[LG-35] Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks

Link: https://arxiv.org/abs/2501.09137
Authors: Pierfrancesco Beneventano, Blake Woodworth
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 23 pages, 3 figures

Abstract: We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize of about 2/\textrm{sharpness}. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade-off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the “Edge of Stability”, which induces additional regularization by delaying convergence and may have implications for training more complex models.
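
As a rough illustration of the claim that large-stepsize GD ends at a lower-norm, flatter minimum than gradient flow, the sketch below runs GD on the simplest instance, a width-1 network with loss 0.5 * (a*b - y)^2. Everything concrete here is an assumption made for illustration and not the paper's experiment: the scalar weights, the target y = 1, the initial point, both stepsizes, and the use of a tiny stepsize as a stand-in for gradient flow. At a global minimum of this toy loss the sharpness (largest Hessian eigenvalue) equals a^2 + b^2.

```python
def run_gd(a, b, y, step, iters):
    """Gradient descent on L(a, b) = 0.5 * (a*b - y)**2 for scalar weights."""
    for _ in range(iters):
        r = a * b - y  # residual of the current prediction
        # Simultaneous update: both right-hand sides use the old (a, b).
        a, b = a - step * r * b, b - step * r * a
    return a, b


def sharpness_at_minimum(a, b):
    """Largest Hessian eigenvalue of the toy loss at a point with a*b == y."""
    return a * a + b * b


y = 1.0
a0, b0 = 2.0, 0.1  # deliberately unbalanced initialization (hypothetical)

# A tiny stepsize approximates gradient flow; 0.3 is large but still
# below 2 / sharpness for this run, so GD remains stable.
flow_a, flow_b = run_gd(a0, b0, y, step=1e-3, iters=20_000)
gd_a, gd_b = run_gd(a0, b0, y, step=0.3, iters=200)

print("near-flow sharpness:", sharpness_at_minimum(flow_a, flow_b))
print("large-step sharpness:", sharpness_at_minimum(gd_a, gd_b))
```

In this toy case each GD step multiplies the imbalance a^2 - b^2 by 1 - step^2 * r^2, so larger steps shrink the imbalance and the run settles at a more balanced, flatter minimum than the near-flow run.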

[LG-36] Multi-Class Traffic Assignment using Multi-View Heterogeneous Graph Attention Networks

Link: https://arxiv.org/abs/2501.09117
Authors: Tong Liu, Hadi Meidani
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures

Abstract: Solving the traffic assignment problem for large networks is computationally challenging when conventional optimization-based methods are used. In our research, we develop an innovative surrogate model for traffic assignment when multi-class vehicles are involved. We do so by employing heterogeneous graph neural networks which use a multiple-view graph attention mechanism tailored to different vehicle classes, along with additional links connecting origin-destination pairs. We also integrate the node-based flow conservation law into the loss function. As a result, our model adheres to flow conservation while delivering highly accurate predictions for link flows and utilization ratios. Through numerical experiments conducted on urban transportation networks, we demonstrate that our model surpasses traditional neural network approaches in convergence speed and predictive accuracy in both user equilibrium and system optimal versions of traffic assignment.

[LG-37] Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach

Link: https://arxiv.org/abs/2501.09107
Authors: Alireza Ghaffari, Sharareh Younesian, Boxing Chen, Vahid Partovi Nia, Masoud Asgharian
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As Large Language Models (LLMs) become increasingly computationally complex, developing efficient deployment strategies, such as quantization, becomes crucial. State-of-the-art Post-training Quantization (PTQ) techniques often rely on calibration processes to maintain the accuracy of these models. However, while these calibration techniques can enhance performance in certain domains, they may not be as effective in others. This paper aims to draw attention to robust statistical approaches that can mitigate such issues. We propose a weight-adaptive PTQ method that can be considered a precursor to calibration-based PTQ methods, guiding the quantization process to preserve the distribution of weights by minimizing the Kullback-Leibler divergence between the quantized weights and the originally trained weights. This minimization ensures that the quantized model retains the Shannon information content of the original model to a great extent, guaranteeing robust and efficient deployment across many tasks. As such, our proposed approach can perform on par with most common calibration-based PTQ methods, establishing a new pre-calibration step for further adjusting the quantized weights with calibration. We show that our pre-calibration results achieve the same accuracy as some existing calibration-based PTQ methods on various LLMs.

[LG-38] Similarity-Quantized Relative Difference Learning for Improved Molecular Activity Prediction

Link: https://arxiv.org/abs/2501.09103
Authors: Karina Zadorozhny, Kangway V. Chuang, Bharath Sathappan, Ewan Wallace, Vishnu Sresht, Colin A. Grambow
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Accurate prediction of molecular activities is crucial for efficient drug discovery, yet remains challenging due to limited and noisy datasets. We introduce Similarity-Quantized Relative Learning (SQRL), a learning framework that reformulates molecular activity prediction as relative difference learning between structurally similar pairs of compounds. SQRL uses precomputed molecular similarities to enhance training of graph neural networks and other architectures, and significantly improves accuracy and generalization in low-data regimes common in drug discovery. We demonstrate its broad applicability and real-world potential through benchmarking on public datasets as well as proprietary industry data. Our findings demonstrate that leveraging similarity-aware relative differences provides an effective paradigm for molecular activity prediction.

[LG-39] Physics-Informed Machine Learning for Microscale Drying of Plant-Based Foods: A Systematic Review of Computational Models and Experimental Insights

Link: https://arxiv.org/abs/2501.09034
Authors: C. P. Batuwatta-Gamage (1), H. Jeong (1), HCP Karunasena (1 and 3), M. A. Karim (1), C.M. Rathnayaka (1 and 2), Y.T. Gu (1) ((1) Queensland University of Technology (QUT), Australia; (2) University of the Sunshine Coast (UniSC), Australia; (3) University of Ruhuna, Sri Lanka)
Categories: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Computational Physics (physics.comp-ph)
Comments:

Abstract:This review examines the current state of research on microscale cellular changes during the drying of plant-based food materials (PBFM), with particular emphasis on computational modelling approaches. The review addresses the critical need for advanced computational methods in microscale investigations. We systematically analyse experimental studies in PBFM drying, highlighting their contributions and limitations in capturing cellular-level phenomena, including challenges in data acquisition and measurement accuracy under varying drying conditions. The evolution of computational models for microstructural investigations is thoroughly examined, from traditional numerical methods to contemporary state-of-the-art approaches, with specific focus on their ability to handle the complex, nonlinear properties of plant cellular materials. Special attention is given to the emergence of data-driven models and their limitations in predicting microscale cellular behaviour during PBFM drying, particularly addressing challenges in dataset acquisition and model generalization. The review provides an in-depth analysis of Physics-Informed Machine Learning (PIML) frameworks, examining their theoretical foundations, current applications in related fields, and unique advantages in combining physical principles with neural network architectures. Through this comprehensive assessment, we identify critical gaps in existing methodologies, evaluate the trade-offs between different modelling approaches, and provide insights into future research directions for improving our understanding of cellular-level transformations during PBFM drying processes. The review concludes with recommendations for integrating experimental and computational approaches to advance the field of food preservation technology.

[LG-40] Random Subspace Cubic-Regularization Methods with Applications to Low-Rank Functions

Link: https://arxiv.org/abs/2501.09734
Authors: Coralia Cartis, Zhen Shao, Edward Tansley
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:We propose and analyze random subspace variants of the second-order Adaptive Regularization using Cubics (ARC) algorithm. These methods iteratively restrict the search space to some random subspace of the parameters, constructing and minimizing a local model only within this subspace. Thus, our variants only require access to (small-dimensional) projections of first- and second-order problem derivatives and calculate a reduced step inexpensively. Under suitable assumptions, the ensuing methods maintain the optimal first-order, and second-order, global rates of convergence of (full-dimensional) cubic regularization, while showing improved scalability both theoretically and numerically, particularly when applied to low-rank functions. When applied to the latter, our adaptive variant naturally adapts the subspace size to the true rank of the function, without knowing it a priori.

[LG-41] Predictions as Surrogates: Revisiting Surrogate Outcomes in the Age of AI

Link: https://arxiv.org/abs/2501.09731
Authors: Wenlong Ji, Lihua Lei, Tijana Zrnic
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract: We establish a formal connection between the decades-old surrogate outcome model in biostatistics and economics and the emerging field of prediction-powered inference (PPI). The connection treats predictions from pre-trained models, prevalent in the age of AI, as cost-effective surrogates for expensive outcomes. Building on the surrogate outcomes literature, we develop recalibrated prediction-powered inference, a more efficient approach to statistical inference than existing PPI proposals. Our method departs from the existing proposals by using flexible machine learning techniques to learn the optimal “imputed loss” through a step we call recalibration. Importantly, the method always improves upon the estimator that relies solely on the data with available true outcomes, even when the optimal imputed loss is estimated imperfectly, and it achieves the smallest asymptotic variance among PPI estimators if the estimate is consistent. Computationally, our optimization objective is convex whenever the loss function that defines the target parameter is convex. We further analyze the benefits of recalibration, both theoretically and numerically, in several common scenarios where machine learning predictions systematically deviate from the outcome of interest. We demonstrate significant gains in effective sample size over existing PPI proposals via three applications leveraging state-of-the-art machine learning/AI models.
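
To make the idea of predictions acting as cheap surrogates concrete, here is a minimal sketch of the baseline PPI estimator for a mean: average the predictions on a large unlabeled set, then correct them with the average prediction error observed on a small labeled set. This is the plain baseline and not the recalibrated method the abstract proposes. The function name `ppi_mean` and the toy numbers are hypothetical.

```python
from statistics import mean


def ppi_mean(y_labeled, preds_labeled, preds_unlabeled):
    """Baseline prediction-powered estimate of E[Y].

    The 'rectifier' is the average error of the predictions on the labeled
    data; adding it removes systematic bias from the unlabeled average.
    """
    rectifier = mean(y - f for y, f in zip(y_labeled, preds_labeled))
    return mean(preds_unlabeled) + rectifier


# Toy data (hypothetical): the model over-predicts by 0.5 on labeled points.
y_labeled = [1.0, 2.0, 3.0]
preds_labeled = [1.5, 2.5, 3.5]
preds_unlabeled = [2.5, 3.5, 4.5, 5.5]

estimate = ppi_mean(y_labeled, preds_labeled, preds_unlabeled)
print(estimate)
```

Here the unlabeled predictions average 4.0 and the rectifier is -0.5, so the estimate moves down to account for the systematic over-prediction.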

[LG-42] Rough kernel hedging

Link: https://arxiv.org/abs/2501.09683
Authors: Nicola Muca Cirone, Cristopher Salvi
Categories: Functional Analysis (math.FA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Building on the functional-analytic framework of operator-valued kernels and un-truncated signature kernels, we propose a scalable, provably convergent signature-based algorithm for a broad class of high-dimensional, path-dependent hedging problems. We make minimal assumptions about market dynamics by modelling them as general geometric rough paths, yielding a fully model-free approach. Furthermore, through a representer theorem, we provide theoretical guarantees on the existence and uniqueness of a global minimum for the resulting optimization problem and derive an analytic solution under highly general loss functions. Similar to the popular deep hedging approach, but in a more rigorous fashion, our method can also incorporate additional features via the underlying operator-valued kernel, such as trading signals, news analytics, and past hedging decisions, closely aligning with true machine-learning practice.

[LG-43] Towards Spectral Convergence of Locally Linear Embedding on Manifolds with Boundary

Link: https://arxiv.org/abs/2501.09572
Authors: Andrew Lyons
Categories: Analysis of PDEs (math.AP); Machine Learning (cs.LG)
Comments: 26 pages, 7 figures; the author welcomes all comments

Abstract:We study the eigenvalues and eigenfunctions of a differential operator that governs the asymptotic behavior of the unsupervised learning algorithm known as Locally Linear Embedding when a large data set is sampled from an interval or disc. In particular, the differential operator is of second order, mixed-type, and degenerates near the boundary. We show that a natural regularity condition on the eigenfunctions imposes a consistent boundary condition and use the Frobenius method to estimate pointwise behavior. We then determine the limiting sequence of eigenvalues analytically and compare them to numerical predictions. Finally, we propose a variational framework for determining eigenvalues on other compact manifolds.

[LG-44] Multi-task deep-learning for sleep event detection and stage classification

Link: https://arxiv.org/abs/2501.09519
Authors: Adriana Anido-Alonso, Diego Alvarez-Estevez
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Abstract: Polysomnographic sleep analysis is the standard clinical method to accurately diagnose and treat sleep disorders. It is an intricate process which involves the manual identification, classification, and location of multiple sleep event patterns. The task is complex because identifying different types of events involves focusing on different subsets of signals, resulting in an iterative, time-consuming process entailing several visual analysis passes. In this paper, we propose a multi-task deep-learning approach for the simultaneous detection of sleep events and hypnogram construction in one single pass. Taking as reference state-of-the-art methodology for object detection in the field of Computer Vision, we reformulate the problem for the analysis of multi-variate time sequences, and more specifically for pattern detection in the sleep analysis scenario. We investigate the performance of the resulting method in identifying different assembly combinations of EEG arousals, respiratory events (apneas and hypopneas) and sleep stages, also considering different input signal montage configurations. Furthermore, we evaluate our approach using two independent datasets, assessing true-generalization effects involving local and external validation scenarios. Based on our results, we analyze and discuss our method’s capabilities and its potential wide-range applicability across different settings and datasets.

[LG-45] Estimating shared subspace with AJIVE: the power and limitation of multiple data matrices

Link: https://arxiv.org/abs/2501.09336
Authors: Yuepeng Yang, Cong Ma
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract: Integrative data analysis often requires disentangling joint and individual variations across multiple datasets, a challenge commonly addressed by the Joint and Individual Variation Explained (JIVE) model. While numerous methods have been developed to estimate the shared subspace under JIVE, the theoretical understanding of their performance remains limited, particularly in the context of multiple matrices and varying levels of subspace misalignment. This paper bridges this gap by providing a systematic analysis of shared subspace estimation in multi-matrix settings. We focus on the Angle-based Joint and Individual Variation Explained (AJIVE) method, a two-stage spectral approach, and establish new performance guarantees that uncover its strengths and limitations. Specifically, we show that in high signal-to-noise ratio (SNR) regimes, AJIVE’s estimation error decreases with the number of matrices, demonstrating the power of multi-matrix integration. Conversely, in low-SNR settings, AJIVE exhibits a non-diminishing error, highlighting fundamental limitations. To complement these results, we derive minimax lower bounds, showing that AJIVE achieves optimal rates in high-SNR regimes. Furthermore, we analyze an oracle-aided spectral estimator to demonstrate that the non-diminishing error in low-SNR scenarios is a fundamental barrier. Extensive numerical experiments corroborate our theoretical findings, providing insights into the interplay between SNR, matrix count, and subspace misalignment.

[LG-46] On the convergence of noisy Bayesian Optimization with Expected Improvement

Link: https://arxiv.org/abs/2501.09262
Authors: Jingyi Wang, Haowei Wang, Cosmin G. Petra, Nai-Yuan Chiang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract: Expected improvement (EI) is one of the most widely used acquisition functions in Bayesian optimization (BO). Despite its proven success in applications for decades, important open questions remain on the theoretical convergence behaviors and rates for EI. In this paper, we contribute to the convergence theories of EI in three novel and critical areas. First, we consider objective functions that are under the Gaussian process (GP) prior assumption, whereas existing works mostly focus on functions in the reproducing kernel Hilbert space (RKHS). Second, we establish the first asymptotic error bound and its corresponding rate for GP-EI with noisy observations under the GP prior assumption. Third, by investigating the exploration and exploitation of the non-convex EI function, we prove improved error bounds for both the noise-free and noisy cases. The improved noiseless bound is extended to the RKHS assumption as well.
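
For readers unfamiliar with EI, the sketch below evaluates the standard closed-form expression at a single candidate point whose GP posterior is Gaussian with mean `mu` and standard deviation `sigma`. It assumes a maximization convention and takes the incumbent value `best` as given. How the incumbent is defined under noisy observations varies between formulations and is not taken from the paper. The function names are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization under a N(mu, sigma**2) posterior."""
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    # First term rewards a high mean (exploitation), second term rewards
    # high uncertainty (exploration).
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)


ei_at_incumbent = expected_improvement(mu=1.0, sigma=1.0, best=1.0)
print(ei_at_incumbent)
```

When the posterior mean equals the incumbent, the first term vanishes and EI reduces to `sigma` times the standard normal density at zero, so all the value comes from uncertainty.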

[LG-47] Generative AI Takes a Statistics Exam: A Comparison of Performance between ChatGPT 3.5, ChatGPT4, and ChatGPT4o-mini

Link: https://arxiv.org/abs/2501.09171
Authors: Monnie McGee, Bivin Sadler
Categories: Other Statistics (stat.OT); Machine Learning (cs.LG)
Comments: 24 pages, 2 figures, 3 tables. Submitted for publication August, 2024; revision submitted January 2025

Abstract: Many believe that the use of generative AI as a private tutor has the potential to shrink access and achievement gaps between students and schools with abundant resources versus those with fewer resources. Shrinking the gap is possible only if paid and free versions of the platforms perform with the same accuracy. In this experiment, we investigate the performance of GPT versions 3.5, 4.0, and 4o-mini on the same 16-question statistics exam given to a class of first-year graduate students. While we do not advocate using any generative AI platform to complete an exam, the use of exam questions allows us to explore aspects of ChatGPT’s responses to typical questions that students might encounter in a statistics course. Results on accuracy indicate that GPT 3.5 would fail the exam, GPT4 would perform well, and GPT4o-mini would perform somewhere in between. While we acknowledge the existence of other Generative AI/LLMs, our discussion concerns only ChatGPT because it is the most widely used platform on college campuses at this time. We further investigate differences among the AI platforms in the answers for each problem using methods developed for text analytics, such as reading level evaluation and topic modeling. Results indicate that GPT3.5 and 4o-mini are more similar to each other than either of them is to GPT4.

[LG-48] Generative diffusion model with inverse renormalization group flows

Link: https://arxiv.org/abs/2501.09064
Authors: Kanta Masuki, Yuto Ashida
Categories: Statistical Mechanics (cond-mat.stat-mech); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Biological Physics (physics.bio-ph)
Comments: 9+21 pages, 4+11 figures. The code and trained models are available at this https URL

Abstract: Diffusion models represent a class of generative models that produce data by denoising a sample corrupted by white noise. Despite the success of diffusion models in computer vision, audio synthesis, and point cloud generation, so far they overlook inherent multiscale structures in data and have a slow generation process due to many iteration steps. In physics, the renormalization group offers a fundamental framework for linking different scales and giving an accurate coarse-grained model. Here we introduce a renormalization group-based diffusion model that leverages the multiscale nature of data distributions to realize high-quality data generation. In the spirit of renormalization group procedures, we define a flow equation that progressively erases data information from fine-scale details to coarse-grained structures. Through reversing the renormalization group flows, our model is able to generate high-quality samples in a coarse-to-fine manner. We validate the versatility of the model through applications to protein structure prediction and image generation. Our model consistently outperforms conventional diffusion models across standard evaluation metrics, enhancing sample quality and/or accelerating sampling speed by an order of magnitude. The proposed method alleviates the need for data-dependent tuning of hyperparameters in generative diffusion models, showing promise for systematically increasing sample efficiency based on the concept of the renormalization group.

[LG-49] Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks

Link: https://arxiv.org/abs/2501.09052
Authors: Shuang Cui, Yi Li, Jiangmeng Li, Xiongxin Tang, Bing Su, Fanjiang Xu, Hui Xiong
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments:

Abstract:Single image defocus deblurring (SIDD) aims to restore an all-in-focus image from a defocused one. Distribution shifts in defocused images generally lead to performance degradation of existing methods during out-of-distribution inferences. In this work, we gauge the intrinsic reason behind the performance degradation, which is identified as the heterogeneity of lens-specific point spread functions. Empirical evidence supports this finding, motivating us to employ a continual test-time adaptation (CTTA) paradigm for SIDD. However, traditional CTTA methods, which primarily rely on entropy minimization, cannot sufficiently explore task-dependent information for pixel-level regression tasks like SIDD. To address this issue, we propose a novel Siamese networks-based continual test-time adaptation framework, which adapts source models to continuously changing target domains only requiring unlabeled target data in an online manner. To further mitigate semantically erroneous textures introduced by source SIDD models under severe degradation, we revisit the learning paradigm through a structural causal model and propose Causal Siamese networks (CauSiam). Our method leverages large-scale pre-trained vision-language models to derive discriminative universal semantic priors and integrates these priors into Siamese networks, ensuring causal identifiability between blurry inputs and restored images. Extensive experiments demonstrate that CauSiam effectively improves the generalization performance of existing SIDD methods in continuously changing domains.

[LG-50] Generative Models with ELBOs Converging to Entropy Sums

Link: https://arxiv.org/abs/2501.09022
Authors: Jan Warnken, Dmytro Velychko, Simon Damm, Asja Fischer, Jörg Lücke
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Comments: 16 pages

Abstract:The evidence lower bound (ELBO) is one of the most central objectives for probabilistic unsupervised learning. For the ELBOs of several generative models and model classes, we here prove convergence to entropy sums. As one result, we provide a list of generative models for which entropy convergence has been shown, so far, along with the corresponding expressions for entropy sums. Our considerations include very prominent generative models such as probabilistic PCA, sigmoid belief nets or Gaussian mixture models. However, we treat more models and entire model classes such as general mixtures of exponential family distributions. Our main contributions are the proofs for the individual models. For each given model we show that the conditions stated in Theorem 1 or Theorem 2 of [arXiv:2209.03077] are fulfilled such that by virtue of the theorems the given model’s ELBO is equal to an entropy sum at all stationary points. The equality of the ELBO at stationary points applies under realistic conditions: for finite numbers of data points, for model/data mismatches, at any stationary point including saddle points etc, and it applies for any well behaved family of variational distributions.

Information Retrieval

[IR-0] Evaluating Conversational Recommender Systems with Large Language Models: A User-Centric Evaluation Framework

Link: https://arxiv.org/abs/2501.09493
Authors: Nuo Chen, Quanyu Dai, Xiaoyu Dong, Xiao-Ming Wu, Zhenhua Dong
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Conversational recommender systems (CRS) involve both recommendation and dialogue tasks, which makes their evaluation a unique challenge. Although past research has analyzed various factors that may affect user satisfaction with CRS interactions from the perspective of user studies, few evaluation metrics for CRS have been proposed. Recent studies have shown that LLMs can align with human preferences, and several LLM-based text quality evaluation measures have been introduced. However, the application of LLMs in CRS evaluation remains relatively limited. To address this research gap and advance the development of user-centric conversational recommender systems, this study proposes an automated LLM-based CRS evaluation framework, building upon existing research in human-computer interaction and psychology. The framework evaluates CRS from four dimensions: dialogue behavior, language expression, recommendation items, and response content. We use this framework to evaluate four different conversational recommender systems.

[IR-1] A Multi-tiered Solution for Personalized Baggage Item Recommendations using FastText and Association Rule Mining

Link: https://arxiv.org/abs/2501.09359
Authors: Mudavath Ravi, Atul Negi
Categories: Information Retrieval (cs.IR)
Comments:

Abstract: This paper introduces an intelligent baggage item recommendation system to optimize packing for air travelers by providing tailored suggestions based on specific travel needs and destinations. Using FastText word embeddings and Association Rule Mining (ARM), the system ensures efficient luggage space utilization, compliance with weight limits, and an enhanced travel experience. The methodology comprises four phases: (1) data collection and preprocessing with pre-trained FastText embeddings for text representation and similarity scoring; (2) a content-based recommendation system enriched by user search history; (3) application of ARM to user interactions to uncover meaningful item associations; and (4) integration of FastText and ARM for accurate, personalized recommendations. Performance is evaluated using metrics such as coverage, support, confidence, lift, leverage, and conviction. Results demonstrate the system’s effectiveness in providing relevant suggestions, improving customer satisfaction, and simplifying the packing process. These insights advance personalized recommendations, targeted marketing, and product optimization in air travel and beyond.
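
The rule-quality metrics named in the abstract have standard textbook definitions, sketched below for a single rule "antecedent implies consequent" over a list of item sets. The function name `rule_metrics` and the toy packing lists are hypothetical, and coverage is left out because its definition differs between sources. This is not the paper's pipeline, only a small worked instance of the metrics.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Standard association-rule metrics for antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    # Fractions of transactions containing A, C, and both together.
    p_a = sum(a <= t for t in transactions) / n
    p_c = sum(c <= t for t in transactions) / n
    p_ac = sum((a | c) <= t for t in transactions) / n
    if p_a == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    confidence = p_ac / p_a
    return {
        "support": p_ac,
        "confidence": confidence,
        "lift": confidence / p_c if p_c > 0 else float("inf"),
        "leverage": p_ac - p_a * p_c,
        # Conviction is infinite for rules that never fail.
        "conviction": float("inf") if confidence == 1.0
        else (1.0 - p_c) / (1.0 - confidence),
    }


# Toy packing lists (hypothetical data).
transactions = [
    {"passport", "charger"},
    {"passport", "charger", "adapter"},
    {"passport"},
    {"charger", "adapter"},
    {"adapter"},
]
metrics = rule_metrics(transactions, {"passport"}, {"charger"})
print(metrics)
```

A lift above 1 and a positive leverage both indicate that the two items co-occur more often than they would if they were packed independently.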

[IR-2] Fuzzy Integration of Data Lake Tables

Link: https://arxiv.org/abs/2501.09211
Authors: Aamod Khatiwada, Roee Shraga, Renée J. Miller
Categories: Databases (cs.DB); Information Retrieval (cs.IR)
Comments:

Abstract:Data integration is an important step in any data science pipeline where the objective is to unify the information available in different datasets for comprehensive analysis. Full Disjunction, which is an associative extension of the outer join operator, has been shown to be an effective operator for integrating datasets. It fully preserves and combines the available information. Existing Full Disjunction algorithms only consider the equi-join scenario where only tuples having the same value on joining columns are integrated. This, however, does not realistically represent an open data scenario, where datasets come from diverse sources with inconsistent values (e.g., synonyms, abbreviations, etc.) and with limited metadata. So, joining just on equal values severely limits the ability of Full Disjunction to fully combine datasets. Thus, in this work, we propose an extension of Full Disjunction to also account for “fuzzy” matches among tuples. We present a novel data-driven approach to enable the joining of approximate or fuzzy matches within Full Disjunction. Experimentally, we show that fuzzy Full Disjunction does not add significant time overhead over a state-of-the-art Full Disjunction implementation and also that it enhances the integration effectiveness.
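
The difference between an equi-join and a fuzzy join can be seen in a much simpler setting than Full Disjunction: an outer join of two small tables on one key column. The sketch below is a simplification under stated assumptions, not the paper's algorithm. It uses string similarity from Python's `difflib` as a stand-in for the data-driven matching the authors describe, and the function names, the 0.8 threshold, and the toy rows are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(x, y):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, x.strip().lower(), y.strip().lower()).ratio()


def fuzzy_outer_join(left, right, key, threshold=0.8):
    """Outer-join two lists of dicts on `key`, allowing approximate matches."""
    joined, used_right = [], set()
    for lrow in left:
        # Pick the most similar unused right row that clears the threshold.
        best_j, best_score = None, threshold
        for j, rrow in enumerate(right):
            score = similarity(lrow[key], rrow[key])
            if j not in used_right and score >= best_score:
                best_j, best_score = j, score
        if best_j is None:
            joined.append(dict(lrow))  # no partner: keep the left row as-is
        else:
            used_right.add(best_j)
            joined.append({**right[best_j], **lrow})  # left values win clashes
    # Keep unmatched right rows so that no information is dropped.
    joined.extend(dict(r) for j, r in enumerate(right) if j not in used_right)
    return joined


left = [{"city": "Toronto", "population": "2.8M"},
        {"city": "Boston", "population": "0.65M"}]
right = [{"city": "Tornto", "country": "Canada"},   # misspelled key
         {"city": "Berlin", "country": "Germany"}]

result = fuzzy_outer_join(left, right, key="city")
for row in result:
    print(row)
```

An equi-join would leave "Toronto" and "Tornto" as separate rows. With the similarity threshold they merge into one row, while "Boston" and "Berlin" stay unmatched and are still preserved in the output.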

Attachment Download

Click to download the full list of today's papers
